ES vs. Sphinx

Gladius

Any direct comparisons so far? From what I've read, Sphinx is much more efficient in several (all?) aspects than ES so it makes me wonder why XF's gone with ES...
 
Yeah, I plan on it... there are still some features I want to add first... and also need to get it into the phrase system. I really just hacked it together in a few minutes as a diversion from what I'm *really* working on. :)

But yes... I will release it when I can.

That would be awesome. It would be even more awesome if you could add the number of searches, the queries, average fetch times and other relevant information to the daily statistics, so we could see how busy periods and loads compare with postings and the like :D
 
Truthfully, doing daily stats is probably more work than I'm willing to do for this... You can do it already with Google Analytics if you want (and it does it really well)...

[Image: Image 2012.04.26 3:08:51 AM.png]
 
In your Analytics, go to Admin -> Profiles -> Profile Settings -> Site Search Settings

Enable it and set the query parameter to "q" (without quotes).
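For anyone wondering what that setting actually keys on: Analytics reads the search term out of the URL's query string. A minimal sketch of the same idea, assuming a made-up search URL (the host, path and `o` parameter here are placeholders, not taken from any real site):

```python
from urllib.parse import urlparse, parse_qs


def search_term(url: str, param: str = "q"):
    """Return the first value of the given query parameter, or None if absent."""
    values = parse_qs(urlparse(url).query).get(param)
    return values[0] if values else None


# Hypothetical search results URL; only the "q" parameter matters here.
example = "http://example.com/search/1234/?q=sphinx+vs+elasticsearch&o=date"
print(search_term(example))
```

If your search URLs carry the term under a different parameter name, that name is what goes in the Site Search Settings box instead of "q".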

Oooh, thank you :) Are there any other nifty tricks you know? To be honest, all I've ever used Analytics for is to show our advertisers the levels of traffic we get, so I haven't really dug around for information like this.
 
You can do some interesting stuff like tracking social interactions... but if you want stuff beyond Google+, it's a little tricky to set up since it requires some extra JavaScript...

http://xenforo.com/community/threads/better-google-analytics.21559
 
Sphinx is lighter weight than Elasticsearch (not by much). However, Sphinx excels at simple queries, and the need for incremental indexes means you have a dead time between an item being posted and being picked up.
From what I read recently, this is not true. Elastic is a lot heavier on both resources and hardware. Personally, I don't think it is fair to post this kind of information that makes Sphinx look like a cheap product. I worked with Sphinx for years and I know this product inside and out.
Elasticsearch has the benefit of getting around this, as items are indexed in real time, and it is much more flexible when it comes to advanced queries.
So does Sphinx, with RT indices and a tenth of the resources normally used by Elastic.
Elasticsearch also comes with out-of-the-box support for distributed resources, and if one node goes down, it automatically shifts control to the other nodes. It also allows for similar documents to be compared and retrieved, something Sphinx cannot do.
So does Sphinx, with agents.
To add in, depending on your content type and mapping, ES can also have a lesser impact on your server's resources.
So far, from what I read, not true. Elastic requires 150 times more resources compared to Sphinx.
Quick example: a board with 35 million docs requires 70GB of RAM with Elastic, and only 3GB with Sphinx to produce similar results.
 
At the time of posting, on the testbed I was using, that information was pretty accurate (it was comparing a beta version of ES and a user-created Sphinx search for XF).

Now that I've ramped that up to much larger scales, my previous post was, as you pointed out, mostly incorrect.
 
I thought so and I appreciate your input. I just wanted to make sure people are aware.
Okay... I have a *little* more experience with ES now... And the more I play with it, the more I like it. It does some things that I wish Sphinx would do... for example its ability to auto-shard and replicate to other nodes (servers) is pretty seamless and awesome.
Well, for starters, you would not need multiple shards with Sphinx. OK, if you need a setup like Craigslist with 3 billion documents, just use a mix of real-time and archive indices with a master/slave scheme. It is extremely easy to maintain; they currently use 2 masters and 8 slaves to serve their search data. I'm looking forward to seeing you post real time specs and Elastic results from your 20 million document database.
 
Would also like to see DP run some comparisons, similar to http://zooie.wordpress.com/2009/07/...n-source-search-engines-and-indexing-twitter/
 
I know this is old, but I wanted to post some data now that I actually have a populated ES setup and compare it to Sphinx...

Both our Sphinx and ES setup have more than 20M documents (beyond posts and threads, we also have users, conversations, user notes, marketplace items, etc. as searchable content types in both systems)...

With Sphinx, the index takes 4.16GB. With Elastic Search it takes 7.63GB (same data... same posts, users, etc.)

ES takes quite a bit more space by default... also we don't have it indexing the _source... (more on that over here). The only other real change to ES config we made was the one over here: http://xenforo.com/community/resources/change-analyzer-for-enhanced-search.643/

On a cold start, ES is slower doing searches... once it's "warmed up" and indexes are in memory, both ES and Sphinx are more or less instant (ES has a Warmup API that is part of the next version [v0.20], which will hopefully make this a non-issue).

I like Sphinx and have worked with it extensively and done some fairly advanced stuff with it... that being said, there are some things about ES that I really like... I like that it's schema-free (like if you add a new index or something, you don't need to define it and restart anything)... You can do distributed indexes with Sphinx, but ES's sharding/auto-allocation system is really nice when dealing with more than 1 server. In addition, any ES node can handle writes, which are automatically distributed to all other nodes without any special setup/config. Say you had 10 web servers, and you ran ES on all 10 web servers: the system will just distribute parts (shards) of the indexes around as needed, and each web server could just use "localhost" as its ES server IP. Bring a new server online or take one offline, and it automatically redistributes shards as needed instantly.

That being said... ES flat out uses about twice the memory that Sphinx does. I really hate the fact that it's a Java app, rather than something natively compiled.

Some general stats from admin.php:

[Image: Image 2012.12.03 3:37:38 PM.png]


Index Size and Documents are double what the *useable* numbers are since we have 2 copies of everything. Index Time is much higher than it actually is, since I've reindexed everything a few times.

Oh, that reminds me... Sphinx is about 6x faster at bulk indexing... but that's a one-time thing, so I guess it's not TOO terrible. For myself, I ended up making a CLI-based reindexer for Elastic Search that has all nodes working on the full reindex in parallel (this means if I have 10 nodes online doing a full reindex of 20M+ documents, it takes about 14 minutes).
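The post doesn't say how the CLI reindexer divides the work between nodes, so the following is only one hypothetical way to do it: give each node a contiguous slice of document IDs and let them all run at once. The function name and the assumption of sequential IDs starting at 1 are mine, not the author's.

```python
def split_id_range(min_id: int, max_id: int, workers: int) -> list[tuple[int, int]]:
    """Split the inclusive range [min_id, max_id] into contiguous chunks, one per worker."""
    total = max_id - min_id + 1
    base, extra = divmod(total, workers)
    ranges = []
    start = min_id
    for i in range(workers):
        # The first `extra` workers take one additional ID so nothing is left over.
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue
        ranges.append((start, start + size - 1))
        start += size
    return ranges


# Figures from the post: 20M documents, 10 nodes, about 14 minutes.
chunks = split_id_range(1, 20_000_000, 10)
docs_per_second = 20_000_000 / (14 * 60)
print(chunks[0], chunks[-1], round(docs_per_second))
```

At those numbers the cluster as a whole would be indexing somewhere around 24,000 documents per second, or roughly a tenth of that per node.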

Overall, I don't prefer one over the other at this point... I wish there was a way to take the good parts of each and merge them into a single product... Smaller index sizes, faster bulk indexing and a natively compiled app from Sphinx; schema-less design and distributed shards from ES.
 
So I decided to do some testing with my multi-threaded reindexer and ran a full index a couple times. Just to see what the difference was with the exact same data, I did it once *with* _source (XF includes it in ES index by default) and once without. The _source field is a pointless bit of data here... it just stores the original document you are indexing.

With _source included, the index was 14.7GB; without it, the index ended up being 7.09GB... so less than half the size just by removing the unused _source field.
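For reference, Elasticsearch lets you turn off _source storage in the type mapping with `"_source": {"enabled": false}`. The sketch below builds that fragment and checks the size figures quoted above. The type name "post" is a placeholder, and exactly where the fragment goes depends on your ES version and on how XF creates its index, so treat it as an illustration rather than the configuration used here.

```python
import json

# Mapping fragment that disables _source storage for one document type.
# "post" is a placeholder type name, not necessarily what XF uses.
mapping = {"post": {"_source": {"enabled": False}}}
print(json.dumps(mapping))

# Index sizes reported in the post, in GB.
with_source_gb = 14.7
without_source_gb = 7.09

ratio = without_source_gb / with_source_gb
saved_gb = with_source_gb - without_source_gb
print(f"without/with = {ratio:.2f}, saved = {saved_gb:.2f} GB")
```

The ratio comes out just under 0.5, which matches the "less than half" in the post.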
 
Quick example: a board with 35 million docs requires 70GB of RAM with Elastic, and only 3GB with Sphinx to produce similar results.

It's too much, Floren.
My forum has 27M posts. 8GB is more than enough to run Elastic. Server load is around 0.x.

What I don't like about Elastic is the indexing time. It takes a day to reindex my 2xM-post board. I'm also using Google alongside Elastic (better search results).
 
How is elastic doing for you after a day of running it live?
 
About the same... it definitely is faster now that real people are actually using it (warming up the data I'm sure). I'd probably *really* like it if it wasn't built on top of Java.

I just made my 4 web servers also be ES nodes for sake of simplicity (just using localhost for the IP) with 2 copies of the data being stored for fault tolerance. Total ES data size for one copy of all the data is 8.1GB (21,894,046 searchable documents), so each node is serving up about 4GB of ES data.
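The "about 4GB per node" figure follows directly from those numbers, assuming the shards end up spread evenly across the nodes (the even spread is an assumption; ES balances shards, not bytes):

```python
one_copy_gb = 8.1        # size of one full copy of the data
copies = 2               # 2 copies kept for fault tolerance
nodes = 4                # the 4 web servers doubling as ES nodes
documents = 21_894_046   # searchable documents in one copy

per_node_gb = one_copy_gb * copies / nodes
per_node_docs = documents * copies // nodes
print(f"{per_node_gb:.2f} GB and {per_node_docs:,} documents per node")
```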

But yeah, I don't like it any more or any less... it is what it is... the sharding/nodes stuff in it is great. I just hate it being a Java app. lol
 
Heh, I've always had a "two steps forward, one step back" feeling with Java too. Congrats on switching over to XF though, that must have been an insane amount of work for you. Any plans on sharing what you've done with XF on your site as a more complete package? I'd gladly pay for it.
 
Some stuff will be shared for sure... but it's not going to be a, "Here's everything... go make a clone of my site now." situation... :)

The bigger stuff really has no way of being shared anyway... like Digital Point Ads is an entire advertising platform, so unless you have the infrastructure to support being a mini Google AdWords/AdSense.. :) But if you really want something like that, just ask Google for a copy of theirs. ;)
 
Of course, I meant in terms of general XF feature additions and things like ES stats (don't know if you've ever released that separately) since I assume you've added a boatload of stuff that was in vB but not in XF. That's really the only thing holding me back from switching a few big boards to XF, especially considering that development may or may not ever pick up here again.
 
Well one thing you might be interested in if you are doing big board migrations is I'm going to make the import system I used available... but it's really going to be a starting point... all boards are different, so it's going to need to be customized for whatever site you are doing an import on. So it's really going to be a, "take it... unsupported completely... you will need a programmer to make this work for YOU." type of deal... but the end result is you could do like a 10M post forum in about 30 minutes. :)
 