IGN's ElasticSearch _mapping

Mike Tougeron · Jan 20, 2012

At IGN we use the following _mapping for our ElasticSearch index. For roughly 3 million messages this reduced the size of our index by more than 30%.

Code:

curl -XPUT 'http://localhost:9200/xenforo_ign/post/_mapping' -d '
{
    "post" : {
        "_source" : {
            "enabled" : false
        },
        "properties" : {
            "message" : {"type" : "string", "store" : "no"},
            "title" : {"type" : "string", "store" : "no", "index" : "no"},
            "date" : {"type" : "long", "store" : "yes"},
            "user" : {"type" : "long", "store" : "yes"},
            "discussion_id" : {"type" : "long", "store" : "yes"},
            "node" : {"type" : "long", "store" : "no"},
            "prefix" : {"type" : "long", "store" : "no"},
            "thread" : {"type" : "long", "store" : "no", "index" : "no"}
        }
    }
}'
 
curl -XPUT 'http://localhost:9200/xenforo_ign/thread/_mapping' -d '
{
    "thread" : {
        "_source" : {
            "enabled" : false
        },
        "properties" : {
            "message" : {"type" : "string", "store" : "no", "index" : "no"},
            "title" : {"type" : "string", "store" : "no"},
            "date" : {"type" : "long", "store" : "yes"},
            "user" : {"type" : "long", "store" : "yes"},
            "discussion_id" : {"type" : "long", "store" : "yes"},
            "thread_id" : {"type" : "long", "store" : "yes"},
            "node" : {"type" : "long", "store" : "no"},
            "prefix" : {"type" : "long", "store" : "no"},
            "thread" : {"type" : "long", "store" : "no", "index" : "no"}
        }
    }
}'
 
$> curl 'http://localhost:9200/xenforo_ign/_settings'
{
    "xenforo_ign": {
        "settings": {
            "index.analysis.analyzer.default.language": "English", 
            "index.analysis.analyzer.default.type": "snowball", 
            "index.number_of_replicas": "1", 
            "index.number_of_shards": "5"
        }
    }
}
 
$> curl 'http://localhost:9200/xenforo_ign/_status'
{
    "_shards": {
        "failed": 0, 
        "successful": 10, 
        "total": 10
    }, 
    "indices": {
        "xenforo_ign": {
            "docs": {
                "deleted_docs": 3714, 
                "max_doc": 3437784, 
                "num_docs": 3434070
            }, 
            "index": {
                "primary_size": "1.4gb", 
                "primary_size_in_bytes": 1574012401, 
                "size": "2.9gb", 
                "size_in_bytes": 3148021951
            }, 
            "merges": {
                "current": 0, 
                "total": 3302, 
                "total_time": "33.5m", 
                "total_time_in_millis": 2010177
            }, 
            "refresh": {
                "total": 31017, 
                "total_time": "29.9m", 
                "total_time_in_millis": 1794563
            },

giorgino · Jan 20, 2012

Hi Mike, can you explain this a little more?
thx

ragtek · Jan 20, 2012

giorgino said:
Hi Mike, can you explain this a little more?
thx

Hope this link helps: http://www.elasticsearch.org/guide/reference/api/admin-indices-put-mapping.html

giorgino · Jan 20, 2012

Thank you ragtek, but this goes over my comprehension (which comprehension?)

Mike Tougeron · Jan 20, 2012

Basically by default ElasticSearch stores the entire document you send to it with the index. This means that if you have a lot of large messages, the index and its storage can grow pretty big. Since XenES doesn't actually use the message from the original document sent to ES (it uses the IDs and then loads the results from MySQL), this mapping makes it so that the extra data isn't stored inside ES. In the thread index, the message field is sent but it isn't searched, so I disabled both storing and indexing of that field.

P.S. You may have noticed I didn't include the profile posts index. That's because we are using MyIGN for that data, not XenForo.

Mike · Jan 20, 2012

Mike Tougeron said:
Basically by default ElasticSearch stores the entire document you send to it with the index.

Of course I can't find it now, but I'm pretty sure that ES doesn't store fields by default.

Mike Tougeron · Jan 20, 2012

It doesn't store the fields by default, but it does store the _source.

http://www.elasticsearch.org/guide/reference/mapping/source-field.html

The _source field is an automatically generated field that stores the actual JSON that was used as the indexed document. It is not indexed (searchable), just stored. When executing “fetch” requests, like get or search, the _source field is returned by default.

Though very handy to have around, the source field does incur storage overhead within the index.

This means that if you don't store the _source you need to explicitly store fields that you want returned. In the case of XenForo this includes fields like date, user_id, etc.
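To make that concrete, here is a minimal sketch of a search that asks only for stored fields. It assumes the same localhost:9200 host and xenforo_ign index as the commands in the first post. The search term and the `search_posts` helper name are made up for illustration, and `fields` is the request-body option ES provides for loading stored fields.

```shell
# Assumptions: ES listens on localhost:9200 and the index is xenforo_ign,
# as in the first post. "zelda" is just an example search term.
ES_HOST='http://localhost:9200'
INDEX='xenforo_ign'
SEARCH_URL="$ES_HOST/$INDEX/post/_search"

# Search the indexed "message" field, but ask only for fields that the
# mapping marks as "store" : "yes", because _source is disabled.
QUERY='{
    "query" : {
        "query_string" : { "default_field" : "message", "query" : "zelda" }
    },
    "fields" : ["date", "user", "discussion_id"]
}'

# Hypothetical helper: sends the query to the post type of the index.
search_posts() {
    curl -s -XGET "$SEARCH_URL" -d "$QUERY"
}
```

Calling `search_posts` against an index that uses this mapping should return hits carrying a `fields` object with date, user and discussion_id instead of a `_source` object. Asking for a field that is not stored, such as message, would not bring its text back.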

Mike Tougeron · Jan 20, 2012

btw, fwiw, I thought so originally too. It wasn't until I blew up our dev ES install that I did the deep dive research.

Floren · Jan 21, 2012

Mike, what is the value for Elastic MAX memory allocation (ES_MAX_MEM) on your setup? How many posts does IGN have now? I'm trying to determine the proper ratio between the number of posts and the total memory usage Elastic needs to serve search queries fast.

Can you run Siege on your test server to emulate 30,000 online users while performing some random searches in parallel? Let me know the response time you get on search queries. I presume these are the average numbers IGN gets on a regular basis, not peak time.

Mike Tougeron · Jan 24, 2012

Floren said:
Mike, what is the value for Elastic MAX memory allocation (ES_MAX_MEM) on your setup?

The eng who maintains our ES farm isn't in yet but I sent him an email to find out more about our config for you.

Floren said:
How many posts does IGN have now?

124,779 discussions & 3,370,918 posts. But when we are finished with the migration we'll have approx 70m posts.
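As a rough back-of-the-envelope only, the index statistics in the first post (primary_size_in_bytes and num_docs) can be scaled up to the 70m figure. This sketch assumes the index grows linearly with document count and that the future posts are about the same size as the current ones, neither of which is guaranteed.

```shell
# Figures copied from the index statistics in the first post.
PRIMARY_BYTES=1574012401
NUM_DOCS=3434070
# Target from this post: roughly 70 million posts after the migration.
TARGET_DOCS=70000000

# Integer arithmetic is good enough for an order-of-magnitude estimate.
BYTES_PER_DOC=$(( PRIMARY_BYTES / NUM_DOCS ))
EST_BYTES=$(( BYTES_PER_DOC * TARGET_DOCS ))
EST_GIB=$(( EST_BYTES / 1024 / 1024 / 1024 ))

echo "bytes per doc: $BYTES_PER_DOC"
echo "estimated primary size: $EST_GIB GiB"
```

That works out to about 458 bytes per document and somewhere around 30 GB of primary storage, before replicas. With index.number_of_replicas set to 1, as in the settings above, the total on disk would be about double that.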

Floren said:
Can you run Siege on your test server to emulate 30,000 online users while performing some random searches in parallel? Let me know the response time you get on search queries. I presume these are the average numbers IGN gets on a regular basis, not peak time.

I don't have a "real" performance environment right now so I can't really scale our testing accurately.

Once we're fully migrated to XenForo I'll post lots of info about what we're using, our performance stats, etc.

RobParker · May 1, 2012

I have ES up and running on our test install and the search index rebuilt.

What would I need to do to implement this? Is it literally just a case of pasting the above on the command line?

Slavik · May 1, 2012

RobParker said:
I have ES up and running on our test install and the search index rebuilt.

What would I need to do to implement this? Is it literally just a case of pasting the above on the command line?

Yup

Edited to your node obviously
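For anyone unsure what that edit looks like in practice: the part of the URL that changes is the index name (plus the host and port if ES isn't on localhost). A small sketch, where `my_forum` is a hypothetical index name standing in for whatever index your XenForo install is configured to use, and `mapping_url` is a helper name made up here:

```shell
# Assumptions: ES_HOST and INDEX are placeholders. "my_forum" is a
# hypothetical index name, not something taken from this thread.
ES_HOST='http://localhost:9200'
INDEX='my_forum'

# The mapping URL has the shape host/index/type/_mapping.
# $1 is the type, which is "post" or "thread" in the mapping above.
mapping_url() {
    printf '%s/%s/%s/_mapping\n' "$ES_HOST" "$INDEX" "$1"
}

# Print the URLs that the two PUT commands would use for this index.
mapping_url post
mapping_url thread
```

You would then run the same two PUT commands from the first post against those URLs, keeping the JSON bodies unchanged. The data directory on disk does not appear anywhere in the URL.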

RobParker · May 1, 2012

Cheers

I think I might need a little help if you don't mind.

1) Our data is stored at /elasticsearch/data/cloud-spurs/nodes/0

How would I then change http://localhost:9200/xenforo_ign/post/_mapping for our node? I don't quite see how that relates.

2) Also this seems very specific for IGN's setup:

Code:

$> curl 'http://localhost:9200/xenforo_ign/_status'
{
    "_shards": {
        "failed": 0, 
        "successful": 10, 
        "total": 10
    }, 
    "indices": {
        "xenforo_ign": {
            "docs": {
                "deleted_docs": 3714, 
                "max_doc": 3437784, 
                "num_docs": 3434070
            }, 
            "index": {
                "primary_size": "1.4gb", 
                "primary_size_in_bytes": 1574012401, 
                "size": "2.9gb", 
                "size_in_bytes": 3148021951
            }, 
            "merges": {
                "current": 0, 
                "total": 3302, 
                "total_time": "33.5m", 
                "total_time_in_millis": 2010177
            }, 
            "refresh": {
                "total": 31017, 
                "total_time": "29.9m", 
                "total_time_in_millis": 1794563
            },

Should I be using those numbers or somehow deriving my own? our elasticsearch.yml is the default (apart from setting our cluster name)

Slavik · May 1, 2012

RobParker said:
Cheers

Should I be using those numbers or somehow deriving my own? our elasticsearch.yml is the default (apart from setting our cluster name)

I'll put a guide up later tonight or tomorrow that will explain it better.

p4guru · May 5, 2012

Slavik said:
I'll put a guide up later tonight or tomorrow that will explain it better.

Thanks, much appreciated.
