IGN's ElasticSearch _mapping

Mike Tougeron

At IGN we use the following _mapping for our ElasticSearch index. With roughly 3 million messages, it reduced the size of our index by more than 30%.

Code:
curl -XPUT 'http://localhost:9200/xenforo_ign/post/_mapping' -d '
{
    "post" : {
        "_source" : {
            "enabled" : false
        },
        "properties" : {
            "message" : {"type" : "string", "store" : "no"},
            "title" : {"type" : "string", "store" : "no", "index" : "no"},
            "date" : {"type" : "long", "store" : "yes"},
            "user" : {"type" : "long", "store" : "yes"},
            "discussion_id" : {"type" : "long", "store" : "yes"},
            "node" : {"type" : "long", "store" : "no"},
            "prefix" : {"type" : "long", "store" : "no"},
            "thread" : {"type" : "long", "store" : "no", "index" : "no"}
        }
    }
}'
 
curl -XPUT 'http://localhost:9200/xenforo_ign/thread/_mapping' -d '
{
    "thread" : {
        "_source" : {
            "enabled" : false
        },
        "properties" : {
            "message" : {"type" : "string", "store" : "no", "index" : "no"},
            "title" : {"type" : "string", "store" : "no"},
            "date" : {"type" : "long", "store" : "yes"},
            "user" : {"type" : "long", "store" : "yes"},
            "discussion_id" : {"type" : "long", "store" : "yes"},
            "thread_id" : {"type" : "long", "store" : "yes"},
            "node" : {"type" : "long", "store" : "no"},
            "prefix" : {"type" : "long", "store" : "no"},
            "thread" : {"type" : "long", "store" : "no", "index" : "no"}
        }
    }
}'
 
$> curl 'http://localhost:9200/xenforo_ign/_settings'
{
    "xenforo_ign": {
        "settings": {
            "index.analysis.analyzer.default.language": "English", 
            "index.analysis.analyzer.default.type": "snowball", 
            "index.number_of_replicas": "1", 
            "index.number_of_shards": "5"
        }
    }
}
 
$> curl 'http://localhost:9200/xenforo_ign/_status'
{
    "_shards": {
        "failed": 0, 
        "successful": 10, 
        "total": 10
    }, 
    "indices": {
        "xenforo_ign": {
            "docs": {
                "deleted_docs": 3714, 
                "max_doc": 3437784, 
                "num_docs": 3434070
            }, 
            "index": {
                "primary_size": "1.4gb", 
                "primary_size_in_bytes": 1574012401, 
                "size": "2.9gb", 
                "size_in_bytes": 3148021951
            }, 
            "merges": {
                "current": 0, 
                "total": 3302, 
                "total_time": "33.5m", 
                "total_time_in_millis": 2010177
            }, 
            "refresh": {
                "total": 31017, 
                "total_time": "29.9m", 
                "total_time_in_millis": 1794563
            },
 
By default, ElasticSearch stores the entire document you send to it alongside the index. If you have a lot of large messages, the index and its storage footprint can grow pretty big. XenES doesn't actually use the message from the original document sent to ES (it uses the IDs and then loads the results from MySQL), so this mapping keeps that extra data out of ES. In the thread index, the message field is sent but never searched, so I disabled both storing and indexing for that field.
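To confirm the mapping took effect, it can be read back from the same endpoint with a GET. This is a minimal sketch, assuming ES is on localhost:9200 and the index is named xenforo_ign as in the commands above; the helper names mapping_url and show_mapping are made up for illustration.

```shell
# Build the _mapping URL for a type ("post" or "thread").
# Assumption: ES on localhost:9200, index named xenforo_ign.
mapping_url() {
    echo "http://localhost:9200/xenforo_ign/$1/_mapping"
}

# Read the mapping back from ES. If the PUT was accepted, the response
# should show "_source" with "enabled" : false for that type.
show_mapping() {
    curl -s -XGET "$(mapping_url "$1")?pretty=true"
}

# Usage (needs a running ES node):
#   show_mapping post
#   show_mapping thread
```

Nothing here changes the index; it only reads back what the PUT requests stored.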

P.S. You may have noticed I didn't include the profile posts index. That's because we are using MyIGN for that data, not XenForo.
 
It doesn't store the fields by default, but it does store the _source.

http://www.elasticsearch.org/guide/reference/mapping/source-field.html
The _source field is an automatically generated field that stores the actual JSON that was used as the indexed document. It is not indexed (searchable), just stored. When executing “fetch” requests, like get or search, the _source field is returned by default.

Though very handy to have around, the source field does incur storage overhead within the index.

This means that if you don't store the _source you need to explicitly store fields that you want returned. In the case of XenForo this includes fields like date, user_id, etc.
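With _source disabled, a search has to ask for the stored fields by name. The sketch below is an illustration only, not the query XenES actually sends: the search terms are placeholders, the search_posts helper is hypothetical, and it assumes the `fields` search parameter from the ES releases this thread was written against.

```shell
# Hypothetical search body: match on "message", return only the fields
# that the post mapping marks as "store" : "yes".
SEARCH_BODY='{
    "query" : {
        "query_string" : {"default_field" : "message", "query" : "example search terms"}
    },
    "fields" : ["date", "user", "discussion_id"]
}'

# Send the search to the post type (needs a running ES node).
search_posts() {
    curl -s -XGET 'http://localhost:9200/xenforo_ign/post/_search?pretty=true' -d "$SEARCH_BODY"
}

# Usage:
#   search_posts
```

Fields mapped with "store" : "no", such as message, would not come back in the hits, which is why the full post still has to be loaded from MySQL by ID.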
 
Mike, what is the value for Elastic MAX memory allocation (ES_MAX_MEM) on your setup? How many posts does IGN have now? I'm trying to determine the proper ratio between the number of posts and the total memory Elastic needs to serve search queries quickly.

Can you run Siege on your test server to emulate 30,000 online users while performing some random searches in parallel? Let me know the response time you get on search queries. I presume these are the average numbers IGN gets on a regular basis, not peak time.
 
Mike, what is the value for Elastic MAX memory allocation (ES_MAX_MEM) on your setup?
The eng who maintains our ES farm isn't in yet but I sent him an email to find out more about our config for you.

How many posts does IGN have now?
124,779 discussions & 3,370,918 posts. But when we are finished with the migration we'll have approx 70m posts.

Can you run Siege on your test server to emulate 30,000 online users while performing some random searches in parallel? Let me know the response time you get on search queries. I presume these are the average numbers IGN gets on a regular basis, not peak time.
I don't have a "real" performance environment right now so I can't really scale our testing accurately. :( Once we're fully migrated to XenForo I'll post lots of info about what we're using, our performance stats, etc.
 
I have ES up and running on our test install and the search index rebuilt.

What would I need to do to implement this? Is it literally just a case of pasting the above on the command line?
 
Cheers

I think I might need a little help if you don't mind.

1) Our data is stored at /elasticsearch/data/cloud-spurs/nodes/0

How would I then change http://localhost:9200/xenforo_ign/post/_mapping for our node? I don't quite see how that relates.

2) Also this seems very specific for IGN's setup:

Code:
$> curl 'http://localhost:9200/xenforo_ign/_status'
{
    "_shards": {
        "failed": 0, 
        "successful": 10, 
        "total": 10
    }, 
    "indices": {
        "xenforo_ign": {
            "docs": {
                "deleted_docs": 3714, 
                "max_doc": 3437784, 
                "num_docs": 3434070
            }, 
            "index": {
                "primary_size": "1.4gb", 
                "primary_size_in_bytes": 1574012401, 
                "size": "2.9gb", 
                "size_in_bytes": 3148021951
            }, 
            "merges": {
                "current": 0, 
                "total": 3302, 
                "total_time": "33.5m", 
                "total_time_in_millis": 2010177
            }, 
            "refresh": {
                "total": 31017, 
                "total_time": "29.9m", 
                "total_time_in_millis": 1794563
            },
Should I be using those numbers or somehow deriving my own? Our elasticsearch.yml is the default (apart from setting our cluster name).
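On question 1, the xenforo_ign segment of the URL is the index name (it is the same key that appears in the settings output earlier in the thread), not something derived from the data directory, so only that segment should need changing. A minimal sketch, assuming ES on the default localhost:9200 and using my_forum_index as a made-up placeholder for the real index name:

```shell
# Assumption: ES listens on localhost:9200 (the default HTTP port).
ES_HOST='http://localhost:9200'
# Placeholder: replace with the index name your XenES install writes to.
ES_INDEX='my_forum_index'

# The URL is <host>/<index>/<type>/_mapping; the on-disk data path
# (e.g. /elasticsearch/data/...) is not part of it.
POST_MAPPING_URL="${ES_HOST}/${ES_INDEX}/post/_mapping"
THREAD_MAPPING_URL="${ES_HOST}/${ES_INDEX}/thread/_mapping"

echo "$POST_MAPPING_URL"
echo "$THREAD_MAPPING_URL"

# To see which indices exist on the node (needs a running ES node):
#   curl -s "${ES_HOST}/_status?pretty=true"
```

On question 2, the block with doc counts and sizes is status output reported by the node, not configuration, so there is nothing in it to copy into your own setup.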
 