Fast reindex for large site

Jim Boy

Well-known member
Following an upgrade of Elasticsearch to 5.4 (the index was originally created under 0.90) and installing @Xon's excellent plugins, I have to do a full re-index.

Has anyone got an alternative method for doing a fast re-index? With 50 million posts the normal admin system is impossibly slow, and really 50 million items is bugger all for Elasticsearch. I do 400 million+ docs a day on another site.
 
Alas it ain't so fast. I am still looking at several days, and it seems bound on CPU rather than on the database or Elasticsearch.

It would be OK if it were multi-threaded, but as it is I can't scale this out :(
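One common way around a single-threaded rebuild is to split the post_id space into contiguous ranges and give each range to its own worker process. The Python sketch below only illustrates the splitting step. The function name `split_id_range` and the worker count are hypothetical and are not part of XenForo or any plugin mentioned in this thread. Post ids are usually sparse, so the chunks will not hold exactly equal numbers of rows.

```python
from typing import List, Tuple


def split_id_range(min_id: int, max_id: int, workers: int) -> List[Tuple[int, int]]:
    """Split the inclusive range [min_id, max_id] into at most `workers`
    contiguous, non-overlapping inclusive chunks of near-equal width."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    if max_id < min_id:
        return []
    total = max_id - min_id + 1
    base, extra = divmod(total, workers)
    ranges = []
    start = min_id
    for i in range(workers):
        # The first `extra` chunks take one additional id each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            break  # more workers than ids
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


# Hypothetical example: four workers over ids 1..50,000,000.
for low, high in split_id_range(1, 50_000_000, 4):
    print(f"worker handles post_id {low}..{high}")
```

Each worker would then index only the rows whose post_id falls inside its own range, which keeps the workers from overlapping.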
 
There is something wrong with your setup then.

A million posts can be done in a couple of minutes.
 
Like what? On the discussion thread for your plugin you say that it isn't really much faster than the standard routine; it's just that it can be set to run in the background. How do I go about debugging this? As it is, it has completely stalled at around 20,000,000 posts.
 
Check the JVM logs. Most likely a problem with Java
 
Not seeing any errors there. Right now the script is consuming a lot of CPU but not actually submitting anything to Elasticsearch, which is still getting submissions from normal XenForo traffic. It appears to be spinning its wheels.
 
OK - I worked out why it started crapping out at 20,000,000 posts: I had overlooked that I set this box up as an EC2 t2.medium, and after around 20,000,000 posts it ran out of CPU credits.

Before that it was running very slowly though

So as an alternative I am trialling Logstash to get Elasticsearch populated, and so far it looks pretty good: it has indexed 33 million objects in 7 hours. I would like to validate my assumptions about how the data gets chosen and filtered, so I would appreciate some feedback on my Logstash conf file below, which does take @Xon's plugins into account.

Cheers

Code:
input {
  jdbc {
    jdbc_driver_library => "/usr/share/logstash/vendor/jar/mysql-connector-java-5.1.42-bin.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://bigfooty2.csv9hrfsbhfo.us-west-2.rds.amazonaws.com:3306/bigfooty"
    statement => "select post_id, t.node_id as node, p.thread_id as thread, p.user_id as user, p.message, case when position = 0 then t.title else NULL end title, message_state, p.post_date as date, prefix_id as prefix from xf_post p, xf_thread t where p.thread_id=t.thread_id and post_id > :sql_last_value order by post_id limit 50000 "
    jdbc_paging_enabled => "true"
    jdbc_page_size => "50000"
    last_run_metadata_path => "/usr/share/logstash/last/.logstash_jdbc_last_run"
    type => "post"
    tracking_column => "post_id"
    use_column_value => true
  }
}
filter {

    if [type] == "post"{
        if [message_state] != "visible" {
            mutate{
                add_field => {"not_visible" => true}
            }
        }
        if "" in [title]{
            mutate{
                add_field => {"xm_elasticess_title" => "%{title}"}
            }
        }
        mutate{
            remove_field => ["message_state"]
            add_field => {"discussion_id" => "%{thread}"}
        }
        # Cast the numeric columns (and the copied discussion_id) to integers.
        mutate{
            convert => {
                "thread" => "integer"
                "discussion_id" => "integer"
                "user" => "integer"
                "date" => "integer"
                "prefix" => "integer"
                "node" => "integer"
            }
        }
        if [prefix] == 0{
            mutate{
               remove_field => ["prefix"]
            }
        }
    }

}
output {
  if [type] == "post" {
    elasticsearch {
      hosts =>  ["10.38.11.11"]
      document_id =>  "%{post_id}"
      index => "bigfooty2"
      flush_size => 10000
    }
  }
}
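For a sense of scale, the figure quoted above (33 million objects in 7 hours) can be turned into a rate and a rough finishing time for the full 50 million posts. This is only back-of-the-envelope arithmetic in Python, and it assumes the rate stays constant for the rest of the run, which the thread does not confirm.

```python
def docs_per_second(docs: int, hours: float) -> float:
    """Average indexing rate over the elapsed time."""
    return docs / (hours * 3600)


def hours_to_index(total_docs: int, rate: float) -> float:
    """Hours needed to index total_docs at a constant rate (docs/s)."""
    return total_docs / rate / 3600


# Figures taken from the post: 33 million objects in 7 hours, 50 million posts in total.
rate = docs_per_second(33_000_000, 7)
eta = hours_to_index(50_000_000, rate)
print(f"{rate:.0f} docs/s, about {eta:.1f} hours for 50 million posts")
```

At that pace the whole index would take roughly half a day rather than the several days reported for the earlier rebuild.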
 
@Jim Boy I'm fairly sure you need to map post_id to _id, and xm_elasticess_title should be added if title is not empty (vs. if it is empty, if I'm reading that right?).

This doesn't add tags, but adding those to the first post's title text isn't really needed.
 
Hey @Xon, thanks for the reply

"@Jim Boy I'm fairly sure you need to map post_id to _id,"
_id gets assigned by using "document_id" in the elasticsearch output. It is illegal to assign it explicitly to '_id' in a mutate statement.
"and xm_elasticess_title should be added if title is not empty (vs. if it is empty, if I'm reading that right?)."
I don't think you are reading that right. The 'if "" in [title]' condition really only asks whether the field exists and is a string. I only set title on the first post of any thread, as I am guessing that is how it works for XenForo. Not sure if Elasticsearch stores empty strings, but seeing as xm_elasticess_title has min-ngram=5 I think the question is irrelevant. I am really not too sure whether my assumption about XF behaviour is correct, and I am concerned I may be missing other scenarios.
"This doesn't add tags, but adding those to the first post's title text isn't really needed."
Not actually sure what you are saying here.
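To make the intended filter behaviour easier to check, here is a rough Python mirror of the filter block from the conf file earlier in the thread. It is a simplified sketch and not Logstash's exact semantics: the function name `row_to_document` is hypothetical, extra event fields such as type and post_id are left out of the body, and the title test follows the explanation above (the field exists and is a string). The document id is returned separately from the body, in the same way that document_id sets _id outside the document itself.

```python
from typing import Any, Dict, Tuple


def row_to_document(row: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Return (document id, document body) for one joined xf_post/xf_thread row."""
    doc: Dict[str, Any] = {
        "node": int(row["node"]),
        "thread": int(row["thread"]),
        "discussion_id": int(row["thread"]),  # copy of thread, as in the conf file
        "user": int(row["user"]),
        "date": int(row["date"]),
        "message": row["message"],
    }
    # Flag anything that is not visible; message_state itself is dropped.
    if row["message_state"] != "visible":
        doc["not_visible"] = True
    # The SQL only returns a title for the first post of a thread (position = 0).
    title = row.get("title")
    if isinstance(title, str):
        doc["title"] = title
        doc["xm_elasticess_title"] = title
    # A prefix of 0 means "no prefix", so the field is omitted.
    prefix = int(row.get("prefix") or 0)
    if prefix != 0:
        doc["prefix"] = prefix
    return str(row["post_id"]), doc


# Made-up rows for illustration only.
first_post = {"post_id": 1, "node": 2, "thread": 3, "user": 4, "message": "hi",
              "title": "Hello", "message_state": "visible",
              "date": 1500000000, "prefix": 0}
reply = dict(first_post, post_id=2, title=None, message_state="deleted", prefix=5)

print(row_to_document(first_post))
print(row_to_document(reply))
```

Running a few real rows through a mirror like this and comparing the result with what XenForo itself writes to the index is one way to spot missing scenarios.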
 