XF 1.3 Elasticsearch + Locked DB Strange Errors

Wesker

Well-known member
Please note that one of the well-known third-party custom development teams at XenForo asked us to post a ticket here, as they are stuck on this issue.

Earlier today, a disk drive reached its maximum capacity, which briefly took the server down until we truncated the DB disk drive.

Since the server has been restored, everything appears to be running fine except the server error log is going crazy with errors.

Error 1: Elasticsearch

Elasticsearch is installed and appears to be running

ES running
[root@server config]# curl http://127.0.0.1:9200
{
"status" : 200,
"name" : "Arclight",
"cluster_name" : "elasticsearch",
"version" : {
"number" : "1.4.5",
"build_hash" : "2aaf797f2a571dcb779a3b61180afe8390ab61f9",
"build_timestamp" : "2015-04-27T08:06:06Z",
"build_snapshot" : false,
"lucene_version" : "4.10.4"
},
"tagline" : "You Know, for Search"
}

However, the server error log is being bombarded with pages and pages of errors:

XenForo_Exception: Elasticsearch server returned no response. Is it running? Elasticsearch indexing failed - library/XenES/Search/SourceHandler/ElasticSearch.php:833

After this 3rd party team reviewed the

Error 2: Mysqli statement execute error : Lock wait timeout exceeded; try restarting transaction

The next issue is that the DB is being bombarded with the same SQL errors over and over. The third-party team has informed me the server looks fine.

[2016-04-07 14:49:26,269][DEBUG][action.index ] [Arclight] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]

Also note I'm still using 1.3 (I will upgrade to 1.5 once one more mod has been fixed to support it).
 
Both of these make me think there's significant server load. What are the load averages?
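For reference, here is a minimal sketch of how the load averages could be read and put in context, assuming a Linux/Unix host with Python available. The function name `load_report` and the "more than 1.0 per core over five minutes" threshold are illustrative assumptions (a common rule of thumb), not anything specific to XenForo.

```python
import os


def load_report(load_averages, cpu_count):
    """Normalise the 1/5/15-minute load averages by the number of CPU cores.

    load_averages: a (1m, 5m, 15m) tuple, e.g. from os.getloadavg().
    cpu_count: number of cores, e.g. from os.cpu_count().
    Returns (per_core_dict, overloaded_flag).
    """
    one, five, fifteen = load_averages
    per_core = {
        "1m": one / cpu_count,
        "5m": five / cpu_count,
        "15m": fifteen / cpu_count,
    }
    # Assumption: sustained load above 1.0 per core is treated as "overloaded".
    # This is a rule of thumb, not a hard limit.
    overloaded = per_core["5m"] > 1.0
    return per_core, overloaded


# Usage on the server itself (Unix only):
#   per_core, overloaded = load_report(os.getloadavg(), os.cpu_count() or 1)
#   print(per_core, "overloaded" if overloaded else "ok")
```

The same three numbers appear at the end of the `uptime` output; dividing by the core count just makes them comparable across machines.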

Is there anything in Elasticsearch's log?

In terms of MySQL, what is showing in the process list (SHOW FULL PROCESSLIST;)? Running queries? Long-running connections? How about InnoDB (SHOW ENGINE INNODB STATUS;)?
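If the process list is long, something like the following could help pick out the entries worth looking at. It is only a sketch: it assumes the rows have already been fetched as dictionaries keyed by the standard SHOW FULL PROCESSLIST column names (Id, User, Host, db, Command, Time, State, Info), for example through a dictionary-style cursor in whichever MySQL client library is in use. The function name `long_running` and the 30-second default threshold are hypothetical choices.

```python
def long_running(rows, min_seconds=30):
    """Return non-idle process list entries that have been in their
    current state for at least min_seconds, longest first.

    rows: iterable of dicts keyed by SHOW FULL PROCESSLIST column names.
    """
    hits = [
        row for row in rows
        # "Sleep" means an idle connection, not a running query.
        if row["Command"] != "Sleep" and (row["Time"] or 0) >= min_seconds
    ]
    return sorted(hits, key=lambda row: row["Time"] or 0, reverse=True)


# Made-up sample rows, purely to show the expected shape of the input.
sample = [
    {"Id": 1, "Command": "Sleep", "Time": 400, "State": "", "Info": None},
    {"Id": 2, "Command": "Query", "Time": 45, "State": "updating", "Info": "UPDATE ..."},
    {"Id": 3, "Command": "Query", "Time": 2, "State": "Sending data", "Info": "SELECT ..."},
    {"Id": 4, "Command": "Query", "Time": 61, "State": "updating", "Info": "UPDATE ..."},
]

for row in long_running(sample):
    print(row["Id"], row["Time"], row["State"], row["Info"])
```

Idle connections are filtered out on purpose, since a long "Sleep" time only says the connection has been open and unused, not that a query is stuck.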

If this all happened due to a disk issue, then the side effects of that are likely the issue. This could be, for example, data corruption in either location, which may be causing various issues as a knock-on effect.
 
Both of these make me think there's significant server load. What are the load averages?

Will check on this in a moment. It doesn't appear that we're being DDoSed. The server is running fine; it's just processing DB queries.

Is there anything in Elasticsearch's log?

Elasticsearch log = /admin.php?brsql-log/ correct?

If so, yes, many searches. We can still load Elasticsearch, but there are errors everywhere, so I assume others may be getting errors.

In terms of MySQL, what is showing in the process list (SHOW FULL PROCESSLIST;)? Running queries? Long-running connections? How about InnoDB (SHOW ENGINE INNODB STATUS;)?

Long running connections. Let me check on the rest for you.

If this all happened due to a disk issue, then the side effects of that are likely the issue. This could be, for example, data corruption in either location, which may be causing various issues as a knock-on effect.

This is probably our best lead. What do you recommend we do here so we can quickly check this?
 
Okay waiting for more logs to compile here. File was accidentally removed. Will send that in a bit.

The main issue, though, is the DB locks. It could all be connected to one thing.
 
1). Still no luck so far on the lock/deadlock in MySQL.
2). I have yet to experience any issues with the search. We're getting errors, but everything seems to be running smoothly with it.

#1 is obviously the biggest issue.
 
You can send it in a conversation.

You'll also want to look at any MySQL logs, particularly from any point after the disk issue. They'll be in /var/log most likely.
 
This isn't really something that's officially covered by our support (server config). The fact that everything happened after a hardware/disk issue rather than any change to XenForo certainly points to that as the underlying problem. I don't know what knock-on effects it could have had on the data stores of Elasticsearch and MySQL, or even on the general functioning of the server.

Based on the logs you sent me, everything is just running very slowly. I see evidence of queries waiting 30 seconds for a particular lock to be released. It's a trivial query itself, though I can't necessarily know what else the potential query is doing. There are a couple of transactions running over 30 seconds (without showing them waiting in MySQL). The vast majority of XF pages would be under a second in normal operation.

Unfortunately, I don't have a great recommendation. It's worth pointing out that, in general, the issues seem to be with writes. They're slow in MySQL, and your Elasticsearch error was a failed indexing operation. That might even point to something with the underlying file system. You could try restarting the services, maybe even wiping your Elasticsearch index and rebuilding it (you could do something similar by dumping and restoring the MySQL DB as well), though I'm not convinced that would help if there's no data corruption.
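Before wiping anything, it may be worth checking what Elasticsearch itself reports about its indexes. The sketch below reads the `_cluster/health` endpoint and turns the result into a one-line summary. It assumes Elasticsearch is listening on 127.0.0.1:9200, as in the curl output earlier in the thread; the helper names `fetch_health` and `summarize_health` are hypothetical, and the wording of the summaries is my own.

```python
import json
import urllib.request


def summarize_health(health):
    """Turn a parsed _cluster/health response into a short summary string."""
    status = health.get("status")
    unassigned = health.get("unassigned_shards", 0)
    if status == "green":
        return "ok"
    if status == "yellow":
        # Primaries are allocated, but some replicas are not.
        return "degraded: %d unassigned shard(s)" % unassigned
    if status == "red":
        # At least one primary shard is not allocated.
        return "failing: %d unassigned shard(s)" % unassigned
    return "unknown"


def fetch_health(base_url="http://127.0.0.1:9200", timeout=5):
    """Fetch and parse the cluster health JSON (assumed host and port)."""
    with urllib.request.urlopen(base_url + "/_cluster/health", timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Usage on the server itself:
#   print(summarize_health(fetch_health()))
```

On a single-node setup with replicas configured, "yellow" is usually expected, because there is no second node to hold the replica shards. A "red" status would be a stronger hint that the index data was damaged and that a rebuild is worth trying.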

It certainly won't solve the issue, but you probably want to turn off your add-on that shows the online user status in posts. It will exacerbate the issues a bit.
 