Occasional MySQL issue, hits 100% CPU

LukeD

Member
We run a fairly large network of XF sites on AWS against Aurora MySQL (our 100+ XF sites are partitioned across several writer/read-replica clusters). It runs well for quite a while (weeks, sometimes months), but with no obvious trigger the DB writer in a cluster pegs at 100% CPU, requests start backing up, and every site partitioned to that cluster stops responding with 5xx errors and timeouts.

Normally, even under decent load, our writer CPU doesn't get past 30-40%. We usually have to reboot the DB writer (failing over to a replica often doesn't help, likely because existing transactions/locks are still blocking); the sites then come right back up and usually don't hit the issue again that day. And while it may be somewhat traffic-related, there are very heavy days that run without incident and lighter-traffic days that hit the problem.

AWS Performance Insights shows a huge spike in the wait/io/table/sql/handler wait event and lists the typical SELECT from xf_thread as the top query, though I suspect that's a downstream symptom rather than the cause.

Has anyone run into something like this with a similar configuration and have some advice? Or perhaps some guidance on digging into the state once it happens, so we can find out what the real underlying lock is on and which part of the code is issuing the query, and understand the cause better?
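(For reference, a minimal sketch of the kind of queries that can surface the blocking transaction once it happens, assuming MySQL/Aurora 5.7 or later with the sys schema available, which ships by default on those versions; keep a session open ahead of time so you can still run them while the writer is saturated:

-- Open transactions, oldest first, with how many rows each has locked
SELECT trx_id, trx_state, trx_started, trx_rows_locked,
       trx_mysql_thread_id, LEFT(trx_query, 120) AS current_query
FROM information_schema.innodb_trx
ORDER BY trx_started;

-- Who is blocking whom (the column set varies slightly between 5.7 and 8.0,
-- but the waiting/blocking pid and query columns exist in both)
SELECT wait_started,
       waiting_pid,  LEFT(waiting_query, 100)  AS waiting_query,
       blocking_pid, LEFT(blocking_query, 100) AS blocking_query
FROM sys.innodb_lock_waits;

-- Everything running right now, plus raw engine status (latest deadlock,
-- semaphore waits, longest-running transactions)
SHOW FULL PROCESSLIST;
SHOW ENGINE INNODB STATUS;

The blocking_pid from the lock-waits view maps to the Id column of the processlist, which is usually enough to tie the lock back to a specific query and, from there, to the code issuing it.)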
 
If you have a forum with a very high number of threads in it, it can be problematic when a user (or spider) views a high-numbered page within that forum (for example page 10,000 in a forum with many, many threads). The reason is that MySQL has to internally fetch all the threads in that forum, sort them, and work out which ones land on page 10,000. Basically, the deeper into the forum's pages you go, the slower the query gets.
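For illustration, roughly what that looks like against the stock xf_thread table; the exact query XenForo generates differs, and the node_id, date, and thread_id literals below are made up, assuming 20 threads per page. The OFFSET forces InnoDB to read and discard every earlier row in sort order, so cost grows with the page number; a keyset-style "seek" query avoids that, at the price of only supporting next/previous navigation rather than jumping to an arbitrary page.

-- Page 10,000 at 20 threads per page: ~199,980 rows are read in
-- last_post_date order and thrown away before the 20 wanted rows return.
SELECT thread_id, title, last_post_date
FROM xf_thread
WHERE node_id = 5
  AND discussion_state = 'visible'
ORDER BY last_post_date DESC
LIMIT 199980, 20;

-- Keyset (seek) form: remember the last row of the previous page and
-- seek past it; cost stays roughly flat no matter how deep you go.
SELECT thread_id, title, last_post_date
FROM xf_thread
WHERE node_id = 5
  AND discussion_state = 'visible'
  AND (last_post_date, thread_id) < (1577836800, 123456)
ORDER BY last_post_date DESC, thread_id DESC
LIMIT 20;

On older MySQL versions the row-constructor comparison may not use the index well; the more portable spelling is last_post_date < x OR (last_post_date = x AND thread_id < y).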
 
Thanks, this is very likely one of our issues. I happened to stumble across the node setting that hides threads older than a certain date from end users, which prevents attempts to load those very high page numbers, and it seems to have helped a lot.
 
Yeah, manually paging thousands of pages deep is rather useless. By the time a thread is that old, search engines have already spidered it (and it's in your sitemap), and no normal user is going to page that deep looking for something. Threads past the cutoff remain fully searchable.
 
Try Releem; it's rather good at long-term analytics and optimization. It has vastly improved my MySQL cache hit ratio, along with other performance gains.
 
Where can that setting (hiding threads beyond a certain date) be found? This seems likely to be at least a similar issue to what we are running into. We are on MariaDB, but otherwise the original problem sounds like what we are experiencing: the forum sits behind Cloudflare, but it is old, with many posts and lots of threads, and when we get a traffic spike the database writer seems to fall over and take out the site.

I would also be open to any other advice on debugging or tuning things to help fix this. It probably doesn't help that the database is old and throws errors if you try to upgrade it to a newer MySQL or MariaDB. If anyone has insights on how to do that, or knows people who could help with it, that would be useful.
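(On the upgrade errors, it may be worth finding out up front which tables a newer server version would object to. A minimal sketch, using two stock XenForo tables purely as examples; run it against a copy or a replica rather than the live writer, and note the right fix depends on what the check reports:

-- Ask the server whether these tables contain anything a newer version
-- would reject (old temporal formats, obsolete collations, and so on)
CHECK TABLE xf_thread, xf_post FOR UPGRADE;

-- A flagged table can often be rebuilt in place with current defaults;
-- a dump and reload is the fallback when a rebuild is not enough.
ALTER TABLE xf_thread FORCE;

mysqlcheck --check-upgrade --all-databases runs the same check across every table, and on MariaDB mariadb-upgrade handles the system tables after the version jump.)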
 
Thread list date limit, in the node settings. It's under the advanced options, so make sure you have that ticked.
 