We run a fairly large network of XF sites, running on AWS against Aurora MySql (our 100+ XF sites are partitioned across several writer/read replica clusters). It runs well for quite awhile (weeks, sometimes months) but with no obvious trigger the DB writer in a cluster pegs 100% CPU and requests start backing up and all sites partitioned to that DB cluster obviously stop responding with 5xx errors and timeouts.
Normally even under a decent load our writer CPU doesn't get past 30-40%. We usually have to reboot the DB writer (failing over to a replica often doesn't help, likely since existing transactions/locks are still blocking) and the site tends to come right back up and usually does not run into the issue again that day. Also while it might be somewhat related to traffic, there are very heavy days in which it runs without incident, and some lesser traffic days that hit this problem.
AWS insights points to a huge spike in wait/io/table/sql/handler, lists the typical SELECT from xf_thread as the top query, though I feel that's a downstream symptom, not a cause.
Anyone run into something like this with a similar configuration and have some advice? Or perhaps some guidance for digging into the state once it happens to uncover what the real underlying lock is on so we know what part of the code is making the query so we can understand the cause better?
Normally even under a decent load our writer CPU doesn't get past 30-40%. We usually have to reboot the DB writer (failing over to a replica often doesn't help, likely since existing transactions/locks are still blocking) and the site tends to come right back up and usually does not run into the issue again that day. Also while it might be somewhat related to traffic, there are very heavy days in which it runs without incident, and some lesser traffic days that hit this problem.
AWS insights points to a huge spike in wait/io/table/sql/handler, lists the typical SELECT from xf_thread as the top query, though I feel that's a downstream symptom, not a cause.
Anyone run into something like this with a similar configuration and have some advice? Or perhaps some guidance for digging into the state once it happens to uncover what the real underlying lock is on so we know what part of the code is making the query so we can understand the cause better?