XF 2.2 MySQL using 100% CPU with this query, can't make it stop

Phid · Apr 12, 2021

Dear Sir/Madam,

We recently upgraded to XF 2.2 and today we noticed the forum was running very slow and checking the server we noticed a high CPU usage coming from MySQL, checking the db process list we only found 3 processes running the same query which is:

SQL:

UPDATE (
                SELECT content_id FROM xf_reaction_content
                WHERE content_type = ?
                AND reaction_user_id = ?
            ) AS temp
            INNER JOIN xf_post AS reaction_table ON (reaction_table.`post_id` = temp.content_id)
            SET reaction_table.`reaction_users` = REPLACE(reaction_table.`reaction_users`, ?, ?)

Any idea how to make it stop?

briansol · Apr 12, 2021

Try rebuilding the cache via tools for user and maybe go to the db and optimize/repair the table to see if it helps?

Phid · Apr 12, 2021

Did an optimize and repair, MySQL didn't found anything wrong.
Already disabled all addons and cron jobs. Haven't done a rebuild cache for user since forum is very slow and getting 500.

Any other suggestion I can try?

Phid · Apr 12, 2021

Btw, I managed to grab this message from the AdminCP:

There are scheduled jobs outstanding which have not run. Jobs may not be getting triggered when expected.

Paul B · Apr 12, 2021

Which version of mysql is running?

Speaking with the dev's, this query was tweaked to make it more efficient in 2.2 (it's related to user name changes, merges, etc.).

Phid · Apr 12, 2021

Brogan said:
Which version of mysql is running?

We are running on version 5.7.33

Btw, most cron entries have the next run set to yesterday.

Paul B · Apr 12, 2021

How large is your site?

It's closed to guests so we can't see.

Phid · Apr 12, 2021

Brogan said:
How large is your site?

It's closed to guests so we can't see.

We have 16k users (not all active) and 427k posts and 2,494,800 entries on reaction content table.
PS. We closed the site for now due to the above issue, members were complaining about being slow and getting 500.

Phid · Apr 12, 2021

Did a query into xf_job and got 67k records, all of them are XF:UserDeleteCleanUp and triggered today when the issue started:

SQL:

SELECT job_id, unique_key, execute_class, FROM_UNIXTIME(trigger_date), last_run_date
FROM xf_job
WHERE last_run_date IS NULL
ORDER BY trigger_date

Are those the jobs that are running? It seems so.
Doing a count:

SQL:

SELECT COUNT(*) FROM xf_job WHERE last_run_date IS NULL

15 min before: 67,457 records
currently we have: 67,439 records

Update: Talking with another admin, those jobs kicked after we did a mass prune of inactive users which was yesterday.
Is there a way to speed things up? This is going to take a lot.

Kirby · Apr 12, 2021

https://xenforo.com/community/threads/xf-service-user-deletecleanup-causes-full-table-scans.157291/ might help

Phid · Apr 12, 2021

Kirby said:
https://xenforo.com/community/threads/xf-service-user-deletecleanup-causes-full-table-scans.157291/ might help

Thank you, just took a look at the tables mentioned in the post and also at the xf_reaction_content and xf_post and the indexes are there

I did an explain on the query above, this is what I got:

Code:

id    select_type    table    partitions    type    possible_keys    key    key_len    ref    rows    filtered    Extra
1    SIMPLE    xf_reaction_content        ref    content_type_id_user_id,content_type_id_reaction_date,reaction_user_id    reaction_user_id    4    const    190    50.00    Using where
1    UPDATE    reaction_table        eq_ref    PRIMARY    PRIMARY    4    admin_forum.xf_reaction_content.content_id    1    100.00

Mike · Apr 12, 2021

Based on the explain output, there isn't really much that can be done in XF and it really should be an efficient query. The sub-query is as likely as optimized as reasonably possible (based on the person doing the reaction) and the join to the main table is based on the primary key.

So based on that, this really shouldn't be a slow query, unless you deleted users that reacted to a large amount of content (meaning there are a lot of entries). That would normally be unlikely for an inactive user prune.

What does the process list show for these queries? (SHOW PROCESSLIST;)

The only other potential thing is making sure your server is optimized for writes. Some basics are covered here:

InnoDB Performance Optimization Basics

Learn the basics of InnoDB performance optimization for MySQL 5.7 & 8.0. Covers hardware, OS, and key InnoDB settings to boost your database speed.

www.percona.com

The most relevant bits for something like this would tend to be innodb_flush_log_at_trx_commit and innodb_flush_method, though as you have a reasonable amount of data, I'm assuming things like this have already been changed.

Phid · Apr 12, 2021

This is the output of SHOW PROCESSLIST;

Code:

Id    User    Host    db    Command    Time    State    Info

97166    root    localhost:53472    mysql    Query    0    starting    SHOW PROCESSLIST

97188    root    localhost:53560    mysql    Sleep    291       

97428    root    localhost:53994    admin_forum    Sleep    248       

98577    admin_forum    localhost    admin_forum    Execute    4    Sending data    UPDATE (

                SELECT content_id FROM xf_reaction_content   

WHERE content_type = ?

                AND reaction_user_id = ?

            ) AS temp

            INNER JOIN xf_post AS reaction_table ON (reaction_table.[ICODE]post_id[/ICODE] = temp.content_id)

            SET reaction_table.[ICODE]reaction_users[/ICODE] = REPLACE(reaction_table.[ICODE]reaction_users[/ICODE], ?, ?)

98597    admin_forum    localhost    admin_forum    Execute    1    Sending data    UPDATE (

                SELECT content_id FROM xf_reaction_content

WHERE content_type = ?

                AND reaction_user_id = ?

            ) AS temp

            INNER JOIN xf_post AS reaction_table ON (reaction_table.[ICODE]post_id[/ICODE] = temp.content_id)

            SET reaction_table.[ICODE]reaction_users[/ICODE] = REPLACE(reaction_table.[ICODE]reaction_users[/ICODE], ?, ?)

98600    admin_forum    localhost    admin_forum    Execute    5    updating reference tables    UPDATE (

                SELECT content_id FROM xf_reaction_content

WHERE content_type = ?

                AND reaction_user_id = ?

            ) AS temp

            INNER JOIN xf_post AS reaction_table ON (reaction_table.[ICODE]post_id[/ICODE] = temp.content_id)

            SET reaction_table.[ICODE]reaction_users[/ICODE] = REPLACE(reaction_table.[ICODE]reaction_users[/ICODE], ?, ?)

98606    admin_forum    localhost    admin_forum    Execute    2    Sending data    UPDATE (

                SELECT content_id FROM xf_reaction_content

WHERE content_type = ?

                AND reaction_user_id = ?

            ) AS temp

            INNER JOIN xf_post AS reaction_table ON (reaction_table.[ICODE]post_id[/ICODE] = temp.content_id)

            SET reaction_table.[ICODE]reaction_users[/ICODE] = REPLACE(reaction_table.[ICODE]reaction_users[/ICODE], ?, ?)

Going to have a look at innodb_flush_log_at_trx_commit

Phid · Apr 12, 2021

I logged the slow queries and is indeed taking the most on the one mentioned in the first post:

Code:

# Query_time: 5.349845  Lock_time: 3.955214 Rows_sent: 0  Rows_examined: 140370
SET timestamp=1618266909;
UPDATE (
                SELECT content_id FROM xf_reaction_content
                WHERE content_type = 'post'
                AND reaction_user_id = '0'
            ) AS temp
            INNER JOIN xf_post AS reaction_table ON (reaction_table.`post_id` = temp.content_id)
            SET reaction_table.`reaction_users` = REPLACE(reaction_table.`reaction_users`, '\"user_id\":326220,\"username\":\"momo30\"', '\"user_id\":0,\"username\":\"momo30\"');

# Query_time: 5.249052  Lock_time: 3.945376 Rows_sent: 0  Rows_examined: 140378
SET timestamp=1618266910;
UPDATE (
                SELECT content_id FROM xf_reaction_content
                WHERE content_type = 'post'
                AND reaction_user_id = '0'
            ) AS temp
            INNER JOIN xf_post AS reaction_table ON (reaction_table.`post_id` = temp.content_id)
            SET reaction_table.`reaction_users` = REPLACE(reaction_table.`reaction_users`, '\"user_id\":326221,\"username\":\"lamerhouse\"', '\"user_id\":0,\"username\":\"lamerhouse\"');

# Query_time: 1.402656  Lock_time: 0.000035 Rows_sent: 0  Rows_examined: 140244
SET timestamp=1618266831;
UPDATE (
                SELECT content_id FROM xf_reaction_content
                WHERE content_type = 'post'
                AND reaction_user_id = '0'
            ) AS temp
            INNER JOIN xf_post AS reaction_table ON (reaction_table.`post_id` = temp.content_id)
            SET reaction_table.`reaction_users` = REPLACE(reaction_table.`reaction_users`, '\"user_id\":326162,\"username\":\"Down7\"', '\"user_id\":0,\"username\":\"Down7\"');

# Query_time: 5.627615  Lock_time: 3.936965 Rows_sent: 0  Rows_examined: 140402
SET timestamp=1618266945;
UPDATE (
                SELECT content_id FROM xf_reaction_content
                WHERE content_type = 'post'
                AND reaction_user_id = '0'
            ) AS temp
            INNER JOIN xf_post AS reaction_table ON (reaction_table.`post_id` = temp.content_id)
            SET reaction_table.`reaction_users` = REPLACE(reaction_table.`reaction_users`, '\"user_id\":326248,\"username\":\"audrius\"', '\"user_id\":0,\"username\":\"audrius\"');

Mike · Apr 13, 2021

So looking at the slow query log, I think this is giving a bit more detail. It looks like of the users you deleted, you may have deleted many which did react to things and that potentially makes this process slower over time.

These rebuilds happen after the reassignment happens (which is the most fundamental and important part). As such, it ends up looking at all reactions by now-deleted users, so this query is getting slower over time as you have more reactions to consider.

Ideally, I'd say let it run through, though if you have 67,000 records to move through, that's probably not ideal. This is primarily a cache related updated, so it's not the end of the world if this doesn't get updated. So you could temporarily add the line in red to src/XF/Reaction/AbstractHandler.php:

Rich (BB code):

public function updateRecentCacheForUserChange($oldUserId, $newUserId, $oldUsername, $newUsername)
{
   return;
   if (empty($this->contentCacheFields['recent']))
   {
      return;
   }

Once the jobs have cleared, be sure to remove it.

Xon · Apr 13, 2021

The problem is you've got a large number of users who have been removed, and this has generated a very large number of rows which match xf_reaction_content.reaction_user_id = 0.

This is happening as the xf_reaction_content.reaction_user_id update is fast, but the reaction-cache update snowballs and gets slower and slower.

I think a fix would be that stepReassignContent should not update xf_reaction_content.reaction_user_id and stepRebuildLikes/stepRebuildReactions should use the actual user id for the xf_reaction_content.reaction_user_id = ? lookup.

Then have a cleanup step inserted after stepRebuildLikes/stepRebuildReactions to zero out reaction_user_id

Phid · Apr 13, 2021

Xon said:
The problem is you've got a large number of users who have been removed, and this has generated a very large number of rows which match xf_reaction_content.reaction_user_id = 0.

This is happening as the xf_reaction_content.reaction_user_id update is fast, but the reaction-cache update snowballs and gets slower and slower.

I think a fix would be that stepReassignContent should not update xf_reaction_content.reaction_user_id and stepRebuildLikes/stepRebuildReactions should use the actual user id for the xf_reaction_content.reaction_user_id = ? lookup.

Then have a cleanup step inserted after stepRebuildLikes/stepRebuildReactions to zero out reaction_user_id

The steps you mentioned sounds more efficient, I will take a look into it! I'm not a PHP developer (I'm more of a .NET and JS guy) so I may perhaps struggle with doing such changes... will see. Would be nice if XF devs consider your suggestion since this scenario has room for improvements, what do you think @Mike ?

Mike said:
So looking at the slow query log, I think this is giving a bit more detail. It looks like of the users you deleted, you may have deleted many which did react to things and that potentially makes this process slower over time.

These rebuilds happen after the reassignment happens (which is the most fundamental and important part). As such, it ends up looking at all reactions by now-deleted users, so this query is getting slower over time as you have more reactions to consider.

Ideally, I'd say let it run through, though if you have 67,000 records to move through, that's probably not ideal. This is primarily a cache related updated, so it's not the end of the world if this doesn't get updated. So you could temporarily add the line in red to src/XF/Reaction/AbstractHandler.php:

Rich (BB code):

public function updateRecentCacheForUserChange($oldUserId, $newUserId, $oldUsername, $newUsername) { return; if (empty($this->contentCacheFields['recent'])) { return; }

Once the jobs have cleared, be sure to remove it.

Thanks for your reply, we opted to make a backup of the xf_jobs table and prune the ones that were slowing down the server.
I will give it a shot to @Xon suggestion and if I get into any trouble doing such changes I may try yours and restore the backup so the cleaning operation concludes.

Mike · Apr 13, 2021

Xon's change is a more significant undertaking. What I was proposing was really just a simple workaround to deal with your specific problem right now.

Phid said:
Thanks for your reply, we opted to make a backup of the xf_jobs table and prune the ones that were slowing down the server.

The jobs in question do quite a bit more than the query we've discussed here, so you really do want those to all run through safely to ensure that you're not leaving a lot of other orphaned data. The change I suggested skips the one very specific bit that was slow which isn't really significant.

So I would definitely recommend letting the jobs run with the workaround rather than completely removing them.

Phid · Apr 13, 2021

Mike said:
Xon's change is a more significant undertaking. What I was proposing was really just a simple workaround to deal with your specific problem right now.

Thanks Mike, I meant to say the suggestions by Xon to be considered perhaps by a future update.

Mike said:
The jobs in question do quite a bit more than the query we've discussed here, so you really do want those to all run through safely to ensure that you're not leaving a lot of other orphaned data. The change I suggested skips the one very specific bit that was slow which isn't really significant.

So I would definitely recommend letting the jobs run with the workaround rather than completely removing them.

Yes, I'm aware of it... we just needed to have the forum running back again until there was a workaround.

AndyB · Apr 13, 2021

Mike said:
This is primarily a cache related updated, so it's not the end of the world if this doesn't get updated.

You might like this add-on:

https://xenforo.com/community/resources/rebuild-post-reactions-cache.7035/

XF 2.2 MySQL using 100% CPU with this query, can't make it stop

Member

Well-known member

Member

Member

There are scheduled jobs outstanding which have not run. Jobs may not be getting triggered when expected.​

XenForo moderator

Member

XenForo moderator

Member

Member

Well-known member

Member

XenForo developer

Member

Member

XenForo developer

Well-known member

Member

XenForo developer

Member

Well-known member

We value your privacy

There are scheduled jobs outstanding which have not run. Jobs may not be getting triggered when expected.