Partial fix Deleting users is unexpectedly slow due to missing indexes

Xon

Well-known member
I recently had to delete ~60k (spam) accounts, and I managed to almost double the deletion speed by adding the following indexes:

Code:
alter table xf_thread add index last_post_user_id (last_post_user_id);
alter table xf_report add index last_modified_user_id (last_modified_user_id);
alter table xf_warning add index warning_user_id (warning_user_id);
alter table xf_spam_cleaner_log add index applying_user_id (applying_user_id);
alter table xf_spam_cleaner_log add index user_id (user_id);
The thread one is likely the most important, as it mostly impacts account renames and deletes.
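For illustration, this is roughly the shape of work a delete or rename has to do against that denormalized last-poster data (hypothetical values and a simplified statement, not the exact XenForo query):

Code:
-- Without an index on last_post_user_id, finding the affected threads
-- is a full scan of xf_thread; EXPLAIN shows type=ALL in that case.
EXPLAIN SELECT thread_id
FROM xf_thread
WHERE last_post_user_id = 12345;

-- The kind of denormalized data a delete/rename then has to rewrite
-- (replacement values here are made up for illustration):
UPDATE xf_thread
SET last_post_user_id = 0,
    last_post_username = 'DeletedUser'
WHERE last_post_user_id = 12345;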
 
FWIW, not having that thread index is mostly intentional. It would generally only apply in the delete/rename case and it would require updates on each post made. It's ephemeral data, so I can't see any other use case for it. Thus, the benefit to this case didn't seem to be worth the overhead of the index in most cases.

The reasoning on the report table would largely be similar. With the other tables, an index is more within the realm of the type of thing we would index, as there are potential use cases for putting a condition on that column.
 
Renames are frustratingly common, and every moderated signup that is rejected is a delete which hits all those tables.

SpaceBattles has both happen fairly often, and I've documented those indexes as recommended for my User Essentials add-on, which allows user-initiated renames.
 
I've ended up mitigating the performance impact of this in one of my add-ons by breaking up the content updates, so the slow updates occur outside of a transaction and are then processed by a deferred task.

To get sane query performance while being able to control the update batch size, I needed to break it up into multiple steps. The intermediate table is required, as otherwise the update statement locks more rows than it needs and/or has to select the entire set on every iteration. This is very noticeable when dealing with users with >20000 posts.
Code:
// Clear any previous batch rows for this user.
$db->query('
    DELETE FROM useress_unc_queue_content
    WHERE batch_user_id = ?
', $existingUserId);

// Snapshot the user's content IDs into the intermediate table,
// numbering each row with a sequential batch_rank.
$db->query('
    INSERT IGNORE INTO useress_unc_queue_content (batch_user_id, batch_rank, batch_content_id)
    SELECT ?, @rownum := @rownum + 1 AS batch_rank, ' . $offsetCol . '
    FROM ' . $table . ', (SELECT @rownum := 0) r
    WHERE ' . $userIdCol . ' = ?
', array($existingUserId, $existingUserId));

// The highest batch_rank is the starting point for the count-down.
$tableOffset = $db->fetchOne('SELECT MAX(batch_rank) FROM useress_unc_queue_content WHERE batch_user_id = ?', $existingUserId);
...
// Update one slice of rows at a time, bounded by a batch_rank range.
$stmt = $db->query('
    UPDATE IGNORE ' . $table . '
    JOIN useress_unc_queue_content batch ON batch.batch_user_id = ? AND batch.batch_content_id = ' . $table . '.' . $offsetCol . '
    SET ' . implode(',', $update) . '
    WHERE ' . $userIdCol . ' = ? AND batch.batch_rank BETWEEN ? AND ?
', array($existingUserId, $existingUserId, $lower, $tableOffset));
Note: this counts down so users see their most recent content updated first, and we just keep going until the batch rank hits zero to know we are done.
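The intermediate table itself isn't shown above; something along these lines would support those queries (assumed structure, not the add-on's actual schema):

Code:
-- Assumed sketch of the batching table; column types and keys are guesses
-- based on how it is used above. The composite primary key makes the
-- INSERT IGNORE idempotent, and the rank key supports the BETWEEN range scan.
CREATE TABLE useress_unc_queue_content (
    batch_user_id INT UNSIGNED NOT NULL,
    batch_rank INT UNSIGNED NOT NULL,
    batch_content_id INT UNSIGNED NOT NULL,
    PRIMARY KEY (batch_user_id, batch_content_id),
    KEY batch_rank (batch_user_id, batch_rank)
) ENGINE=InnoDB;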

The standard XenForo rename process, which touches denormalized data, took over 75 seconds for one user rename (<15000 posts), and held a lock on xf_forum blocking updates. This prevented anyone from replying and is obviously undesirable. The new process works well even for much larger amounts of content being updated.
 
I'm going to tag this as a partial fix. I've added the "normal" user ID indexes I mentioned initially. I'm not positive about the last post-related indexes on the whole, though I can see their value in specific scenarios.

Beyond the thread table, xf_conversation_master and xf_conversation_user could potentially use last message indexes as well (though I'm not sure if the last message user ID and username are actually used from xf_conversation_user).
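If we did add those, they would presumably mirror the thread index (speculative; whether the xf_conversation_user copies are ever read is the open question above):

Code:
-- Speculative conversation equivalents of the xf_thread index;
-- not something we've committed to adding.
alter table xf_conversation_master add index last_message_user_id (last_message_user_id);
alter table xf_conversation_user add index last_message_user_id (last_message_user_id);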

There is a potential mitigation for threads and conversations though, as we have tables that track who has interacted with them (xf_thread_user_post and xf_conversation_recipient). These could be used through a join to limit the rows scanned. For users with very little content, this should be a significant speed up. I'd have to do testing to check users with more content, but even scanning 20000 rows vs 2 million should be a win. That's something to investigate at a later date, I think.
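As a rough sketch of that idea (not an existing XenForo query; values are hypothetical):

Code:
-- Speculative: drive the update through xf_thread_user_post so only threads
-- the user has posted in are examined, rather than scanning xf_thread
-- by last_post_user_id.
UPDATE xf_thread AS thread
JOIN xf_thread_user_post AS tup
    ON tup.thread_id = thread.thread_id AND tup.user_id = 12345
SET thread.last_post_user_id = 0,
    thread.last_post_username = 'DeletedUser'
WHERE thread.last_post_user_id = 12345;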
 