Fixed xf_session_activity: MySQL query error [1205]: Lock wait timeout exceeded; try restarting transaction

PaulB

Well-known member
Affected version
2.2.7 Patch 1
We recently upgraded from XF1.5.x to XF2.2.7 Patch 1. This was quite an ordeal, as we had a lot of custom add-ons to port, but it's gone relatively well so far. That being said, xf_session_activity has become an issue. Today I woke up to 18 pages of error logs, and that's been happening consistently since we upgraded.

As a disclaimer, we're not using the MEMORY engine, which seems to be the default for xf_session_activity. We can't use it: except under very limited circumstances, MEMORY can't be used in conjunction with replication:
When a replica server shuts down and restarts, its MEMORY tables become empty. This causes the replica to be out of synchrony with the source and may lead to other failures...

It also shouldn't matter--in fact, in our case, it's quite likely that using the MEMORY engine would exacerbate the issue:
Despite the in-memory processing for MEMORY tables, they are not necessarily faster than InnoDB tables on a busy server, for general-purpose queries, or under a read/write workload. In particular, the table locking involved with performing updates can slow down concurrent usage of MEMORY tables from multiple sessions.

Here's the problematic query:
SQL:
-- XFDB=noForceAllWrite
INSERT INTO xf_session_activity
    (`user_id`, `unique_key`, `ip`, `controller_name`, `controller_action`, `view_state`, `params`, `view_date`, `robot_key`)
VALUES
    (?, ?, ?, ?, ?, ?, ?, ?, ?)
ON DUPLICATE KEY UPDATE ip = VALUES(ip),
    controller_name = VALUES(controller_name),
    controller_action = VALUES(controller_action),
    view_state = VALUES(view_state),
    params = VALUES(params),
    view_date = VALUES(view_date),
    robot_key = VALUES(robot_key)

And the schema:
SQL:
CREATE TABLE `xf_session_activity` (
  `user_id` int unsigned NOT NULL,
  `unique_key` varbinary(16) NOT NULL,
  `ip` varbinary(16) NOT NULL DEFAULT '',
  `controller_name` varchar(100) CHARACTER SET ascii COLLATE ascii_bin NOT NULL,
  `controller_action` varchar(75) CHARACTER SET ascii COLLATE ascii_bin NOT NULL,
  `view_state` enum('valid','error') CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL,
  `params` varbinary(100) NOT NULL,
  `view_date` int unsigned NOT NULL,
  `robot_key` varbinary(25) NOT NULL DEFAULT '',
  PRIMARY KEY (`user_id`,`unique_key`) USING BTREE,
  KEY `view_date` (`view_date`) USING BTREE,
  KEY `user_id_view_date` (`user_id`,`view_date`)
) ENGINE=InnoDB DEFAULT CHARSET=ascii COLLATE=ascii_bin

Differences from vanilla:
  • Engine is InnoDB
  • Collation, which shouldn't really affect anything here other than reducing the size of each row.
  • An additional index
It seems to be largely the result of crawlers that are requesting multiple pages simultaneously. Looking through MySQL's bugs, I'm noticing that they've been having a lot of trouble getting the locking behavior right for ON DUPLICATE KEY UPDATE, REPLACE INTO, and INSERT IGNORE lately. ON DUPLICATE KEY UPDATE seems to be affected more than the other two. The current behavior is still probably better than the full table locks used by MEMORY, but it's far from ideal for a large site. There has been quite a bit of thrashing: in one MySQL release the locks will be more aggressive; people complain and the fix gets reverted, which leads to insufficient locking, and the cycle repeats.

xf_session_activity doesn't need to be perfect, especially for guests. If someone loads multiple pages simultaneously, it matters very little which ends up running the query last--since it runs post-dispatch and doesn't check that the new timestamp is ahead of the old timestamp, requests that take longer to serve can already overwrite data from newer requests that didn't take as long to execute.

This issue could likely be resolved by running three separate queries outside of a transaction--there's no need for this to be atomic:
  1. SELECT to check whether it already exists. As part of this query, retrieve view_date.
  2. If view_date is less than a second ago, don't bother performing any sort of write. Stop here.
  3. If the record exists and view_date is old and close to the pruning cutoff, use REPLACE INTO.
  4. If the record exists and view_date is not close to the pruning cutoff, use UPDATE. Unlike REPLACE INTO, this is not affected by the locking bug in MySQL, and this will be the most common write performed on most large sites.
  5. If the record doesn't exist, use INSERT IGNORE.
Other than requiring a bit more logic, this won't change the behavior in any significant way, and it will relieve the locking pressure a bit.

Additional debugging info:
  • We're running MySQL 8.0.26.
  • We've ruled out issues with other queries that interact with xf_session_activity. The locks that are causing problems are the exclusive result of the query pasted above running multiple times in parallel with the same primary key.
 
  1. If view_date is less than a second ago, don't bother performing any sort of write. Stop here.
This can be much stronger to avoid unrequired writes, something like; $viewDate + 1 >= \XF::$time.

This covers if the recent session activity was in the last second, and covers the case of a older request finishing after a shorter running request. In either case logging to the session activity table is just a wasted write.
 
Last edited:
This can be much stronger to avoid unrequired writes, something like; $viewDate + 1 >= \XF::$time.

This covers if the recent session activity was in the last second, and covers the case of a older request finishing after a shorter running request. In either case logging to the session activity table is just a wasted write.
Hmm, I tend to somewhat disagree here.

We are using xf_session_activity to display who is currently viewing a thread or forum.
We previously did this using https://xenforo.com/community/resources/user-activity-by-xon.5567/ but got complaints from users that it was not realtime.

With the suggested change it could happen (although this is not very likely) that a user is viewing a certain content and does view another content within the same second.
If this happens, the second activity would not get updated to xf_session_activity and we most likely would get complaints again :(
 
With the suggested change it could happen (although this is not very likely) that a user is viewing a certain content and does view another content within the same second.
That's already the case. It's a race to postDispatch, not preDispatch. If a user requests two pieces of content at nearly the same time, whichever loads last will win, not whichever was requested last.

The proposal here is not a significant deviation from existing behavior and comes with a negligible decrease in precision.
 
That's already the case. It's a race to postDispatch, not preDispatch. If a user requests two pieces of content at nearly the same time, whichever loads last will win, not whichever was requested last.
Not if both requests are received serially.

With the current design a request received at X.6 would update session activity of a request received at X.1 and finished at X.5.
The suggested design change would always cause the second request to not update activity in this case.

If that deviation is insignificant or not is up to interpretation, as said before I am pretty sure our users would notice this.
 
With the current design a request received at X.6 would update session activity of a request received at X.1 and finished at X.5.
The suggested design change would always cause the second request to not update activity in this case.

Correct, it increases the error slightly, but not much. It's already fairly imprecise. Users won't be affected by this imprecision unless they're opening multiple tabs simultaneously, in which case we can safely say they're on multiple pages; there's no benefit to choosing one over the other.

The cost for maintaining the existing veil of precision isn't worth it. It necessitates a lot of locking and writing on high-traffic sites. Most sites are going to solve that by ignoring it or scaling vertically instead of resolving the underlying issue.
 
Thank you for reporting this issue, it has now been resolved. We are aiming to include any changes that have been made in a future XF release (2.2.8).

Change log:
Re-work session activity updates to reduce locking pressure
There may be a delay before changes are rolled out to the XenForo Community.
 
@Jeremy P took this one on and looks good to me.

If you're interested, this is the diff. Please review/test the changes to ensure it is what you had in mind:

Diff:
diff --git a/src/XF/Repository/SessionActivity.php b/src/XF/Repository/SessionActivity.php
index 7147dd19c0..763e93f5bb 100644
--- a/src/XF/Repository/SessionActivity.php
+++ b/src/XF/Repository/SessionActivity.php
@@ -142,21 +142,104 @@ public function updateSessionActivity($userId, $ip, $controller, $action, array
         $paramList = substr($paramList, 0, 100);
         $robotKey = substr($robotKey, 0, 25);
 
-        $this->db()->query("
-            -- XFDB=noForceAllWrite
-            INSERT INTO xf_session_activity
-                (`user_id`, `unique_key`, `ip`, `controller_name`, `controller_action`, `view_state`, `params`, `view_date`, `robot_key`)
-            VALUES
-                (?, ?, ?, ?, ?, ?, ?, ?, ?)
-            ON DUPLICATE KEY UPDATE ip = VALUES(ip),
-                controller_name = VALUES(controller_name),
-                controller_action = VALUES(controller_action),
-                view_state = VALUES(view_state),
-                params = VALUES(params),
-                view_date = VALUES(view_date),
-                robot_key = VALUES(robot_key)
-        ", [$userId, $uniqueKey, $binaryIp, $controller, $action, $viewState, $paramList, \XF::$time, $robotKey]);
         // TODO: swallow errors if upgrade is pending
+        $viewDate = $this->db()->fetchOne(
+            'SELECT view_date
+                FROM xf_session_activity
+                WHERE user_id = ? AND unique_key = ?',
+            [$userId, $uniqueKey]
+        );
+        if ($viewDate)
+        {
+            if ($viewDate >= \XF::$time - 1)
+            {
+                return;
+            }
+            else if ($viewDate >= \XF::$time - 3540)
+            {
+                $this->db()->query(
+                    '-- XFDB=noForceAllWrite
+                        UPDATE xf_session_activity
+                        SET
+                            ip = ?,
+                            controller_name = ?,
+                            controller_action = ?,
+                            view_state = ?,
+                            params = ?,
+                            view_date = ?,
+                            robot_key = ?
+                        WHERE user_id = ? AND unique_key = ?',
+                    [
+                        $binaryIp,
+                        $controller,
+                        $action,
+                        $viewState,
+                        $paramList,
+                        \XF::$time,
+                        $robotKey,
+                        $userId,
+                        $uniqueKey
+                    ]
+                );
+            }
+            else
+            {
+                $this->db()->query(
+                    '-- XFDB=noForceAllWrite
+                        REPLACE INTO xf_session_activity (
+                            user_id,
+                            unique_key,
+                            ip,
+                            controller_name,
+                            controller_action,
+                            view_state,
+                            params,
+                            view_date,
+                            robot_key
+                        ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)',
+                    [
+                        $userId,
+                        $uniqueKey,
+                        $binaryIp,
+                        $controller,
+                        $action,
+                        $viewState,
+                        $paramList,
+                        \XF::$time,
+                        $robotKey
+                    ]
+                );
+            }
+        }
+        else
+        {
+            $this->db()->query(
+                '-- XFDB=noForceAllWrite
+                    INSERT IGNORE INTO xf_session_activity (
+                        user_id,
+                        unique_key,
+                        ip,
+                        controller_name,
+                        controller_action,
+                        view_state,
+                        params,
+                        view_date,
+                        robot_key
+                    )
+                    VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)',
+                [
+                    $userId,
+                    $uniqueKey,
+                    $binaryIp,
+                    $controller,
+                    $action,
+                    $viewState,
+                    $paramList,
+                    \XF::$time,
+                    $robotKey
+                ]
+            );
+        }
     }
 
     public function updateUserLastActivityFromSession()
 
The cost for maintaining the existing veil of precision isn't worth it. It necessitates a lot of locking and writing on high-traffic sites. Most sites are going to solve that by ignoring it or scaling vertically instead of resolving the underlying issue.
It might not be worth it for sites with a lot of traffic, but it is worth it for sites with low traffic where users do expect the current behaviour.

Making smth, that already isn't that precise even more unpresice is not a solution to fix the underlying issue (XenForo does treat every request by a guest from the same IP address as the same visitor), it's a workaround.

I am somewhat disappointed at the "solution" that is going to be implemented, I'll most likely have to revert this to the previous code via a class extension.

But as that is pretty easy it's not the end of the day :)
 
Please review/test the changes to ensure it is what you had in mind

Will do.

Making smth, that already isn't that precise even more unpresice is not a solution to fix the underlying issue (XenForo does treat every request by a guest from the same IP address as the same visitor), it's a workaround.

It's not really any more imprecise than it was previously, especially on sites that aren't trying to shave off milliseconds from response times. Furthermore, the old behavior opens the doors for someone to take down a smaller site by simply mashing F5.

It's also not a workaround: this is the best traditional RDBMSes have to offer without going overboard. The alternatives are:
  • Eliminate session activity altogether
  • Switch to a specialized DBMS optimized for handling this scenario
  • Create an append-only log of session activity and shift the burden of determining the most recent value to the reader
Synchronization of a large volume of writes in this manner is a hard task. It can either be cheap in terms of resources, or it can be precise. You can't have both. The MySQL team has been tuning that knob for years, yet it remains an issue--that's a testament to how difficult a problem it is. Oversimplifying it by restoring the old behavior may seem acceptable on the surface, but as soon as a user decides to reopen 150 tabs that they happen to have saved, you're going to run into issues.
 
I've attached two patch files. The first is Jeremy P's fix, but with the tab -> space conversion reverted so it can be applied easily with git apply. The second is an alternative that I wrote; it deduplicates some code, tries to mitigate the possibility of pruneExpiredActivityRecords and updateSessionActivity fallout out-of-sync, and adds a config option to restore the < 2.2.8 behavior preferred by @Kirby.

I haven't tested either patch yet.
 

Attachments

LGTM, though I haven't tested it thoroughly yet. You could do what @Xon suggested and move the test for $viewDate <= \XF::$time - 1 to the top of the function to cut down on some processing, but it'll be a pretty small optimization, so I'm fine with it the way it is.

@Kirby, I haven't tested the config option at all, so you may want to check that it behaves as expected on your end.
 
Top Bottom