Alert pruning leads to memory exhaustion on a busy forum

Affected version
2.2.6 Patch 1

MilkyMeda

Active member
On a busy forum with a very busy alert table -around 8m records- hourly cron task fails due to memory exhaustion. The responsible part of the task is the alert pruning. The memory usage easily exceeds 256 MB which is the limit I use which is not low. I was able to overcome this only by using @Xon's Alert Improvements add-on which has a solution for this.
 

Xon

Well-known member
pruneViewedAlerts/pruneUnviewedAlerts which call deleteAlertsInternal; have major structural weaknesses;

The function does not page through the data set it is working on. This leads to arbitrary sized fetches and deletes.

In detail;
  • deleteAlertsInternal fetches arbitrary number of alerts into php
    • Very high memory usage and php spending time computing in php what can be done in SQL
  • It deletes an arbitrary number of rows at once
    • This results in an in clause which can be arbitrarily sized.
    • This can cause very large transaction sizes which make deadlocks more risky.
  • N+1 user alert table updates
    • This is done inside a transaction pinning a large number of alerts a user could still potentially view and try to update.

Broadly speaking Alert Improvements does the following;
  • pruneXAlerts instead triggers a unique job
  • Records users it intends to update by inserting into a pending user rebuild table.
    • If a deadlock occurs reschedule the job for 1-5 seconds into the future!
  • Page through alerts deleting alerts deleting a fixed number of alerts per iteration
    • If a deadlock occurs reduce the batch size and reschedule the job for a second into the future.
  • On completing deleting alerts; trigger a user alert total rebuild job
The user alert total rebuild job does;
  • For each user in the pending rebuild table;
    • Update alert totals by counting all alerts for that user, which are not considered to be expired.
    • Update the user record
    • Delete the row from the pending rebuild table

One tricky bit is ensuring tables as always accessed in the same order, which leads to a little hoop jumping but ensures consistent transactional locking which reduces deadlocks.

By separating deleting alerts by expired type (unviewed vs viewed), and then rebuilding by user; this avoid pulling arbitrary data into php and ensure the batch size for delete operations are sanely scoped.

Explicit deadlock detection handling is used to reschedule to attempt to avoid avoid hotspots due to user-activity, as trying again immediately tends to cause more deadlocks.

Another deadlock avoidance method is to explicitly scope selects/updates to only consider non-expired alerts. This ensures the user can't somehow trigger row locks on data that the system will want to delete. And gives the appearance everything is still working even if the data hasn't been purged yet.
 
Top