Fixed Bug: Cron Job runner generates infinite loop in some circumstances

Sim

Well-known member
Affected version
2.2b1 but also affects 2.0 and 2.1 with less impact
While developing a new version of my CLI JobRunner with enhanced functionality beyond that provided by the core xf:run-jobs command, I kept coming across random periods where the code ran indefinitely until some other limit was reached, consuming large amounts of CPU while doing so.

After literally 3 days (!!!) of debugging and tracing through code, I've identified a bug in the core which affects both xf:run-jobs and the activity based job trigger for cron tasks.

Some notes:
  • Bug has existed since 2.0
  • activity based cron would very rarely experience any issue and is unlikely it would be noticed if it did
  • infinite loop occurs only with server based CLI job runner (refer also to bug report: https://xenforo.com/community/threads/xf-run-jobs-may-never-stop.183590/ ) - fixing that bug would mitigate the impact of this bug somewhat, but not resolve it completely.
  • circumstances required to reproduce the bug are extremely difficult to set up - hence very difficult to debug
  • might go unnoticed on a production server, but could potentially cause high server load
Cause:

In XF\Job\Cron, the cron entries to execute are selected using a finder query:

PHP:
        $entries = $this->app->finder('XF:CronEntry')
            ->whereAddOnActive()
            ->where('active', 1)
            ->where('next_run', '<', \XF::$time)
            ->order('next_run');

The where('next_run', '<', \XF::$time) code is the problem because it will not execute cron entries where the next run time is exactly equal to the current \XF::$time

When the job manager selects jobs to run via XF\Job\Manager::getRunnable() - it selects jobs which are <= \XF::$time ... which can see us sometimes get into the situation where we've got runnable jobs (including the Cron job), but we never actually execute any cron entries (because we only execute cron entries earlier than \XF::$time), which means that the next runnable jobs never get updates and we continue back through our loop, continually executing the Cron job, but never executing any cron tasks.

This only manifests itself when the next run time on the cron entries happens to exactly match the current \XF::$time - which on a production server will potentially be somewhat random and occur infrequently (especially when jobs are triggered via the activity trigger rather than server trigger).

However, on my dev machine, running unix cron with almost no additional load - the xf:run-jobs command executes very consistently and so matches are frequent (but still random enough to make debugging this incredibly painful).

If you want to try and reproduce this problem, I can create an addon for you with full tracing code included to demonstrate the issue - but there's still no guarantee that this will trigger the issue in your environment! It requires you to write large amounts of debug tracing information to log files and then wait for the problem to manifest (which will be evidenced by the log file that is several MB in size rather than the usual several KB in size) - which can be very time consuming.

Or you can just take my word for it (as I said, I've literally spent 3 days debugging this!!!) and simply change the < in the above finder query to <=, which has so far solved the issue on my dev machine.

Fix:

In XF\Job\Cron lines 14-18, change finder query to the following:

PHP:
        $entries = $this->app->finder('XF:CronEntry')
            ->whereAddOnActive()
            ->where('active', 1)
            ->where('next_run', '<=', \XF::$time)
            ->order('next_run');

Note that when combined with the following bug: https://xenforo.com/community/threads/xf-run-jobs-may-never-stop.183590/ ... these two bugs can see the xf:run-jobs command run in a loop indefinitely, stopping only after more than 1 minute when a second xf:run-jobs command begins to execute - which is exactly the reason I suggested limiting the max run time in that other bug report - to avoid issues like this (even though this one was caused by a bug - it could conceivably be caused by a legitimate job).
 
(ironically, this bug has made it easier to test certain functionality of my new JobRunner addon :rolleyes: )
 
Thank you for reporting this issue, it has now been resolved. We are aiming to include any changes that have been made in a future XF release (2.2.0 Beta 2).

Change log:
Allow cron entries scheduled to run at the current timestamp to run accordingly
There may be a delay before changes are rolled out to the XenForo Community.
 
Top Bottom