Thread Tagging Improvements

Thread Tagging Improvements 1.0.8 Patch Level 1

Mazzly provided free coding for this addon, but you think "he doesn't care".

That's a weird take, bro.

Q: Should you expect him to keep donating his time to this addon?
A: Probably not!
 
I am encountering a blocking error, visible in the server error log, that causes the job queue to get stuck. The collectWordsCron task attempts to save an excessive amount of data into the xf_job table, exceeding the column limit.

The Error:

Code:
XF\Db\Exception: MySQL query error [1406]: Data too long for column 'execute_data' at row 1 src/XF/Db/AbstractStatement.php:230
Stack Trace:

Code:
UPDATE  `xf_job` SET `execute_data` = ?, `trigger_date` = ?, `last_run_date` = ? WHERE job_id = ?
------------
#0 src/XF/Db/Mysqli/Statement.php(207): XF\Db\AbstractStatement->getException('MySQL query err...', 1406, '22001')
#1 src/XF/Db/Mysqli/Statement.php(83): XF\Db\Mysqli\Statement->getException('MySQL query err...', 1406, '22001')
#2 src/XF/Db/AbstractAdapter.php(96): XF\Db\Mysqli\Statement->execute()
#3 src/XF/Db/AbstractAdapter.php(323): XF\Db\AbstractAdapter->query('UPDATE  `xf_job...', Array)
#4 src/XF/Job/Manager.php(226): XF\Db\AbstractAdapter->update('xf_job', Array, 'job_id = ?', 18133810)
Analysis:

I inspected the problematic row in the xf_job table. The job class is MaZ\PotentialTagHelper:RebuildPotentialTagWords.

The execute_data column for this specific job has grown to 13.9 MiB, close to the 16 MiB cap of the MEDIUMBLOB column XenForo uses for it; the failing UPDATE was attempting to push the payload past that cap (and in any case this is far too large for a job state payload).
  • Job ID: 18133810
  • Unique Key: collectWordsCron
  • Execute Class: MaZ\PotentialTagHelper:RebuildPotentialTagWords
  • Execute Data Size: ~13.9 MiB (BLOB)
It seems the add-on is trying to carry over too much data between job steps. I had to manually delete the job and disable the add-on to clear the queue.
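
For anyone else who hits this before updating: the stuck row can be confirmed and cleared manually with queries along these lines (a workaround sketch; substitute the job_id of your own stuck row, 18133810 was mine):

SQL:
-- Confirm the oversized row and the size of its payload
SELECT job_id, unique_key, execute_class, LENGTH(execute_data) AS execute_data_bytes
FROM xf_job
WHERE unique_key = 'collectWordsCron';

-- Remove the stuck row so the rest of the queue can run again
DELETE FROM xf_job WHERE job_id = 18133810;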
 
The Fix Explanation

The error MySQL query error [1406]: Data too long for column 'execute_data' occurs because of a fundamental architectural issue in how the add-on handles data.

Originally, the add-on collected all words from thousands of threads into a single PHP array (word_count) and tried to carry this massive array between job steps via the database. Since your forum is large, this array grew to ~13.9 MiB, and the serialized job state exceeded the execute_data column limit.
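
To make the failure mode concrete, here is a simplified sketch of what happens between job steps. This is an illustration, not XenForo's actual job manager code; the variables are hypothetical stand-ins, but the final update call is the one shown in frame #4 of the stack trace above:

PHP:
<?php
// Hypothetical illustration of the failure mode, not XF's real job manager code.

// Stand-in for the real data: one counter per distinct word/phrase seen so far
$jobData = ['word_count' => array_fill_keys(['example', 'example phrase'], 1)];

// Between steps the job state is serialized and written back to xf_job.
// With hundreds of thousands of entries this payload reached ~13.9 MiB:
$executeData = serialize($jobData);

\XF::db()->update('xf_job', [
    'execute_data' => $executeData
], 'job_id = ?', 18133810); // fails with MySQL error 1406: Data too long for column 'execute_data'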

I have refactored src/addons/MaZ/PotentialTagHelper/Job/RebuildPotentialTagWords.php to switch from a "Keep All -> Write Later" logic to a "Process Batch -> Write to DB -> Flush Memory" logic.

Changes Made:

1. Flush Buffer: Instead of holding words in memory indefinitely, the script now writes to the database and clears memory whenever the number of distinct words held in RAM reaches 2,000.
2. Early Filtering: Word length and stop word checks are now done inside the loop before writing to the DB, reducing load.
3. Start-up Cleanup: The logic to clear the table has been moved to the start of the job rather than the end to ensure a clean slate.

The Fixed Code
Replace the content of src/addons/MaZ/PotentialTagHelper/Job/RebuildPotentialTagWords.php with this:

PHP:
<?php

namespace MaZ\PotentialTagHelper\Job;

use MaZ\PotentialTagHelper\PTH;
use XF\Job\AbstractRebuildJob;
use XF\Job\JobResult;

class RebuildPotentialTagWords extends AbstractRebuildJob
{
    protected $stopWords = null;

    // Maximum number of distinct words to keep in memory before writing to the database.
    // Small enough to never exceed the execute_data limit (error 1406), large enough for performance.
    protected $bufferSize = 2000;

    protected $defaultData = [
        'batch' => 1000, // batch size reduced for more frequent checks
        'start' => ['thread_id' => 0, 'title' => ''],
        'threads_remaining' => 0,
        'word_count' => [],
        'is_reset' => false // tracks whether the table has been cleared at start-up
    ];

    public function calculateOptimalBatch($expected, $done, $start, $maxTime, $maxBatch = null)
    {
        return parent::calculateOptimalBatch($expected, $done, $start, $maxTime, \XF::options()->pth_max_batch_size);
    }

    protected function getNextIds($thread, $batch)
    {
        $threadFinder = \XF::finder('XF:Thread');
        $limit_forums = \XF::options()->pth_forums;
        if ($limit_forums && $limit_forums[0] !== "0") {
            $threadFinder->where('node_id', $limit_forums);
        }
        $threadFinder->limit($batch)
            ->where('thread_id', '>', $thread['thread_id'])
            ->where('discussion_state', 'visible')
            ->where('discussion_type', '!=', 'redirect')
            ->order('thread_id', 'ASC');
        return $threadFinder->fetchColumns('thread_id', 'title');
    }

    protected function rebuildById($thread)
    {
        // 1. Clear the table when the process starts (the old logic did this at the end)
        if (empty($this->data['is_reset']) && $this->data['start']['thread_id'] == 0) {
            $this->app->db()->delete(PTH::table_potential_words, 'ignored = 0');
            $this->data['is_reset'] = true;
        }

        $words = PTH::splitWord(strtolower($thread["title"]));

        // Count individual words
        foreach (array_count_values($words) as $word => $count) {
            if (!isset($this->data['word_count'][$word])) {
                $this->data['word_count'][$word] = 0;
            }
            $this->data['word_count'][$word] += $count;
        }

        // Count multi-word chunks
        if (count($words) > 1) {
            $chunk_size = 2;
            while ($chunk_size <= \XF::options()->pth_potential_word_size) {
                foreach ($this->getChunks($words, $chunk_size) as $word) {
                    if (!isset($this->data['word_count'][$word])) {
                        $this->data['word_count'][$word] = 0;
                    }
                    $this->data['word_count'][$word] += 1;
                }
                $chunk_size++;
            }
        }

        // 2. If the number of words in memory exceeds the limit, write to the DB and clear RAM.
        // This is what resolves the "Data too long" error.
        if (count($this->data['word_count']) >= $this->bufferSize) {
            $this->flushBuffer();
        }
    }

    /**
     * Writes the in-memory word counts to the database and resets the array.
     */
    protected function flushBuffer()
    {
        if (empty($this->data['word_count'])) {
            return;
        }

        // Filtering settings
        $min_word_length = max(\XF::options()->pth_minimum_word_length, 2);
        // The minimum use count is not enforced here: totals are unknown while
        // writing piecemeal. complete() cleans up low-use words at the end.

        // Prepare the stop word list
        if (!$this->stopWords) {
            $this->stopWords = array_flip(preg_split("/\r\n|\n|\r/", \XF::options()->pth_stop_words));
        }

        $insertRows = [];
        $existing_tags_synonyms = PTH::repo()->existingTagsOrSynonyms();

        foreach ($this->data['word_count'] as $word => $count) {
            // Basic filtering happens HERE, before insertion, to prevent database bloat
            if (gettype($word) !== "string") continue;
            if (strlen($word) < $min_word_length || strlen($word) > 100) continue;
            if (isset($this->stopWords[$word])) continue;

            $insertRows[] = [
                'word' => $word,
                'word_count' => count(explode(' ', $word)),
                'use_count' => $count,
                'added' => array_key_exists($word, $existing_tags_synonyms) ? 1 : 0
            ];
        }

        if (!empty($insertRows)) {
            $db = $this->app->db();
            // Bulk insert; if a word already exists, increment its count (ON DUPLICATE KEY UPDATE)
            foreach (array_chunk($insertRows, \XF::options()->pth_bulk_insert_chunk_size) as $insertChunk) {
                $db->insertBulk(PTH::table_potential_words, $insertChunk, false, "use_count = use_count + VALUES(use_count)");
            }
        }

        // Clear RAM: this is the line that keeps execute_data small
        $this->data['word_count'] = [];
    }

    protected function getChunks($words, $chunk_size)
    {
        if (!$this->stopWords) {
            $stop_words = preg_split("/\r\n|\n|\r/", \XF::options()->pth_stop_words);
            $this->stopWords = array_flip($stop_words);
        }
        $startIndex = 0;
        $maxStartIndex = count($words) - $chunk_size;
        while ($startIndex <= $maxStartIndex) {
            $chunk = array_slice($words, $startIndex, $chunk_size);
            $add = true;
            if (!\XF::options()->pth_multiword_allow_stop_words) {
                foreach ($chunk as $index => $word) {
                    if (strlen($word) < 2 || array_key_exists($word, $this->stopWords)) {
                        $startIndex += $index + 1;
                        $add = false;
                        break;
                    }
                }
            }
            if ($add) {
                $startIndex++;
                yield implode(' ', $chunk);
            }
        }
        return [];
    }

    public function complete(): JobResult
    {
        // Write any remaining buffered data to the database
        $this->flushBuffer();

        $db = $this->app->db();

        // 3. Remove words below the minimum use count from the database
        // (formerly done on the PHP side, now done via SQL)
        $min_use_count = max(\XF::options()->pth_min_use_count, 1);
        if ($min_use_count > 1) {
            $db->delete(PTH::table_potential_words, "use_count < ?", [$min_use_count]);
        }

        // Optional: if the table grows too large, keep only the 50,000 most used words.
        // This query can be heavy; leave it disabled unless the table needs trimming.
        /*
        $db->query("DELETE FROM " . PTH::table_potential_words . "
                    WHERE word NOT IN (
                        SELECT word FROM (
                            SELECT word FROM " . PTH::table_potential_words . "
                            ORDER BY use_count DESC LIMIT 50000
                        ) AS kept_words
                    ) AND ignored = 0");
        */

        return parent::complete();
    }

    public function getStatusMessage()
    {
        return "Counting words from thread titles (" . $this->data['start']['thread_id'] . ")";
    }

    public function canCancel()
    {
        return true;
    }

    public function canTriggerByChoice()
    {
        return true;
    }

    protected function getStatusType()
    {
    }
}
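
One design note on the flushBuffer() approach: the incremental writes only aggregate correctly because insertBulk() falls back to ON DUPLICATE KEY UPDATE, which presumes a unique index on word in the potential-words table. Assuming that index exists, each flush effectively executes something like this (the table name is a placeholder for whatever PTH::table_potential_words resolves to):

SQL:
-- Placeholder table name; the unique key on `word` lets use_count accumulate across flushes
INSERT INTO xf_potential_tag_words (word, word_count, use_count, added)
VALUES ('example', 1, 3, 0), ('example phrase', 2, 1, 0)
ON DUPLICATE KEY UPDATE use_count = use_count + VALUES(use_count);

Because totals accumulate in the database rather than in the job state, execute_data stays small no matter how many threads are processed.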
 