Fixed New thread and post indexed on duplicate entry in RSS Feed

Camainer

Member
When processing an RSS feed that has duplicate entries with the same id, no duplicate thread is posted, but a new thread and post are indexed anyway. This happens every time the feed is processed.
The result is that the search index grows enormously (2 new index entries for every duplicate entry each time the feed is processed) and the post/thread ids end up far higher than they should be (e.g. on a forum with 5,000 threads and 60,000 posts, the ids have reached 115,000 and 170,000 respectively).

There are 2 problems in the XenForo_Model_Feed class that contribute to this:
  • _checkProcessedEntries(array $feedData, array $feed) only keeps track of the last entry index per id when checking for duplicates. If multiple entries share the same id, only the last one is removed from the list.
  • _insertFeedEntry(array $entryData, array $feedData, array $feed) still triggers indexing of the thread and post, and still increments the ids, despite the transaction being rolled back and no actual thread/post being stored.
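The first problem can be demonstrated in isolation. This is a minimal sketch (not XenForo code; the array shapes are simplified assumptions) of why only one of the duplicate entries gets removed: assigning `$ids[$entry['id']] = $i` overwrites the earlier index for the same id, so the later unset() can only ever remove the last duplicate.

```php
<?php
// Simplified feed data: entries 0 and 1 share the same id.
$entries = [
    0 => ['id' => 'thread-1'],
    1 => ['id' => 'thread-1'], // duplicate id
    2 => ['id' => 'thread-2'],
];

// Build the id => index map the way the original code does.
$ids = [];
foreach ($entries as $i => $entry) {
    $ids[$entry['id']] = $i; // 'thread-1' now maps to 1; index 0 is lost
}

// Simulate the "already processed" ids coming back from the database.
foreach (['thread-1', 'thread-2'] as $id) {
    if (isset($ids[$id])) {
        unset($entries[$ids[$id]]);
    }
}

// Entry 0 ('thread-1') survives and gets processed as if it were new.
var_export(array_keys($entries));
```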
One of the problematic feeds is https://forums.daybreakgames.com/ps2/index.php?forums/game-update-notes.73/index.rss
There are 2 entries with id http://forums.daybreakgames.com/ps2/index.php?threads/pts-patch-notes-3-11.216825/
(Note that these are at the bottom, so it is entirely possible that they will no longer be included in the feed soon)

While this problem is caused by broken feeds, and the feed publishers should fix their output, it would be nice if XenForo were able to deal with it without bloating its own database.
 
So I looked into this:
  • XenForo_Model_Feed::_checkProcessedEntries does not take duplicate entries in the original feed into account:
    • I've created a patch to fix this:
      Code:
      --- library/XenForo/Model/Feed.php
      +++ library/XenForo/Model/Feed.php
      @@ -399,7 +399,11 @@
       
               foreach ($feedData['entries'] AS $i => &$entry)
               {
      -            $ids[$entry['id']] = $i;
      +            if (!isset($ids[$entry['id']]))
      +            {
      +                $ids[$entry['id']] = [];
      +            }
      +            $ids[$entry['id']][] = $i;
       
                   $entry['hash'] = md5($entry['id'] . $entry['title'] . $entry['content_html']);
               }
      @@ -420,7 +424,10 @@
               {
                   if (isset($ids[$id]))
                   {
      -                unset($feedData['entries'][$ids[$id]]);
      +                foreach ($ids[$id] as $entryId)
      +                {
      +                    unset($feedData['entries'][$entryId]);
      +                }
                   }
               }
  • If the transaction rolls back, the new entry in the search index does not get rolled back
  • If the transaction rolls back, the increased thread id is not rolled back
Only the first problem can be fixed (perhaps the second as well, if InnoDB has evolved enough that it can be used for search too), but at least this will prevent the other 2 problems from occurring (at least in these cases).
 
This is fixed a bit differently, but fixed nonetheless. If we find a feed entry with an ID we've already seen, we'll just skip the subsequent entry entirely (which is normally what would happen anyway).
 