Fixed New thread and post indexed on duplicate entry in RSS Feed

Camainer

Member
When processing an RSS feed that has duplicate entries with the same id, no duplicate thread is posted, but a new thread and post are indexed anyway. This happens every time the feed is processed.
The result is that the search index grows enormously (2 new index entries for every duplicate entry each time the feed is processed) and the post/thread ids end up far higher than they should be (e.g. on a forum with 5,000 threads and 60,000 posts, the ids have reached 115,000 and 170,000 respectively).

There are 2 problems in the XenForo_Model_Feed class that contribute to this:
  • _checkProcessedEntries(array $feedData, array $feed) only keeps track of the last entry index per id when checking for duplicates. If multiple entries share the same id, only the last one is removed from the list.
  • _insertFeedEntry(array $entryData, array $feedData, array $feed) still triggers indexing of the thread and post, and still increments the ids, despite the transaction being rolled back and no actual thread/post being stored.
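The first problem can be demonstrated in isolation. This is a minimal sketch (not XenForo code; the array shapes are simplified assumptions) of why only one of the duplicate entries gets removed: assigning `$ids[$entry['id']] = $i` overwrites the earlier index for the same id, so the later unset() can only ever remove the last duplicate.

```php
<?php
// Simplified feed data: entries 0 and 1 share the same id.
$entries = [
    0 => ['id' => 'thread-1'],
    1 => ['id' => 'thread-1'], // duplicate id
    2 => ['id' => 'thread-2'],
];

// Build the id => index map the way the original code does.
$ids = [];
foreach ($entries as $i => $entry) {
    $ids[$entry['id']] = $i; // 'thread-1' now maps to 1; index 0 is lost
}

// Simulate the "already processed" ids coming back from the database.
foreach (['thread-1', 'thread-2'] as $id) {
    if (isset($ids[$id])) {
        unset($entries[$ids[$id]]);
    }
}

// Entry 0 ('thread-1') survives and gets processed as if it were new.
var_export(array_keys($entries));
```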
One of the problematic feeds is https://forums.daybreakgames.com/ps2/index.php?forums/game-update-notes.73/index.rss
There are 2 entries with id http://forums.daybreakgames.com/ps2/index.php?threads/pts-patch-notes-3-11.216825/
(Note that these are at the bottom, so it is entirely possible that they will no longer be included in the feed soon)

While this problem is caused by broken feeds, and the feed publishers should fix their output, it would be nice if XenForo were able to deal with it without bloating its own database.
 
So I looked into this:
  • XenForo_Model_Feed::_checkProcessedEntries does not take duplicate entries in the original feed into account:
    • I've created a patch to fix this:
      Code:
      --- library/XenForo/Model/Feed.php
      +++ library/XenForo/Model/Feed.php
      @@ -399,7 +399,11 @@
       
               foreach ($feedData['entries'] AS $i => &$entry)
               {
      -            $ids[$entry['id']] = $i;
      +            if (!isset($ids[$entry['id']]))
      +            {
      +                $ids[$entry['id']] = [];
      +            }
      +            $ids[$entry['id']][] = $i;
       
                   $entry['hash'] = md5($entry['id'] . $entry['title'] . $entry['content_html']);
               }
      @@ -420,7 +424,10 @@
               {
                   if (isset($ids[$id]))
                   {
      -                unset($feedData['entries'][$ids[$id]]);
      +                foreach ($ids[$id] as $entryId)
      +                {
      +                    unset($feedData['entries'][$entryId]);
      +                }
                   }
               }
  • If the transaction rolls back, the new entry in the search index does not get rolled back
  • If the transaction rolls back, the increased thread id is not rolled back
Only the first problem can be fixed (perhaps the second as well, if InnoDB has evolved enough that it can be used for search too), but at least this will prevent the other 2 problems from occurring (at least in these cases).
 
This is fixed a bit differently, but fixed nonetheless. If we find a feed entry with an ID we've already seen, we'll just skip the subsequent entry entirely (which is normally what would happen anyway).
 