Not a bug Gallery appears to be always writing new entries to xf_search table rather than re-using cache

jeffwidman

Active member
I admin a series of sites for someone. On one of the sites the xf_search table is blowing up: it's 17.4GB by itself, which makes no sense because the site isn't that highly trafficked.
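For anyone wanting to check this on their own install, a query along these lines against information_schema reports per-table sizes (illustrative only; substitute your own schema name):
Code:
SELECT 
    table_name,
    ROUND((data_length + index_length) / 1024 / 1024 / 1024, 1) AS size_gb
FROM
    information_schema.tables
WHERE
    table_schema = 'mavicpilots_forum_xf'
ORDER BY (data_length + index_length) DESC;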

I dug a little deeper:
Code:
SELECT 
    search_type,
    COUNT(*)
FROM
    mavicpilots_forum_xf.xf_search
GROUP BY search_type
returns
Code:
''                       542
'new-posts'              539
'new-profile-posts'   197726
'post'                   309
'recent-posts'        407712
'thread'                 110
'unanswered-threads'      16
'user'                    31
'xengallery_media'    986352

The last entry in particular sticks out: why does xengallery_media have nearly 1M entries? Yet the gallery homepage for the site shows a media count of only 597 uploads. Something is off there.

Manually inspecting a few records of that search type shows the queries all appear to have been run by user_id 0, which I assume is a guest user, and the search_query column is empty. Double-checking how many of these entries came from logged-in users:
Code:
SELECT 
    COUNT(*)
FROM
    mavicpilots_forum_xf.xf_search
WHERE
    search_type = 'xengallery_media'
        AND user_id != 0
returns 4.

So that means nearly 1M entries were generated in the last 24 hours by guest users.
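For reference, the age of those rows can be sanity-checked via search_date, which on a stock install is a Unix timestamp (assuming your schema matches mine):
Code:
SELECT 
    FROM_UNIXTIME(MIN(search_date)) AS oldest,
    FROM_UNIXTIME(MAX(search_date)) AS newest
FROM
    mavicpilots_forum_xf.xf_search
WHERE
    search_type = 'xengallery_media';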

I thought the xf_search table was intended as a cache, so it should re-use the result when possible... not log every hit. Additionally, the query_hash value is the same for all these entries.
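That last point is easy to verify: grouping on query_hash should show a single hash accounting for nearly all the rows if they really are duplicates of one lookup:
Code:
SELECT 
    query_hash,
    COUNT(*)
FROM
    mavicpilots_forum_xf.xf_search
WHERE
    search_type = 'xengallery_media'
GROUP BY query_hash
ORDER BY COUNT(*) DESC;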

I'm going to file a private support ticket to see if you want to log in to this particular site and debug further, but I wanted to post a public bug report so that it can be tracked.
 
@Chris D looked into this and discovered the entries appear to be Googlebot hitting the find-new page repeatedly.

This brings up a few follow-up questions in my head; I'm posting them here so that the answers are available to others in the future.

Maybe I'm misunderstanding the purpose of the xf_search table, but I thought it was intended to cache the output of a search so that constant traffic on the find-new page would serve cached results. If that's true, why are we getting so many entries? I thought this table was expressly designed to deal with just this problem.

If you were caching results per user, then it would make sense that each user generates new rows (although I don't think such a cache would have a high hit rate), but these are all logged-out users who are getting the same result.

Otherwise, if it's effectively just a log of searches, then I don't understand the design goal.

Why not log the searches to a text file?

If you need to keep an active count of searches to display statistics, then why not just increment counters rather than writing an entire row every time?

Ideally increment a counter in redis and then dump it every hour to MySQL, although I know you can't expect most folks to be running redis...
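As a sketch of what I mean (the table here is made up, not anything XenForo ships): a single upsert per hit keeps one row per search type per day instead of one row per request:
Code:
-- hypothetical counter table with PRIMARY KEY (search_type, hit_date)
INSERT INTO search_hit_counts (search_type, hit_date, hits)
VALUES ('xengallery_media', CURDATE(), 1)
ON DUPLICATE KEY UPDATE hits = hits + 1;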

I know you guys are a smart crew, so what am I missing?
 
The xf_search table does work like that for traditional searches. If you search for a term, go back, and search for the same term again, you should notice the search ID doesn't increment.
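Roughly speaking, the re-use check is along these lines before a new row is inserted (illustrative only, not the exact code path):
Code:
SELECT search_id
FROM xf_search
WHERE query_hash = ?
    AND search_type = ?
    AND user_id = ?
    AND search_date > ? -- only sufficiently recent searches are re-used
LIMIT 1;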

The Find new system utilises the xf_search table for a different reason: it ensures that when you get your paginated list of find new results, you are working with exactly the same set of results on page 1 as you are on page 10. If we didn't do that, you'd read page 1, go to page 2, and on a really busy board half of the results from page 1 would have shifted to page 2, because page 1 now contains a load of new content.
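Put differently, the result set is captured once as a snapshot and every page reads from that snapshot rather than re-running the lookup. Conceptually (assuming the serialized ID list lives in the search_results column, and noting the ID below is made up):
Code:
-- one stored row per find-new run; every page slices the same stored
-- list, so new content can't shift results between pages
SELECT search_results
FROM xf_search
WHERE search_id = 12345;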

Bear in mind that a find new lookup isn't really a search at all, so it's not possible to say conclusively that two lookups are going to display the same results. The whole idea is that we show the newest content, so those results can't meaningfully be cached across requests, even for the same user.

So, as I said, using the search table in this way does have its uses, and unless Googlebot is accessing it 10-20 times per second you're really never going to experience any issues. It's worth noting, perhaps, that the site in question doesn't even seem to be experiencing any issues apart from a few (!) extra records in a table. This table is pruned periodically, or if you're otherwise concerned it's safe for you to truncate it if you wish (as long as you have no other non-default functionality using it).
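If you do decide to clear it down, it's a one-liner (again: stock setups only, since anything referencing stored search IDs will lose them):
Code:
TRUNCATE TABLE xf_search;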

We have changed things slightly in XF2. We no longer piggyback on the search system to achieve the same effect, and we do have an element of caching here. Even so, due to the nature of what find new is trying to do, the cache lifetime is actually only 15 seconds. It would only partially mitigate what Googlebot is doing here, so anything you can do to mitigate that directly is still the recommended solution.
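As for mitigating it directly, the usual route is to ask well-behaved crawlers not to hit find new at all via robots.txt; something like the below, though the exact path depends on your URL configuration, so treat it as a template:
Code:
User-agent: *
Disallow: /find-new/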
 