Not a bug XenForo Sitemap File

Garfield™

Member
Affected version
2.2.13
XenForo's sitemap files contain up to 50,000 URLs per file, which is not well regarded in terms of Google and SEO. Other platforms typically limit this to 300-500 URLs per file, and it's a pity that XenForo offers no option to do the same. On top of that, opening and serving a sitemap file with 50,000 URLs is a serious burden for small and medium-sized servers. I hope there will be a nice improvement in the next update.
 
The number of URLs in the sitemap has no effect on SEO. I've worked with many sites (including my own) that had far more than that.

It shouldn't impact anything other than the weakest of servers, but any server that can't serve a 50K sitemap is likely underpowered for that many topics.
 
The 50,000 URL figure is actually pretty popular and follows Google's own recommendations on best practices. The resulting file shouldn't be more than a few megabytes after compression, which isn't much larger than a typical image attachment. I wouldn't expect even a small server to struggle much with it, and if it did, I think you'd have bigger issues at hand.
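
Just as a back-of-the-envelope check (the bytes-per-entry and compression figures below are assumptions, not measurements):

# Rough size estimate for a 50,000-URL sitemap file.
# The per-entry size and gzip ratio are assumptions, not measured values.
URLS_PER_FILE = 50_000
BYTES_PER_ENTRY = 150   # <url><loc>...</loc><lastmod>...</lastmod></url>
GZIP_RATIO = 0.15       # repetitive XML compresses very well

uncompressed_mb = URLS_PER_FILE * BYTES_PER_ENTRY / 1_000_000
print(f"uncompressed: ~{uncompressed_mb:.1f} MB")              # ~7.5 MB
print(f"gzipped:      ~{uncompressed_mb * GZIP_RATIO:.1f} MB") # ~1.1 MB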
 

Earlier, the default value for this setting was 1000. But, we’ve observed that Google responds better to smaller pages in sitemaps. Thus, we’ve reduced the value to 200, which is our recommended setting.

That's just WordPress; I didn't need to check other platforms, because when I split my sitemap file I got better results, as you can see below:

 

Attachments

  • Screenshot_3.webp (27.9 KB)
I haven't personally encountered any issues with larger sitemaps, even on a micro-VPS. What issues were you encountering before?


XenForo pre-generates the sitemaps via the cron system at regular intervals, so the server has already done the work up front. We might reconsider if there were compelling evidence that lowering the limit helps, but, aside from server load issues (which shouldn't apply to XF), the linked blog post doesn't provide much data to support the claim that it makes any noticeable difference to Google, and the canonical sources all say 50,000 should be no issue:

https://www.sitemaps.org/protocol.html#index
https://developers.google.com/searc...sitemaps/build-sitemap#sitemap-best-practices
https://blogs.bing.com/webmaster/2014/06/09/sitemaps-best-practices-including-large-web-sites/
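
For what it's worth, what those documents describe is the sitemap-index approach: each sitemap file holds at most 50,000 URLs (and 50MB uncompressed), and an index file points at the individual files. A rough sketch of that chunking, purely for illustration (this is not XenForo's actual code, and example.com is a placeholder):

# Illustrative sketch of the sitemaps.org index approach; not XenForo's code.
# Splits a URL list into files of at most 50,000 entries and writes an index
# file that references each chunk.
from xml.sax.saxutils import escape

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
LIMIT = 50_000  # per-file cap from the protocol

def write_sitemaps(urls, base_url="https://example.com/sitemap"):
    chunks = [urls[i:i + LIMIT] for i in range(0, len(urls), LIMIT)]
    for n, chunk in enumerate(chunks, start=1):
        with open(f"sitemap-{n}.xml", "w", encoding="utf-8") as f:
            f.write(f'<?xml version="1.0" encoding="UTF-8"?>\n<urlset xmlns="{NS}">\n')
            for url in chunk:
                f.write(f"  <url><loc>{escape(url)}</loc></url>\n")
            f.write("</urlset>\n")
    # Index file pointing at each chunk.
    with open("sitemap.xml", "w", encoding="utf-8") as f:
        f.write(f'<?xml version="1.0" encoding="UTF-8"?>\n<sitemapindex xmlns="{NS}">\n')
        for n in range(1, len(chunks) + 1):
            f.write(f"  <sitemap><loc>{base_url}-{n}.xml</loc></sitemap>\n")
        f.write("</sitemapindex>\n")

write_sitemaps([f"https://example.com/threads/{i}/" for i in range(120_000)])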
 
I've investigated this further upon reading this thread from @dethfire, who helpfully noted that Chrome appeared to run out of memory when trying to load his sitemap. Sure enough, Chrome takes over 2GB of memory to load https://xenforo.com/community/sitemap-1.xml, despite being a 6.6MB file. Other browsers don't fare much (if at all) better. So the performance issues appear to be client-side and not server-side.

As far as I know, Google uses headless Chrome on commodity hardware for most crawling. My working theory is that, if Google is using Chrome to parse the sitemap, the overhead of invoking the DOM can, in fact, cause issues. If that's true, we might be wrong to trust the official documentation and it could be worth revising the limit.
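
If that theory holds, the cost is in building a full DOM for the whole document rather than in the XML itself; a streaming parser reads the same file in near-constant memory. A quick illustration of the difference in approach (Python rather than anything Chrome actually does):

# Streaming parse of a sitemap: memory stays roughly flat regardless of size,
# unlike loading the whole document into a DOM. Illustration only, not Chrome.
import xml.etree.ElementTree as ET

URL_TAG = "{http://www.sitemaps.org/schemas/sitemap/0.9}url"

def count_urls(path):
    count = 0
    for _event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == URL_TAG:
            count += 1
        elem.clear()  # discard the element instead of keeping the full tree
    return count

# count_urls("sitemap-1.xml")  # e.g. a downloaded copy of a large sitemap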
 
It's been my understanding that they use headless Chrome only for crawls requiring rendering, such as those involved in the page-speed metrics used in search ranking, automatic insertion of AdSense ads, and determining whether non-auto ads are above the fold.

You'll see the difference in web logs. "Plain" Googlebot will just have Mozilla 5.0 in the user agent. The Chrome Googlebot will include the phone or tablet device id like this:

Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko)
Chrome/114.0.5735.179 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

And desktop similar to this:
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/114.0.5735.179 Safari/537.36

Here's the plain Googlebot for robots.txt and sitemap fetches:
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
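
If anyone wants to check their own logs, here's a quick-and-dirty split based on the strings above (my own heuristic, not anything official from Google; spoofed user agents would still need a reverse-DNS check):

# Rough heuristic for splitting Googlebot log entries, based on the samples
# above. Not an official classification; spoofing requires a reverse-DNS check.
def classify_googlebot(user_agent: str) -> str:
    if "Googlebot" not in user_agent:
        return "not Googlebot"
    if "Chrome/" in user_agent:
        return "Googlebot (rendering crawl)"   # desktop or smartphone Chrome UA
    return "Googlebot (plain fetch)"           # e.g. robots.txt and sitemaps

print(classify_googlebot(
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
))  # -> Googlebot (plain fetch)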
 
It did seem ridiculously excessive to invoke headless Chrome just to parse some XML, but I wasn't sure if it was just inherent to their crawling infrastructure at this point. We had run into issues with some other Google tools a few years ago, which wound up being the result of headless Chrome crashing due to a bug on their end.

Back to the drawing board I guess :)
 
Actually, I didn't originally mention a problem with accessing the sitemap file, but since it has come up: on forums where I changed the domain, I did have a problem with the 50,000-URL sitemap. Google seems to look at it and ask: how can a brand-new domain have 50,000 URLs in a single sitemap page?

My main point is that, in terms of Google and SEO, I got much better results on platforms whose sitemap files contain fewer URLs. In XenForo, the number of URLs per sitemap file should at least be left to user preference.
 
It did seem ridiculously excessive to invoke headless Chrome just to parse some XML, but I wasn't sure if it was just inherent to their crawling infrastructure at this point. We had run into issues with some other Google tools a few years ago, which wound up being the result of headless Chrome crashing due to a bug on their end.

Back to the drawing board I guess :)

Google has a bad history of incomplete, outdated and vague documentation.

The main reason I know about these bots is I created a comprehensive robots.txt management tool that can be used for any type of site since it's not XF specific. It allows me to define different sets of robots.txt files that are sent based on the user-agent string, IP address, or IP-range. Basically, it's a dynamic robots.txt generator based on what is fetching. It also supports defining the crawl rate and auto-banning unknown bots in .htaccess when they violate robots.txt.

Over many years I've cataloged hundreds of them, including categorizing what each is used for.
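
The general idea, boiled down to a toy sketch (the real tool is far more involved; the bot names, rules, and domain here are just placeholders):

# Toy sketch of a dynamic robots.txt endpoint: different rules per user agent.
# The rule sets, bot list, and domain are placeholders, not the real tool's data.
from wsgiref.simple_server import make_server

RULES = {
    "known":   "User-agent: *\nDisallow: /admin/\nSitemap: https://example.com/sitemap.xml\n",
    "unknown": "User-agent: *\nDisallow: /\n",   # unrecognised bots get nothing
}

KNOWN_BOTS = ("Googlebot", "bingbot")  # placeholder list

def app(environ, start_response):
    ua = environ.get("HTTP_USER_AGENT", "")
    body = RULES["known"] if any(bot in ua for bot in KNOWN_BOTS) else RULES["unknown"]
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [body.encode("utf-8")]

if __name__ == "__main__":
    # In practice the web server would rewrite /robots.txt to this handler.
    make_server("", 8000, app).serve_forever()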
 