XF 1.5 internal_data/sitemaps was over 650 files

dvsDave

So, I got a disk space warning on my server and started looking into which folders were the biggest offenders. To my surprise, the sitemaps folder was HUGE (like 22 gigs). There were a bunch of old .gz files, so I deleted those. I then wiped out the whole folder, went to Tools->Rebuild Caches, and rebuilt the sitemap.

Four hours later... I had over 650 5 MB XML files and it was still rebuilding the tags for the sitemap. We used to have a system that auto-generated tags and, as a result, we have a LOT of tags across 35k discussions. When I woke up this morning, I had a 3 gig sitemap folder and the rebuild tool had timed out at some point.

So, I wiped the sitemap folder again, changed the XML Sitemap Generation option to exclude tags, then rebuilt the sitemap with the Rebuild Caches tool. This time it took only a minute and my sitemap folder was only about 3 megs.

I guess I want to know if this is normal? Should I reinstate the tags and just not worry about the size, or should I keep excluding the tags? Do the tags make any appreciable SEO difference or does Google ding me for having an insane sitemap folder?
 
It's actually only 270 tags.

So, just to clarify, I'm not talking about the /sitemap folder, but the /internal_data/sitemaps/ folder where I was seeing these crazy figures.
 
Oh hang on a minute... there's a lot of repetition in there.

Halloween appears 187 times.
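
For anyone who wants to check their own files for this kind of repetition, here is a minimal Python sketch that counts how often each URL appears in one sitemap file. It assumes the file follows the standard sitemaps.org `<urlset>`/`<url>`/`<loc>` layout; the function names and the sample URLs are made up for illustration and are not part of XenForo.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace used by the standard sitemap protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def count_locs(xml_text):
    """Return a Counter mapping each <loc> URL to how many times it appears."""
    root = ET.fromstring(xml_text)
    return Counter(
        loc.text.strip()
        for loc in root.iter(SITEMAP_NS + "loc")
        if loc.text
    )


def repeated(counts, min_count=2):
    """List (url, count) pairs that appear at least min_count times, most frequent first."""
    return [(url, n) for url, n in counts.most_common() if n >= min_count]


# Hypothetical sample data, not taken from the site in this thread.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/tags/halloween/</loc></url>
  <url><loc>http://example.com/tags/halloween/</loc></url>
  <url><loc>http://example.com/tags/lighting/</loc></url>
</urlset>"""

duplicates = repeated(count_locs(sample))
```

In a healthy tag sitemap, `duplicates` should come back empty, since each tag URL only needs to be listed once.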

I'm going to look at this in more detail, from the code. First:

Can you confirm the result of this query:
Code:
SELECT COUNT(*) FROM xf_tag WHERE use_count > 0

Also, can you confirm whether any add-ons are involved? Were the tags imported from another add-on? Or are the tags themselves part of an add-on, e.g. are some of these repeated tags pointing to content that belongs to an add-on? It shouldn't matter, but it's worth checking.
 
Sorry all, my 16-month-old just pushed my laptop off the table and now the keyboard won't work (typing this on my cell). This has happened before and it's a known issue on my ASUS laptop; I just have to open it up and reseat the keyboard connection. Will get the results of the query as soon as I can find the bizarre-size Torx driver I keep for just these occasions.
 
Back up and running! The query returned 267.

The biggest suspected culprit is VaultWiki. I just noticed a server error message:

Server Error Log
Error Info
ErrorException: Undefined index: user - vault/core/controller/ui/integrate/tag/xf.php:35
Generated By: Unknown Account, Today at 10:06 AM
Stack Trace
#0 /home/control/public_html/vault/core/controller/ui/integrate/tag/xf.php(35): XenForo_Application::handlePhpError(8, 'Undefined index...', '/home/control/p...', 35, Array)
#1 /home/control/public_html/vault/core/controller/ui/integrate/vw.php(242): vw_UI_Integrate_Tag_Controller_XF->get_stack(false)
#2 /home/control/public_html/vault/core/controller/ui/integrate/vw.php(66): vw_UI_Integrate_Controller->get_stack(false)
#3 /home/control/public_html/vault/core/controller/ui/integrate/vw.php(42): vw_UI_Integrate_Controller->setup()
#4 /home/control/public_html/library/vw/XenForo/CodeEventListener/Public.php(234): vw_UI_Integrate_Controller->integrate('<!DOCTYPE html>...')
#5 [internal function]: vw_XenForo_CodeEventListener_Public::front_controller_post_view(Object(XenForo_FrontController), '<!DOCTYPE html>...')
#6 /home/control/public_html/library/XenForo/CodeEvent.php(90): call_user_func_array(Array, Array)
#7 /home/control/public_html/library/XenForo/FrontController.php(183): XenForo_CodeEvent::fire('front_controlle...', Array)
#8 /home/control/public_html/index.php(13): XenForo_FrontController->run()
#9 {main}
Request State
array(3) {
["url"] => string(38) "http://www.controlbooth.com/tags/xl16/"
["_GET"] => array(0) {
}
["_POST"] => array(0) {
}
}

I've submitted a bug report to him: https://www.vaultwiki.org/issues/4443/

I also used to use cemzoo's sitemap generator (linked to the discussion, since the resource has been pulled), but I disabled that after I upgraded to 1.5 (forgot to turn it off when I went to 1.4).
 
I'm intrigued what would happen if you did the following:
  • Disable all add-ons
  • Disable all sitemap content types except Tags
  • Rebuild the sitemap manually from "Rebuild Caches" page
Theoretically, 267 tags would be completed in mere seconds. If it is completed quickly, without any add-ons enabled, then it might confirm that an add-on is responsible. Then you could keep trying it again with different add-ons enabled to confirm which is doing it.
 
So, I disabled VaultWiki and ran the rebuild sitemap tool with only tags enabled and that took 2 seconds.

I then re-enabled VaultWiki with only Tags enabled and that just started churning through data.

[Screenshot: tags_id_screenshot.webp]

I then disabled Tags in the sitemap generation settings and rebuilt again.

Took about 30 seconds this time.
 
Patch instructions here: https://www.vaultwiki.org/issues/4444/#note24299
Update: Patch 4.0.7 PL 1 released.
Fixes:
  • Tag Duplication Vulnerability
  • Template Expansion Vulnerability
  • Template Usage Vulnerability
  • Node Overload Vulnerability
All of these were related to Denial of Service (with the tag-duplication reported in this thread, an attacker didn't have to be involved).
Official disclosures should be forthcoming by week's end.
 
That patch and one other patch from VaultWiki did the trick. :)

Apparently this is what was happening:
Eventually tracked it down to VaultWiki breaking sitemap generation during the tags step. When it took down the server, it had created 43 GB of temp sitemap files, when that's normally < 10 MB. It would have continued to try to generate the temp files except it ran out of space. The sitemap log shows it created 8,202 files tracking 410+ million URLs on a site that has nowhere near that.

I just now implemented the fix for the 'Undefined Index: user' issue, hoping that would fix this as well but it didn't...

When I run the sitemap build with VaultWiki disabled, it works perfectly. But as soon as VW is enabled, it just perpetually churns out temp files.
 
It was a typo in the array keys: 'tag_id' vs. $tag_id.
Usually something like that just issues an E_NOTICE and doesn't bring whole servers down.
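
To show how a slip like that can multiply output instead of just raising a notice, here is a purely illustrative Python analogue. It is not VaultWiki's code, and the real bug may have worked differently; the row layout and function names are hypothetical. The buggy version records the literal string "tag_id" as the key instead of the value of the `tag_id` variable, so the "already seen" check never matches and every row is emitted again.

```python
def collect_buggy(rows):
    """De-duplication that silently fails because of a literal-string key."""
    seen = {}
    urls = []
    for row in rows:
        tag_id = row["tag_id"]
        if tag_id in seen:
            continue
        seen["tag_id"] = True  # BUG: literal string, should be the variable tag_id
        urls.append(row["url"])
    return urls


def collect_fixed(rows):
    """Same loop with the key taken from the variable, so each tag is emitted once."""
    seen = {}
    urls = []
    for row in rows:
        tag_id = row["tag_id"]
        if tag_id in seen:
            continue
        seen[tag_id] = True
        urls.append(row["url"])
    return urls


# Hypothetical rows: one tag joined against three pieces of content, plus one other tag.
rows = [
    {"tag_id": 1, "url": "/tags/halloween/"},
    {"tag_id": 1, "url": "/tags/halloween/"},
    {"tag_id": 1, "url": "/tags/halloween/"},
    {"tag_id": 2, "url": "/tags/lighting/"},
]

buggy_urls = collect_buggy(rows)
fixed_urls = collect_fixed(rows)
```

With these four rows, the buggy version emits four URLs and the fixed one emits two. If the input rows themselves scale with content rather than with tags, the output grows with them, which would be consistent with a few hundred tags producing millions of sitemap entries.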

Might I suggest a change to the sitemap builder so that it stops building the sitemap if it exceeds certain size limits? Even if a site really had 2B tags, I doubt they would want a 45 GB sitemap. Sometimes the admin won't notice (and uncheck, e.g., tags in the sitemap options) until it's too late.
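
A rough sketch of what such a guard could look like, in Python rather than PHP and not based on XenForo's actual sitemap builder: the writer tracks how many files and bytes it has produced and refuses to continue past a configured budget. The class name, exception name, and limits are all hypothetical.

```python
import os
import tempfile


class SitemapSizeLimitExceeded(RuntimeError):
    """Raised when a build would exceed its configured file or byte budget."""


class CappedSitemapWriter:
    """Writes sitemap files into a directory, stopping once a budget is reached."""

    def __init__(self, directory, max_total_bytes, max_files):
        self.directory = directory
        self.max_total_bytes = max_total_bytes
        self.max_files = max_files
        self.bytes_written = 0
        self.files_written = 0

    def write_file(self, name, data):
        # Check the budget before touching the disk so a runaway build stops early.
        if self.files_written + 1 > self.max_files:
            raise SitemapSizeLimitExceeded("file count limit reached")
        if self.bytes_written + len(data) > self.max_total_bytes:
            raise SitemapSizeLimitExceeded("total size limit reached")
        with open(os.path.join(self.directory, name), "wb") as fh:
            fh.write(data)
        self.files_written += 1
        self.bytes_written += len(data)


def demo():
    """Write one file within a 10-byte budget, then try a second that would exceed it."""
    with tempfile.TemporaryDirectory() as tmp:
        writer = CappedSitemapWriter(tmp, max_total_bytes=10, max_files=5)
        writer.write_file("sitemap-1.xml", b"123456")
        try:
            writer.write_file("sitemap-2.xml", b"123456")
        except SitemapSizeLimitExceeded:
            return writer.files_written, writer.bytes_written, True
        return writer.files_written, writer.bytes_written, False
```

The point of checking before each write is that the build fails loudly at a known ceiling, instead of filling the disk and taking the server down with it.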

I believe there are already protections like this in place for attachment storage (once the site has X GB of attachments, XenForo stops letting users upload them). EDIT: Actually, I can't find such an option in XenForo at a quick glance. I found it on my vBulletin test board, though. If it doesn't exist, might I suggest this as well?
 