Fixed [1.4] Sitemap invalid XML. Bug in the sitemap generator?

Stuart Wright

Well-known member
Google Webmaster Tools reports 3 parsing errors with my sitemap.
I picked one at random
www.avforums.com/sitemap.php?c=17
downloaded and unzipped the gzip and submitted the xml to http://validator.w3.org/
It reported unexpected end tags for /loc and /url on lines 34528 and 45768.

Line 34528 reads
Code:
ssories-unlocked-jailbroken.1010808/</loc><lastmod>2009-06-05T13:46:10+00:00</lastmod></url>
The previous and next lines are fine, so the start of line 34528 has been truncated.
The thread in question is here:
https://www.avforums.com/threads/ap...-all-accessories-unlocked-jailbroken.1010808/

Line 45768 reads
Code:
c>https://www.avforums.com/threads/unlocked-which-network.1024755/</loc><lastmod>2009-06-25T15:10:51+00:00</lastmod></url>
The previous and next lines are fine, so the start of line 45768 has been truncated.

I checked another of the bad files and it had more truncated lines of a similar nature.
I'm wondering whether there is an odd character in the title field of the database. This thread will have been imported from vBulletin. Can't tell from looking at it via phpMyAdmin, though.
Is this a bug in the sitemap generator?
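
For anyone wanting to repeat the check above without the W3C validator, here is a minimal sketch that decompresses a sitemap file and lists the lines where the XML parser complains. It assumes the zlib and SimpleXML extensions are available; the function name findSitemapXmlErrors is hypothetical and not part of XenForo.

```php
<?php
// Hypothetical helper (not part of XenForo): list XML well-formedness
// errors in a gzipped sitemap, together with the offending line text.
function findSitemapXmlErrors($gzPath)
{
    // gzfile() decompresses the file and returns it as an array of lines.
    $lines = gzfile($gzPath);
    if ($lines === false) {
        throw new RuntimeException("Could not read $gzPath");
    }

    // Collect parser errors instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();
    simplexml_load_string(implode('', $lines));

    $errors = array();
    foreach (libxml_get_errors() as $error) {
        $index = $error->line - 1; // libxml line numbers are 1-based
        $errors[] = array(
            'line' => $error->line,
            'message' => trim($error->message),
            'text' => isset($lines[$index]) ? rtrim($lines[$index]) : '',
        );
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $errors;
}
```

Calling findSitemapXmlErrors() on a downloaded .xml.gz file and printing the result should show entries like the truncated lines quoted above, if the file has the same problem.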
 
If anything, this is an issue in the compression. I'm not sure why this would randomly happen here. If you rebuild the sitemap manually, can you reproduce it?
 
Can you download the sitemap files directly from the server and see if they have the same problem? (internal_data/sitemaps -- based on the latest error, it'd be the sitemap-#########-23.xml.gz file)
 
#23 does not have errors, but #27 does and both versions are identical, with the same error in both methods of getting it. Which implies it's the building of the sitemap rather than whatever sitemap.php does to it.

[Edit - Chris D wrote an addon which includes our editorial content into the sitemap. I disabled this and ran the rebuild. While #27 is now ok, #13, which gave us errors before, still has errors.]
 
In library/Model/Sitemap.php, can you try replacing:
Code:
gzwrite($compressedFile, fread($readFile, 8192));
with:
Code:
gzwrite($compressedFile, fread($readFile, 524288));
It appears that it's the gzipping that isn't working correctly. I was trying to isolate any manipulation done by the web server, but it doesn't look like that's interfering. If the gzip itself is failing, that's far from ideal, as then the functions simply can't be trusted. However, a larger block size may help.
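
Whatever block size is used, the copy loop can also check what gzwrite() reports and verify the result afterwards. The following is a minimal sketch of that idea, not XenForo's actual code: the function names gzipFileChecked and md5OfGzip are hypothetical, and it assumes the zlib and hash extensions are loaded.

```php
<?php
// Hypothetical helper: MD5 of the decompressed contents of a .gz file.
function md5OfGzip($gzPath)
{
    $gz = gzopen($gzPath, 'rb');
    if ($gz === false) {
        throw new RuntimeException("Could not open $gzPath");
    }
    $ctx = hash_init('md5');
    while (!gzeof($gz)) {
        $data = gzread($gz, 8192);
        if ($data === false) {
            gzclose($gz);
            throw new RuntimeException("Read error in $gzPath");
        }
        hash_update($ctx, $data);
    }
    gzclose($gz);
    return hash_final($ctx);
}

// Hypothetical helper: gzip $sourcePath into $destPath in chunks,
// checking every write and then comparing checksums.
function gzipFileChecked($sourcePath, $destPath, $chunkSize = 8192)
{
    $readFile = fopen($sourcePath, 'rb');
    if ($readFile === false) {
        throw new RuntimeException("Could not open $sourcePath");
    }
    $compressedFile = gzopen($destPath, 'wb');
    if ($compressedFile === false) {
        fclose($readFile);
        throw new RuntimeException("Could not open $destPath");
    }

    $error = null;
    while ($error === null && !feof($readFile)) {
        $chunk = fread($readFile, $chunkSize);
        if ($chunk === false) {
            $error = 'read failed';
        } elseif ($chunk !== '') {
            // gzwrite() returns the number of uncompressed bytes written.
            if (gzwrite($compressedFile, $chunk) !== strlen($chunk)) {
                $error = 'short or failed gzwrite';
            }
        }
    }
    fclose($readFile);
    gzclose($compressedFile);

    if ($error !== null) {
        throw new RuntimeException("Compression failed: $error");
    }
    // Round trip: the decompressed output must match the source exactly.
    if (md5OfGzip($destPath) !== md5_file($sourcePath)) {
        throw new RuntimeException('Checksum mismatch after compression');
    }
}
```

If gzwrite() were silently dropping data, this would surface as an exception at build time rather than as a parsing error reported later by Google.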
 
Did as you instructed and rebuilt the sitemap and there is still a similar error in #13 http://www.avforums.com/sitemap.php?c=13
 
That certainly implies that gzwrite is randomly failing. The fact that this is sometimes not working makes me think there's an underlying bug there (not thread safe?).

In the meantime, you can disable gzip compression of your sitemaps by replacing:
Code:
$canCompress = function_exists('gzopen');
with:
Code:
$canCompress = false;
in library/XenForo/Deferred/Sitemap.php.
 
Done and rebuilding the sitemap. Could this relate to the version of PHP, maybe? https://www.avforums.com/info.php
5.4.33 - I could ask Tim to upgrade us.
 
Would you mind trying this test script? (I can run it for you but I'd need access.) It doesn't depend on XF so you can put it where you like; however, it does have two variables that will need to be modified at the top:

$source needs to point to one of the uncompressed sitemap XML files (internal_data/sitemap/...). You can copy the files out if you like, but the script doesn't actually write to them.

$dest is the file it writes to. This needs to be writable by the PHP user (you could probably write it to /tmp without issue).

Run the script (CLI or via a browser). After the first line, is it just dots or is there a "failed" comment? If you run it multiple times, is the md5 hash listed at the top the same?
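
The test script itself isn't reproduced in the thread. As a rough illustration only, a test along the lines described above could repeatedly gzip the source file and compare checksums. Everything in this sketch is an assumption: the function name runGzipRoundTrips, the number of runs, the chunk size and the exact output format are hypothetical.

```php
<?php
// Hypothetical sketch of a gzip round-trip test. It only reads $source
// and writes the compressed copy to $dest on every run.
function runGzipRoundTrips($source, $dest, $runs = 100)
{
    $expected = md5_file($source);
    if ($expected === false) {
        throw new RuntimeException("Could not read $source");
    }
    echo "Expected md5: $expected\n";

    $failures = 0;
    for ($i = 0; $i < $runs; $i++) {
        $in = fopen($source, 'rb');
        $out = gzopen($dest, 'wb');
        if ($in === false || $out === false) {
            throw new RuntimeException('Could not open source or destination');
        }
        // Same chunked copy pattern as discussed earlier in the thread.
        while (!feof($in)) {
            $chunk = fread($in, 8192);
            if ($chunk !== false && $chunk !== '') {
                gzwrite($out, $chunk);
            }
        }
        fclose($in);
        gzclose($out);

        // Decompress what was just written and compare checksums.
        $actual = md5(implode('', gzfile($dest)));
        if ($actual === $expected) {
            echo '. ';
        } else {
            echo "failed ($actual) ";
            $failures++;
        }
    }
    echo "\n";

    return $failures;
}
```

It would be called with the two paths mentioned above, for example runGzipRoundTrips($source, $dest). A line of only dots means every round trip reproduced the original file.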
 

Thanks, Mike.
I ran it on 4 different files which Google Webmaster tools said it didn't like.
Here are the results:
#7
Expected md5: cf9ea79fa80b912764afa1c5e83ab78f (0.19372487068176)
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

#13
Expected md5: b8227b1dc3863d9ed6b04877a0ced64b (0.12942790985107)
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

#27
Expected md5: 21f400fd41d6cfa789cd384f5a9158da (0.26321911811829)
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

#35
Expected md5: 1d41f401f50e6dc173cd914d9b962e11 (0.096315860748291)
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
The MD5 did not change when I re-ran it for the same file.
 
I really don't see why this is happening. There are really only two situations I see:
  • Race condition causing multiple sitemap builds to happen simultaneously. The deferred job system is designed to prevent this though. Rebuilding it manually should definitely not have this.
  • The server is lying about writing the file out. Should be unlikely, but I suppose it could be possible.
The strings that are appearing when the file is broken are in the middle of strings that we're writing so it shouldn't be a generation issue in the code.

To do any more debugging, I'd probably need to get access (ACP access to rebuild it and file access to test changes).
 
I have made changes relating to this (file locking and flushing) and it seems to work.

I'm not sure why this was happening to Stuart, as the specs say that writing to a file in append mode is an atomic operation (POSIX requirement, from what I understand), so writing half a line is confusing. More correctly, it looks like the write goes through but then the next open/read doesn't actually see that write. This even happened when I was debugging via error_log(). The lock/flush on the sitemap file does seem to resolve this at least.
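
The actual change isn't shown in the thread. As a minimal sketch of the general lock-and-flush pattern in PHP, assuming an append-only file (the function name appendLocked is hypothetical and this is not the XenForo implementation):

```php
<?php
// Hypothetical helper: append $data to $path while holding an exclusive
// lock, and flush PHP's buffer before the lock is released.
function appendLocked($path, $data)
{
    $fp = fopen($path, 'ab');
    if ($fp === false) {
        throw new RuntimeException("Could not open $path");
    }
    // Block until no other cooperating process holds the lock.
    if (!flock($fp, LOCK_EX)) {
        fclose($fp);
        throw new RuntimeException("Could not lock $path");
    }

    $written = fwrite($fp, $data);
    $flushed = fflush($fp); // push buffered output to the file

    flock($fp, LOCK_UN);
    fclose($fp);

    if ($written !== strlen($data) || !$flushed) {
        throw new RuntimeException("Incomplete write to $path");
    }

    return $written;
}
```

Note that flock() is advisory: it only helps if every writer of the file takes the same lock.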
 