Fixed [1.4] Sitemap invalid XML. Bug in the sitemap generator?

Discussion in 'Resolved Bug Reports' started by Stuart Wright, Oct 3, 2014.

  1. Stuart Wright

    Stuart Wright Well-Known Member

    Google Webmaster Tools reports 3 parsing errors with my sitemap.
    I picked one at random
    www.avforums.com/sitemap.php?c=17
    downloaded and unzipped the gzip and submitted the xml to http://validator.w3.org/
    It reported unexpected end tags for /loc and /url on lines 34528 and 45768.

    Line 34528 reads
    Code:
    ssories-unlocked-jailbroken.1010808/</loc><lastmod>2009-06-05T13:46:10+00:00</lastmod></url>
    The previous and next lines are fine, so the start of line 34528 has been truncated.
    The thread in question is here:
    https://www.avforums.com/threads/ap...-all-accessories-unlocked-jailbroken.1010808/

    Line 45768 reads
    Code:
    c>https://www.avforums.com/threads/unlocked-which-network.1024755/</loc><lastmod>2009-06-25T15:10:51+00:00</lastmod></url>
    The previous and next lines are fine, so the start of line 45768 has been truncated.

    I checked another of the bad files and it had more truncated lines of a similar nature.
    I'm wondering whether there is an odd character in the title field of the database. This thread will have been imported from vBulletin. I can't tell from looking at it via phpMyAdmin, though.
    Is this a bug in the sitemap generator?
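The manual check described above (unzip the file, then validate the XML) can also be done locally. The following is a minimal sketch, assuming PHP's zlib and DOM extensions are available; the function name `findSitemapXmlErrors` is hypothetical and is not part of XenForo.

```php
<?php
// Hypothetical helper (not XenForo code): decompress a sitemap file and
// return one "line N: message" string per XML parsing error.
// An empty array means the document is well-formed.
function findSitemapXmlErrors(string $gzPath): array
{
    // gzopen() also reads plain, uncompressed files transparently.
    $handle = gzopen($gzPath, 'rb');
    if ($handle === false) {
        return ["could not open $gzPath"];
    }

    $xml = '';
    while (!gzeof($handle)) {
        $chunk = gzread($handle, 8192);
        if ($chunk === false) {
            gzclose($handle);
            return ["read error in $gzPath"];
        }
        $xml .= $chunk;
    }
    gzclose($handle);

    if ($xml === '') {
        return ["$gzPath is empty"];
    }

    // Collect libxml errors in memory instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $doc = new DOMDocument();
    $doc->loadXML($xml);

    $errors = [];
    foreach (libxml_get_errors() as $error) {
        $errors[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $errors;
}
```

Calling `findSitemapXmlErrors()` on each downloaded sitemap file and printing the result should point at the same kind of truncated lines the W3C validator reported.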
     
    Last edited: Oct 3, 2014
  2. Mike

    Mike XenForo Developer Staff Member

    If anything, this is an issue in the compression. I'm not sure why this would randomly happen here. If you rebuild the sitemap manually, can you reproduce it?
     
  3. Stuart Wright

    Stuart Wright Well-Known Member

    Just rebuilt the sitemap and there is a similar error in the same file that gave an error last time.
    www.avforums.com/sitemap.php?c=23
     
  4. Stuart Wright

    Stuart Wright Well-Known Member

    @Mike should I raise a ticket for this? I feel it's somewhat urgent that we have a working sitemap.
     
  5. Mike

    Mike XenForo Developer Staff Member

    Can you download the sitemap files directly from the server and see if they have the same problem? (internal_data/sitemaps -- based on the latest error, it'd be the sitemap-#########-23.xml.gz file)
     
  6. Stuart Wright

    Stuart Wright Well-Known Member

    #23 does not have errors, but #27 does. The copy downloaded from the server and the copy served by sitemap.php are identical, with the same error in both. That implies the problem is in the building of the sitemap rather than in whatever sitemap.php does to it.

    [Edit - Chris D wrote an addon which includes our editorial content in the sitemap. I disabled this and ran the rebuild. While #27 is now ok, #13, which gave us errors before, still has errors.]
     
    Last edited: Oct 8, 2014
  7. Mike

    Mike XenForo Developer Staff Member

    In library/Model/Sitemap.php, can you try replacing:
    Code:
    gzwrite($compressedFile, fread($readFile, 8192));
    with:
    Code:
    gzwrite($compressedFile, fread($readFile, 524288));
    It appears to be the gzipping that isn't working correctly. I was trying to rule out any manipulation done by the web server, and it doesn't look like that's interfering. If the gzip itself is failing, that's far from ideal, as the functions then simply can't be trusted. However, a larger block size may help.
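One way to find out whether the compression step is dropping data is to check the return value of every write. The following is a minimal sketch of such a chunked copy, not XenForo's actual code; the function name `gzipFileChecked` and its parameters are hypothetical, and only the variable names `$readFile` and `$compressedFile` are borrowed from the snippet above.

```php
<?php
// Hypothetical helper (not XenForo code): copy $sourcePath into a gzip file
// at $destPath in $blockSize-byte chunks and verify every write.
function gzipFileChecked(string $sourcePath, string $destPath, int $blockSize = 8192): void
{
    $readFile = fopen($sourcePath, 'rb');
    if ($readFile === false) {
        throw new RuntimeException("Cannot open $sourcePath for reading");
    }

    $compressedFile = gzopen($destPath, 'wb9');
    if ($compressedFile === false) {
        fclose($readFile);
        throw new RuntimeException("Cannot open $destPath for writing");
    }

    try {
        while (!feof($readFile)) {
            $chunk = fread($readFile, $blockSize);
            if ($chunk === false) {
                throw new RuntimeException("Read error on $sourcePath");
            }
            if ($chunk === '') {
                continue; // nothing read on this pass; feof() ends the loop
            }

            // gzwrite() returns the number of uncompressed bytes it accepted.
            $written = gzwrite($compressedFile, $chunk);
            if ($written !== strlen($chunk)) {
                throw new RuntimeException("Short gzip write to $destPath");
            }
        }
    } finally {
        fclose($readFile);
        gzclose($compressedFile);
    }
}
```

If this version throws on a file that the unchecked loop compresses silently, the write itself is failing; if it never throws but the output is still truncated, the problem is more likely in how the uncompressed file was produced.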
     
  8. Stuart Wright

    Stuart Wright Well-Known Member

    Did as you instructed and rebuilt the sitemap and there is still a similar error in #13 http://www.avforums.com/sitemap.php?c=13
     
  9. Mike

    Mike XenForo Developer Staff Member

    That certainly implies that gzwrite is randomly failing. The fact that this is sometimes not working makes me think there's an underlying bug there (not thread safe?).

    In the meantime, you can disable gzip compression of your sitemaps by replacing:
    Code:
    $canCompress = function_exists('gzopen');
    with:
    Code:
    $canCompress = false;
    in library/XenForo/Deferred/Sitemap.php.
     
  10. Stuart Wright

    Stuart Wright Well-Known Member

    Done and rebuilding the sitemap. Could this relate to the version of PHP, maybe? https://www.avforums.com/info.php
    5.4.33 - I could ask Tim to upgrade us.
     
  11. Karelke

    Karelke Active Member

    It's worth a shot IMO.
     
  12. Stuart Wright

    Stuart Wright Well-Known Member

  13. Mike

    Mike XenForo Developer Staff Member

    Would you mind trying this test script? (I can run it for you but I'd need access.) It doesn't depend on XF so you can put it where you like; however, it does have two variables that will need to be modified at the top:

    $source needs to point to one of the uncompressed sitemap XML files (internal_data/sitemap/...). You can copy the files out if you like, but the script doesn't actually write to them.

    $dest is the file it writes to. This needs to be writable by the PHP user (you could probably write it to /tmp without issue).

    Run the script (CLI or via a browser). After the first line, is it just dots or is there a "failed" comment? If you run it multiple times, is the md5 hash listed at the top the same?
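The attached script is not preserved in this thread. As a rough illustration only, a check of the kind described (print an md5 hash, then a dot or "failed" per attempt) might look like the sketch below. Everything here is an assumption: the function name `runCompressionTest`, the iteration count, and the exact output format are hypothetical and are not taken from the original attachment.

```php
<?php
// Hypothetical sketch (NOT the attached script): repeatedly gzip $source
// into $dest, decompress the result, and compare it against the source.
// Returns the number of failed attempts. $source is only read, never written.
function runCompressionTest(string $source, string $dest, int $iterations = 20): int
{
    $expected = md5_file($source);
    if ($expected === false) {
        throw new RuntimeException("Cannot read $source");
    }
    echo "source md5: $expected\n";

    $failures = 0;
    for ($i = 0; $i < $iterations; $i++) {
        $in = fopen($source, 'rb');
        $out = gzopen($dest, 'wb');
        if ($in === false || $out === false) {
            throw new RuntimeException('Cannot open source or destination');
        }

        // Same style of chunked copy as the loop under discussion.
        while (!feof($in)) {
            $chunk = fread($in, 8192);
            if ($chunk !== false && $chunk !== '') {
                gzwrite($out, $chunk);
            }
        }
        fclose($in);
        gzclose($out);

        // Decompress what was written and compare it with the source.
        $roundTrip = gzdecode(file_get_contents($dest));
        if ($roundTrip !== false && md5($roundTrip) === $expected) {
            echo '.';
        } else {
            echo " failed (iteration $i) ";
            $failures++;
        }
    }
    echo "\n";

    return $failures;
}
```

A stable hash across runs combined with nothing but dots would suggest that compression of an already-written file is reliable, which shifts suspicion to how the uncompressed file is written.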
     

    Attached Files:

  14. Stuart Wright

    Stuart Wright Well-Known Member

    Thanks, Mike.
    I ran it on 4 different files which Google Webmaster tools said it didn't like.
    Here are the results:
    The MD5 did not change when I re-ran it for the same file.
     
  15. Mike

    Mike XenForo Developer Staff Member

    I'm assuming that since you rebuilt it without compression there aren't issues?
     
  16. Stuart Wright

    Stuart Wright Well-Known Member

    So the output of sitemap.php is now xml.
    I just resubmitted to Google and it reported errors in some of the files.
    However, when I tried to open the files myself, I got errors in all that I tried:
    E.g. https://www.avforums.com/sitemap.php?c=1
    So maybe this is not a gzip thing?
     
  17. Mike

    Mike XenForo Developer Staff Member

    I really don't see why this is happening. There are really only two situations I see:
    • Race condition causing multiple sitemap builds to happen simultaneously. The deferred job system is designed to prevent this though. Rebuilding it manually should definitely not have this.
    • The server is lying about writing the file out. Should be unlikely, but I suppose it could be possible.
    The strings that are appearing when the file is broken are in the middle of strings that we're writing so it shouldn't be a generation issue in the code.

    To do any more debugging, I'd probably need to get access (ACP access to rebuild it and file access to test changes).
     
  18. Stuart Wright

    Stuart Wright Well-Known Member

    PM sent.
     
  19. Mike

    Mike XenForo Developer Staff Member

    I have made changes relating to this (file locking and flushing) and it seems to work.

    I'm not sure why this was happening to Stuart, as the specs say that writing to a file in append mode is an atomic operation (POSIX requirement, from what I understand), so writing half a line is confusing. More correctly, it looks like the write goes through but then the next open/read doesn't actually see that write. This even happened when I was debugging via error_log(). The lock/flush on the sitemap file does seem to resolve this at least.
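For readers who want to see what a lock-and-flush append looks like in PHP, here is a minimal sketch. It is not the change that went into XenForo, which is not shown in this thread; the function name `appendLineLocked` is hypothetical, and it uses only the standard `flock()`, `fwrite()` and `fflush()` calls.

```php
<?php
// Hypothetical helper (not the actual XenForo fix): append one line to a
// file while holding an exclusive lock, and flush before releasing it.
function appendLineLocked(string $path, string $line): void
{
    $handle = fopen($path, 'ab');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path for appending");
    }

    try {
        // Block until no other process holds the lock on this file.
        if (!flock($handle, LOCK_EX)) {
            throw new RuntimeException("Cannot lock $path");
        }

        $data = $line . "\n";
        $written = fwrite($handle, $data);
        if ($written !== strlen($data)) {
            throw new RuntimeException("Short write to $path");
        }

        // Push PHP's buffered output to the OS before the lock is released.
        if (!fflush($handle)) {
            throw new RuntimeException("Cannot flush $path");
        }
    } finally {
        flock($handle, LOCK_UN);
        fclose($handle);
    }
}
```

Note that `flock()` is advisory: it only helps if every writer to the file takes the same lock.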
     