
Fixed Spiders going crazy

Discussion in 'Resolved Bug Reports' started by nrep, Mar 13, 2013.

  1. nrep

    nrep Well-Known Member

    I've got quite a few XF installations, and each of them has the same problem. If you have a Google Webmaster Tools account set up, you'll probably see the same thing. Search engine spiders are somehow running an ever-increasing number of searches on the site, far outnumbering the actual pages.

    I suspect this is because the "/find-new/threads" link isn't marked "nofollow" and it generates a unique URL each time, so spiders keep re-crawling it and indexing the new pages. Here's a screenshot from WMT showing an example of the scale (all of my XF installations are small and show similar things):

    crawling.png

    Here's a snippet from my log to show why I think it's the new content search pages causing the problem (note how frequent the crawls are):

    Code:
    2013-03-12 04:56:55 123.123.123.123 GET /find-new/81052/threads - 80 - 66.249.73.205 HTTP/1.1 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - - www.gardening-forums.com 303 0 0 615 260 140
    2013-03-12 04:56:55 123.123.123.123 GET /find-new/345592/threads - 80 - 66.249.73.205 HTTP/1.1 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - - www.gardening-forums.com 200 0 0 13034 261 234
    2013-03-12 04:57:01 123.123.123.123 GET /find-new/148106/threads page=4 80 - 66.249.73.205 HTTP/1.1 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - - www.gardening-forums.com 303 0 0 615 268 109
    2013-03-12 04:57:02 123.123.123.123 GET /find-new/345593/threads - 80 - 66.249.73.205 HTTP/1.1 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - - www.gardening-forums.com 200 0 0 13051 261 218
    2013-03-12 04:57:14 123.123.123.123 GET /find-new/148107/threads page=2 80 - 66.249.73.205 HTTP/1.1 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - - www.gardening-forums.com 303 0 0 615 268 109
    2013-03-12 04:57:16 123.123.123.123 GET /find-new/345594/threads - 80 - 66.249.73.205 HTTP/1.1 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - - www.gardening-forums.com 200 0 0 13042 261 218
    2013-03-12 04:57:24 123.123.123.123 GET /find-new/65848/threads - 80 - 66.249.73.205 HTTP/1.1 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - - www.gardening-forums.com 303 0 0 615 260 109
    2013-03-12 04:57:24 123.123.123.123 GET /find-new/345595/threads - 80 - 66.249.73.205 HTTP/1.1 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - - www.gardening-forums.com 200 0 0 13019 261 218
    2013-03-12 04:57:29 123.123.123.123 GET /find-new/148106/threads page=6 80 - 66.249.73.205 HTTP/1.1 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - - www.gardening-forums.com 303 0 0 615 268 93
    2013-03-12 04:57:29 123.123.123.123 GET /find-new/345596/threads - 80 - 66.249.73.205 HTTP/1.1 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - - www.gardening-forums.com 200 0 0 13039 261 234
    2013-03-12 04:57:43 123.123.123.123 GET /find-new/148106/threads page=8 80 - 66.249.73.205 HTTP/1.1 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - - www.gardening-forums.com 303 0 0 615 268 124
    2013-03-12 04:57:44 123.123.123.123 GET /find-new/345597/threads - 80 - 66.249.73.205 HTTP/1.1 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - - www.gardening-forums.com 200 0 0 13060 261 234
    2013-03-12 04:57:52 123.123.123.123 GET /find-new/40135/threads page=4 80 - 66.249.73.205 
    I think it would make sense to "nofollow" the search URLs to prevent this, not least because the repeated search crawls add extra load and waste bandwidth.
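
To put a number on this from a log like the one above, something along these lines might help. It is only a rough sketch: the field positions (URL path in column 5, user agent in column 11, status in column 15) are inferred from the sample lines and will differ if your server logs different fields, and `summarize_find_new` is just an illustrative name. Truncated lines are skipped rather than guessed at.

```python
import re
from collections import Counter

# Matches paths such as /find-new/81052/threads and captures the search ID.
FIND_NEW = re.compile(r"^/find-new/(\d+)/threads$")


def summarize_find_new(lines, bot_marker="Googlebot"):
    """Count bot requests to /find-new/<id>/threads, grouped by HTTP status.

    Assumes whitespace-separated fields laid out like the sample log:
    index 4 = URL path, index 10 = user agent, index 14 = status code.
    Returns (Counter of status codes, set of distinct search IDs).
    """
    status_counts = Counter()
    search_ids = set()
    for line in lines:
        fields = line.split()
        if len(fields) < 15:
            continue  # truncated or malformed line, skip it
        path, user_agent, status = fields[4], fields[10], fields[14]
        match = FIND_NEW.match(path)
        if match is None or bot_marker not in user_agent:
            continue
        status_counts[status] += 1
        search_ids.add(match.group(1))
    return status_counts, search_ids
```

A large number of distinct search IDs compared with the number of real pages would be consistent with the crawler following a freshly generated URL on every visit.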

    Although a robots.txt rule would work, it's good practice to "nofollow" links to search pages like this that shouldn't be crawled (just as the search results page is "noindex"). The style selector link is "nofollow", as are other such elements, so I think it would be sensible to make this tiny change as well and nofollow the "what's new" links (and perhaps any other similar ones).
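
For anyone who wants the robots.txt stopgap in the meantime, here is a minimal sketch of checking such a rule with Python's standard `urllib.robotparser`. The two-line rule set is my assumption of what a suitable robots.txt might contain, not something shipped with XenForo, and `is_crawlable` is only an illustrative helper name.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content: block every crawler from the find-new URLs.
ROBOTS_TXT = """\
User-agent: *
Disallow: /find-new/
"""


def is_crawlable(url, user_agent="Googlebot", robots_txt=ROBOTS_TXT):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    base = "http://www.gardening-forums.com"
    print(is_crawlable(base + "/find-new/81052/threads"))
    print(is_crawlable(base + "/"))
```

This only confirms that well-behaved crawlers would be told to stay away from those URLs. It doesn't replace the "nofollow" change, which stops the links being followed in the first place.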
     
  2. Mike

    Mike XenForo Developer Staff Member

    I've nofollowed those links (there are a few), so hopefully that will sort it.
     
  3. nrep

    nrep Well-Known Member

    Thanks, I'll keep an eye on it after upgrading to 1.1.4 and post back if there are any more problems :).
     
