Fixed Spiders going crazy

nrep

Well-known member
I've got quite a few XF installations and each of them has the same problem. If you've got a Google Webmaster Tools account set up, you'll probably see the same thing too. Search engine spiders are somehow running an ever-increasing number of searches on the site, far outnumbering the actual pages.

I suspect this is because the "/find-new/threads" link isn't "nofollow" and it generates a unique URL each time it is requested, so spiders keep re-crawling it and indexing new pages. Here's a screenshot from WMT showing an example of the scale (all of my XF installations are small and show similar things):

crawling.webp

Here's a snippet from my log to show why I think it's the new content search pages causing the problem (note how frequent the crawls are):

Code:
2013-03-12 04:56:55 123.123.123.123 GET /find-new/81052/threads - 80 - 66.249.73.205 HTTP/1.1 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - - www.gardening-forums.com 303 0 0 615 260 140
2013-03-12 04:56:55 123.123.123.123 GET /find-new/345592/threads - 80 - 66.249.73.205 HTTP/1.1 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - - www.gardening-forums.com 200 0 0 13034 261 234
2013-03-12 04:57:01 123.123.123.123 GET /find-new/148106/threads page=4 80 - 66.249.73.205 HTTP/1.1 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - - www.gardening-forums.com 303 0 0 615 268 109
2013-03-12 04:57:02 123.123.123.123 GET /find-new/345593/threads - 80 - 66.249.73.205 HTTP/1.1 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - - www.gardening-forums.com 200 0 0 13051 261 218
2013-03-12 04:57:14 123.123.123.123 GET /find-new/148107/threads page=2 80 - 66.249.73.205 HTTP/1.1 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - - www.gardening-forums.com 303 0 0 615 268 109
2013-03-12 04:57:16 123.123.123.123 GET /find-new/345594/threads - 80 - 66.249.73.205 HTTP/1.1 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - - www.gardening-forums.com 200 0 0 13042 261 218
2013-03-12 04:57:24 123.123.123.123 GET /find-new/65848/threads - 80 - 66.249.73.205 HTTP/1.1 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - - www.gardening-forums.com 303 0 0 615 260 109
2013-03-12 04:57:24 123.123.123.123 GET /find-new/345595/threads - 80 - 66.249.73.205 HTTP/1.1 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - - www.gardening-forums.com 200 0 0 13019 261 218
2013-03-12 04:57:29 123.123.123.123 GET /find-new/148106/threads page=6 80 - 66.249.73.205 HTTP/1.1 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - - www.gardening-forums.com 303 0 0 615 268 93
2013-03-12 04:57:29 123.123.123.123 GET /find-new/345596/threads - 80 - 66.249.73.205 HTTP/1.1 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - - www.gardening-forums.com 200 0 0 13039 261 234
2013-03-12 04:57:43 123.123.123.123 GET /find-new/148106/threads page=8 80 - 66.249.73.205 HTTP/1.1 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - - www.gardening-forums.com 303 0 0 615 268 124
2013-03-12 04:57:44 123.123.123.123 GET /find-new/345597/threads - 80 - 66.249.73.205 HTTP/1.1 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - - www.gardening-forums.com 200 0 0 13060 261 234
2013-03-12 04:57:52 123.123.123.123 GET /find-new/40135/threads page=4 80 - 66.249.73.205
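For anyone who wants to check their own logs for the same pattern, here is a rough Python sketch that counts Googlebot hits on the find-new URLs. It assumes the whitespace-separated field layout seen in the snippet above (URL path in the 5th field, user agent in the 11th, status code in the 15th); the function name `summarize_find_new` is made up for illustration, and the field positions will need adjusting if your log format differs.

```python
import re
from collections import Counter

# Matches the find-new search URLs seen in the log, capturing the search ID.
FIND_NEW = re.compile(r"^/find-new/(\d+)/threads$")


def summarize_find_new(lines):
    """Count Googlebot requests to /find-new/<id>/threads by status code.

    Assumes space-separated fields laid out as in the log snippet above:
    index 4 = URL path, index 10 = user agent, index 14 = HTTP status.
    Lines with fewer fields (e.g. truncated ones) are skipped.
    """
    statuses = Counter()
    search_ids = set()
    for line in lines:
        fields = line.split()
        if len(fields) < 15:
            continue
        uri, user_agent, status = fields[4], fields[10], fields[14]
        match = FIND_NEW.match(uri)
        if match and "Googlebot" in user_agent:
            statuses[status] += 1
            search_ids.add(match.group(1))
    return statuses, search_ids


# Three lines copied from the log snippet above.
sample = [
    "2013-03-12 04:56:55 123.123.123.123 GET /find-new/81052/threads - 80 - 66.249.73.205 HTTP/1.1 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - - www.gardening-forums.com 303 0 0 615 260 140",
    "2013-03-12 04:56:55 123.123.123.123 GET /find-new/345592/threads - 80 - 66.249.73.205 HTTP/1.1 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - - www.gardening-forums.com 200 0 0 13034 261 234",
    "2013-03-12 04:57:01 123.123.123.123 GET /find-new/148106/threads page=4 80 - 66.249.73.205 HTTP/1.1 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - - www.gardening-forums.com 303 0 0 615 268 109",
]

statuses, search_ids = summarize_find_new(sample)
print(dict(statuses), len(search_ids))
```

If the number of distinct search IDs keeps growing while the number of real threads stays flat, that would be consistent with the spider chasing freshly generated search URLs.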

I think it would make sense to "nofollow" the search URLs to prevent this, not least because the repeated search crawls add extra load and waste bandwidth.

Although a robots.txt would work, it's good practice to "nofollow" links to search pages like this that shouldn't be crawled (just as the search results page is "noindex"). The style selector is already "nofollow", as are other such elements, so I think it would be sensible to make this tiny change and nofollow the "what's new" link (and perhaps any other similar links) as well.
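As a stopgap until the links are changed, the robots.txt route could look something like the sketch below. The `Disallow: /find-new/` rule and the sample thread URL are my own assumptions, not anything shipped with XF; the Python part just uses the standard library's `urllib.robotparser` to confirm the rule blocks the URLs from the log while leaving other pages crawlable.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: block every crawler from the
# find-new search URLs and leave everything else alone.
ROBOTS_TXT = """\
User-agent: *
Disallow: /find-new/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

base = "http://www.gardening-forums.com"

# A find-new URL taken from the log snippet: should be disallowed.
blocked = parser.can_fetch("Googlebot", base + "/find-new/81052/threads")

# An illustrative (made-up) thread URL: should still be allowed.
allowed = parser.can_fetch("Googlebot", base + "/threads/example-thread.1/")

print(blocked, allowed)
```

This only tells well-behaved crawlers not to fetch those URLs; it doesn't replace marking the links themselves "nofollow".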
 