
Anyone else getting 404 errors in their apache access log when googlebot tries to read robots.txt?

Discussion in 'Server Configuration and Hosting' started by Dean, Aug 14, 2011.

  1. Dean

    Dean Well-Known Member

    I've been trying to figure this out for a while. Every day I get a few 404 errors in the apache access_log when googlebot is trying to access robots.txt. This happens to a far lesser extent for other bots such as Baidu, and no problems with bingbot, yahoo, or several others.

    Is this occurring for anyone else?

    It appears googlebot successfully downloads robots.txt a few times a day based on the cPanel raw access logs that I've been going through, but I cannot understand where the 404 errors are coming from in the apache access log...

    I've no idea where the apache access log is traditionally kept, but on our account it is at /usr/local/apache/logs/access_log

    That file is a bit large...
  2. Floris

    Floris Guest

    XenForo doesn't ship with a robots.txt, so sites powered only by XenForo, or any site without a robots.txt, will run into these. Why they don't show in the error_log, I'm not sure.
  3. Shamil

    Shamil Well-Known Member

    I am terribly sorry to have to say, you might want to have a look in the error_log file. Without the full log/request it's going to be a bit difficult to say.

    Have you signed the website up to Google Webmaster Tools and gone through its process to monitor the website?
  4. Dean

    Dean Well-Known Member

    I figured it out :)

    - - [13/Aug/2011:11:41:10 -0700] "GET /robots.txt HTTP/1.1" 404 2005 <- access_log
    - - [13/Aug/2011:23:43:37 -0700] "GET /robots.txt HTTP/1.1" 404 2005 <- access_log
    ( don't have the cPanel raw access logs for the above 2 attempts)

    This is what has been happening:

    - - [14/Aug/2011:11:55:31 -0700] "GET /robots.txt HTTP/1.1" 404 2003 <- access_log
    [Sun Aug 14 11:55:31 2011] [error] [client] File does not exist: /usr/local/apache/htdocs/robots.txt <- error_log

    - and of course that never reaches the cPanel raw access logs because... it was an error

    - - [14/Aug/2011:14:06:35 -0700] "GET /robots.txt HTTP/1.1" 301 202 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    - - [14/Aug/2011:14:06:35 -0700] "GET /robots.txt HTTP/1.1" 200 187 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

    The last 2 lines were from the cPanel raw access logs.

    I just realized what is happening. Google (and a few other bots) are accessing the site using this format:
    IPaddress/~mysite - that is how it is accessing the site. It *always* tries to use the IP/~ format.

    Whereas everything else uses the www.mysite.com method, every time.

    I did not notice any of this until the Enable mod_userdir Protection option got checked by the hosting provider (I asked them to do something, and that was a side effect). Which, as you probably know, blocks access to the site via the IP/~mysite format.
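    For anyone hitting the same thing, one possible workaround (a sketch of my own, not something from this thread) is to keep mod_userdir protection on but 301-redirect IP-based requests to the canonical domain, so bots that learned the IP/~ URLs get pointed back at the real site. The IP 203.0.113.10 and www.mysite.com below are placeholder values, not the poster's actual details.

```apache
# Hypothetical .htaccess sketch -- 203.0.113.10 and www.mysite.com
# are placeholders; substitute your own server IP and domain.
RewriteEngine On
# Match requests whose Host header is the bare server IP
RewriteCond %{HTTP_HOST} ^203\.0\.113\.10$
# Send IP/~user/... (or IP/...) to the canonical www host, preserving the path
RewriteRule ^(?:~[^/]+/?)?(.*)$ http://www.mysite.com/$1 [R=301,L]
```

    Once Googlebot sees consistent 301s, it should gradually drop the IP-based URLs in favor of the canonical ones.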
  5. Shamil

    Shamil Well-Known Member

    uh oh - that shouldn't really happen :eek:
  6. Dean

    Dean Well-Known Member

    Personally, on my hosting account, the access_log entries happen first, and those are of more interest to me. The error_log shows the results of what happened in the access_log. Of course, the items that are successful are in the cPanel raw access logs.

    It could easily be different on other hosting accounts, not sure.
  7. Dean

    Dean Well-Known Member

    rut roh?

  8. Shamil

    Shamil Well-Known Member

    Access_log shows exactly what it describes... any access requests. Error_log is written to if something goes wrong. There is a spider crawling up my neck... I can feel it.
  9. Shamil

    Shamil Well-Known Member

    Usually, you'd want to avoid it, but I wonder where it learned the IP link.
  10. Floris

    Floris Guest

    It's always best for systems to resolve a host internally; if there's no reverse resolve set, or it times out because their own DNS is overwhelmed, it might just fall back to the IP instead of the actual host for that IP. Or it's just poor coding, etc.

    ipv4 host to ip, ip to host.

    You can check with dnsstuff.com (some features are free, some not) to analyze the DNS / name server / host-to-IP / traceroute / reverse resolve / etc.
  11. Dean

    Dean Well-Known Member

    Don't know. That is the baffling part. It seems, somehow, when we were using vb that was how everything was being accessed for some reason.

    My access_log goes back to December. For the 5 weeks we were using vB, there are 150k entries using the IP/~ format.

    I've only just started looking at these types of things. And.. I do not have any cPanel raw access files from when we were using vb.

    So I am *definitely* perplexed. If anyone has any thoughts, please share.
  12. Dean

    Dean Well-Known Member

    Could this be a vB thing, using IP/~?

    I've really been curious. Obviously I could have had things set up incorrectly..
  13. Shamil

    Shamil Well-Known Member

    I doubt it's vB; it could be an incorrect setup, but I have yet to find a system where this has been done unintentionally.

    As Floris said, it could be internally, but doesn't really explain why Google's trying to hit it. Does a google search of the IP yield anything?
  14. Dean

    Dean Well-Known Member

    I found 3 or 4 links using IP/~ on a few forums.

    We had an issue a while back (December 2009) and told everyone to use the IP/~ format for about 18 hours, then I deleted all references to it.

    That still does not explain the *huge* number of accesses during the 5 weeks we were using vb..
  15. Shamil

    Shamil Well-Known Member

    I'd request Google to remove the links as a first step:

    If you own the site, you can verify your ownership in Webmaster Tools and use the verified URL removal tool to remove an entire directory from Google's search results.
    Note: To ensure your directory or site is permanently removed, you should use robots.txt to block crawler access to the directory (or, if you’re removing a site, to your whole site). We recommend doing this before or soon after requesting removal of the directory. Otherwise, your content may later reappear in search results. (For more information about blocking search engines from confidential information, see Blocking Google.) Returning a 404 HTTP status code isn't enough, because it's possible for a directory to return a 404 status code, but still serve out files underneath it. Using robots.txt to block a directory ensures that all of its children are disallowed as well.
    Once you have completed one of the steps above, you can request removal of the directory and all of its contents from search results using the URL Removal Tool in Webmaster Tools.
    1. On the Webmaster Tools home page, click the site you want.
    2. On the Dashboard, click Site configuration in the left-hand navigation.
    3. Click Crawler access, and then click Remove URL.
    4. Click New removal request.
    5. Type the URL of the directory you want removed from search results and then click Continue. How to find the right URL. Note that the URL is case-sensitive—you will need to submit the URL using exactly the same characters and the same capitalization that the site uses. If you want to remove the whole site, you can leave this blank.
    6. Click Remove directory
    7. Select the checkbox to confirm that you have completed the requirements listed in this article, and then click Submit Request.
    Be careful when requesting removal of a site. The only reason you should request a site removal is when you want all the contents of a site permanently removed from Google’s index.
    Removing https://www.example.com will also remove http://www.example.com, as well as http://example.com and https://example.com.
    If you’re worried that your site may have a penalty, or you want to start from scratch after purchasing a domain from somebody else, we recommend filing a reconsideration request letting us know what you're worried about and what has changed. If your site has been hacked, check this article for recommendations.

    It sounds like this: Google shouldn't be harassing the site, it should gently caress the site for robots.txt.
  16. Dean

    Dean Well-Known Member

    You mean the 3-4 links I found on other forums? If so, I could probably just call the people that own the sites, we are a close knit group. :)

    Would this work?
    Disallow: xxx.xx.xx.xxx/~
  17. Dean

    Dean Well-Known Member

    And yes, in Google Webmaster Tools I confirmed ownership of www.mysite.com, but not http://mysite.com - because we have that redirected to www.

    How do I confirm I own the IP?


  18. Shamil

    Shamil Well-Known Member

    Removing the IP from the posts should be fine, but it might take a while for Googlebot to slow down. What I'd do, assuming you're using cPanel, is place a robots.txt in /var/www with this content:

    User-agent: *
    Disallow: /

    That'll stop the errors from being produced in the meantime.

    Disallow: xxx.xx.xx.xxx/~ is not a valid directive; Disallow takes a path relative to the host, so bots won't read a full URL there.
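    For what it's worth, the difference is easy to check with Python's standard-library robots.txt parser (a sketch I'm adding here, not something from the thread): a proper User-agent/Disallow record blocks crawling, while a malformed line with no colon and no User-agent record is simply ignored by the parser.

```python
from urllib.robotparser import RobotFileParser

# Valid robots.txt: blocks every path for every bot
valid = RobotFileParser()
valid.parse(["User-agent: *", "Disallow: /"])
print(valid.can_fetch("Googlebot", "http://example.com/page"))   # False

# A bare "Disallow *" line has no colon and belongs to no
# User-agent record, so the parser skips it and nothing is blocked
broken = RobotFileParser()
broken.parse(["Disallow *"])
print(broken.can_fetch("Googlebot", "http://example.com/page"))  # True
```

    So a directive that isn't spelled exactly right is not "partially" obeyed; it just silently does nothing.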
  19. Shamil

    Shamil Well-Known Member

    IP/ should be fine. Stick a robots.txt in /var/www as above, and see if you can access robots.txt using the URL Googlebot tries.
  20. Dean

    Dean Well-Known Member

    Yes, that will certainly stop all well-behaved bots from crawling my site completely - which is not really what I want.

    95% of the crawling is fine, just a few bots are trying to access the IP/~ method. I am not sure they are hurting anything actually.
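    One possible middle ground (again my own sketch, not advice given in the thread): serve a blocking robots.txt only for requests that arrive via the bare IP, leaving the www domain fully crawlable. The IP 203.0.113.10 and the file name robots-ip.txt below are placeholders.

```apache
# Hypothetical .htaccess sketch -- 203.0.113.10 and robots-ip.txt
# are placeholder names. Requests for robots.txt made against the
# bare IP get the blocking file; www.mysite.com keeps its normal one.
RewriteEngine On
RewriteCond %{HTTP_HOST} ^203\.0\.113\.10$
RewriteRule ^robots\.txt$ /robots-ip.txt [L]
```

    Here robots-ip.txt would contain a blanket "User-agent: *" / "Disallow: /", so only the IP/~ crawling stops while normal crawling of the domain continues.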
