Anyone else getting 404 errors in their apache access log when googlebot tries to read robots.txt?

Dean

in memoriam
I've been trying to figure this out for a while. Every day I get a few 404 errors in the Apache access_log when googlebot tries to access robots.txt. This happens to a far lesser extent with other bots such as Baidu, and there are no problems with bingbot, yahoo, or several others.

Is this occurring for anyone else?

It appears googlebot successfully downloads robots.txt a few times a day based on the cPanel raw access logs that I've been going through, but I cannot understand where the 404 errors are coming from in the apache access log...

I've no idea where the apache access log is traditionally kept, but on our account it is at /usr/local/apache/logs/access_log

That file is a bit large...
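If the access_log is too large to read by hand, a small script can stream it line by line and pull out just the robots.txt requests. This is a hypothetical sketch (the log path and exact line format are assumptions based on the entries quoted in this thread):

```python
import re

# Matches Apache common-log lines for robots.txt requests,
# capturing the client IP and the HTTP status code.
LINE_RE = re.compile(r'^(\S+) .*"GET /robots\.txt [^"]*" (\d{3})')

def robots_hits(lines):
    """Yield (client_ip, status) for each robots.txt request."""
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            yield m.group(1), m.group(2)

# Example with lines like the ones quoted later in this thread:
sample = [
    '66.249.67.119 - - [13/Aug/2011:11:41:10 -0700] "GET /robots.txt HTTP/1.1" 404 2005',
    '66.249.67.34 - - [14/Aug/2011:14:06:35 -0700] "GET /robots.txt HTTP/1.1" 200 187',
]
print(list(robots_hits(sample)))
# [('66.249.67.119', '404'), ('66.249.67.34', '200')]
```

In practice you'd feed it `open('/usr/local/apache/logs/access_log')` instead of the sample list; streaming keeps memory use flat no matter how big the file is.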
 
XenForo doesn't ship with a robots.txt, so sites powered only by XenForo (or any site without a robots.txt) will run into these. Why they don't show in error_log is unclear to me.
 
I am terribly sorry to have to say, you might want to have a look in the error_log file. Without the full log/request it's going to be a bit difficult to say.

Have you signed the website up to Google Webmaster Tools and gone through its processes to monitor the site?
 
I figured it out :)

66.249.67.119 - - [13/Aug/2011:11:41:10 -0700] "GET /robots.txt HTTP/1.1" 404 2005 <- access_log
66.249.67.119 - - [13/Aug/2011:23:43:37 -0700] "GET /robots.txt HTTP/1.1" 404 2005 <- access_log
(I don't have the cPanel raw access logs for the above 2 attempts)

This is what has been happening:
66.249.67.119 - - [14/Aug/2011:11:55:31 -0700] "GET /robots.txt HTTP/1.1" 404 2003 <- access_log
[Sun Aug 14 11:55:31 2011] [error] [client 66.249.67.119] File does not exist: /usr/local/apache/htdocs/robots.txt <- error_log - and of course that never reaches the cPanel access raw logs cause... it was an error
66.249.67.34 - - [14/Aug/2011:14:06:35 -0700] "GET /robots.txt HTTP/1.1" 301 202 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.67.34 - - [14/Aug/2011:14:06:35 -0700] "GET /robots.txt HTTP/1.1" 200 187 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

The last 2 lines were from the cPanel raw access logs.

I just realized what is happening. Google (and a few other bots) are accessing the site using this format:
IPaddress/~mysite - that is how 66.249.67.119 is accessing it. It *always* tries to use the IP/~ format.

Whereas 66.249.67.34 is using the www.mysite.com method, every time.

I did not notice any of this until "Enable mod_userdir Protection" got checked by the hosting provider (I asked them to do something, and that was a side effect). Which, as you probably know, blocks accessing the site via the IP/~mysite format.
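For anyone unfamiliar with the setting: cPanel's "Enable mod_userdir Protection" essentially turns off Apache's per-user URLs. A rough sketch of what that amounts to at the Apache level (this is an illustration, not the exact directives cPanel writes):

```apache
# Hypothetical sketch: with mod_userdir protection on, per-user URLs
# like http://IPaddress/~account/ no longer map to the account's
# home directory, so those requests 404 (or hit the default docroot).
<IfModule mod_userdir.c>
    UserDir disabled
</IfModule>
```

That matches the behavior in the logs above: the IP/~ requests stop resolving to the account, while hostname-based requests keep working.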
 
Personally, on my hosting account, the access_log items happen first and those are of more interest to me, whereas the error_log shows the results of what happened in the access_log. Of course, the items that are successful are in the cPanel raw access logs.

It could easily be different on other hosting accounts, not sure.
 

Access_log shows exactly what it describes... any access requests. Error_log logs errors when something goes wrong. There is a spider crawling up my neck... I can feel it.
 
It's always best for systems internally to go to the long IP for something; if there's no reverse resolve set, or it times out because their own DNS is overwhelmed, it might just bounce back to the IP, not the actual host for the IP. Or just poor coding, etc.

IPv4 host to IP, IP to host.

You can check with dnsstuff.com (some features are free, some not) to analyze the DNS / name server / host to IP / traceroute / reverse resolve / etc.
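On the "long IP" point: the dotted-quad and long-integer forms of an IPv4 address are the same 32-bit value, and the conversion is easy to check locally (a stdlib-only sketch, no network lookups involved):

```python
import ipaddress

def ip_to_long(ip: str) -> int:
    """Dotted quad -> 32-bit integer form ("long IP")."""
    return int(ipaddress.IPv4Address(ip))

def long_to_ip(n: int) -> str:
    """32-bit integer -> dotted quad."""
    return str(ipaddress.IPv4Address(n))

print(ip_to_long("66.249.67.119"))   # 1123631991
print(long_to_ip(1123631991))        # 66.249.67.119
```

So a link written as http://1123631991/ and one written as http://66.249.67.119/ point at the same server, which is one way odd-looking IP links end up in crawl queues.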
 
Usually, you'd want to avoid it, but I wonder where it learned the IP link.
Don't know. That is the baffling part. It seems that, somehow, when we were using vB, that was how everything was being accessed.

My access_log goes back to December. For the 5 weeks we were using vB, there are 150k entries using the IP/~ format.

I've only just started looking at these types of things. And.. I do not have any cPanel raw access files from when we were using vb.

So I am *definitely* perplexed. If anyone has any thoughts, please share.
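Arriving at numbers like that 150k is easy to reproduce: tally every request whose path starts with a userdir-style /~account prefix. A hypothetical sketch (the regex assumes common/combined log format like the lines quoted above):

```python
import re
from collections import Counter

# Capture the /~account prefix from the request line of a log entry.
USERDIR_RE = re.compile(r'"(?:GET|POST|HEAD) (/~[^/\s]+)')

def count_userdir_hits(lines):
    """Count requests per /~account prefix in access_log lines."""
    counts = Counter()
    for line in lines:
        m = USERDIR_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

sample = [
    '1.2.3.4 - - [01/Jan/2011:00:00:00 -0700] "GET /~mysite/index.php HTTP/1.1" 200 1234',
    '1.2.3.4 - - [01/Jan/2011:00:00:01 -0700] "GET /robots.txt HTTP/1.1" 404 2005',
]
print(count_userdir_hits(sample))  # Counter({'/~mysite': 1})
```

Run over the real access_log, the per-prefix counts also show whether it's one bot or many doing the IP/~ crawling.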
 
Could this be a vb thing using IP/~?

I've really been curious. Obviously I could have had things set up incorrectly..
 

I doubt it's vB; it could be an incorrect setup, but I have yet to find a system where this has been done unintentionally.

As Floris said, it could be internally, but doesn't really explain why Google's trying to hit it. Does a google search of the IP yield anything?
 
Does a google search of the IP yield anything?
I found 3 or 4 links using IP/~ on a few forums.

We had an issue a while back (December 2009) and told everyone to use the IP/~ format for about 18 hours, then I deleted all references to it.

That still does not explain the *huge* number of accesses during the 5 weeks we were using vb..
 

I'd request Google to remove the links as a first step:

If you own the site, you can verify your ownership in Webmaster Tools and use the verified URL removal tool to remove an entire directory from Google's search results.
Note: To ensure your directory or site is permanently removed, you should use robots.txt to block crawler access to the directory (or, if you’re removing a site, to your whole site). We recommend doing this before or soon after requesting removal of the directory. Otherwise, your content may later reappear in search results. (For more information about blocking search engines from confidential information, see Blocking Google.) Returning a 404 HTTP status code isn't enough, because it's possible for a directory to return a 404 status code, but still serve out files underneath it. Using robots.txt to block a directory ensures that all of its children are disallowed as well.
Once you have completed one of the steps above, you can request removal of the directory and all of its contents from search results using the URL Removal Tool in Webmaster Tools.
  1. On the Webmaster Tools home page, click the site you want.
  2. On the Dashboard, click Site configuration in the left-hand navigation.
  3. Click Crawler access, and then click Remove URL.
  4. Click New removal request.
  5. Type the URL of the directory you want removed from search results and then click Continue. How to find the right URL. Note that the URL is case-sensitive—you will need to submit the URL using exactly the same characters and the same capitalization that the site uses. If you want to remove the whole site, you can leave this blank.
  6. Click Remove directory.
  7. Select the checkbox to confirm that you have completed the requirements listed in this article, and then click Submit Request.
Be careful when requesting removal of a site. The only reason you should request a site removal is when you want all the contents of a site permanently removed from Google’s index.
Removing https://www.example.com will also remove http://www.example.com, as well as http://example.com and https://example.com.
If you’re worried that your site may have a penalty, or you want to start from scratch after purchasing a domain from somebody else, we recommend filing a reconsideration request letting us know what you're worried about and what has changed. If your site has been hacked, check this article for recommendations.

It sounds like this: Google shouldn't be harassing the site, it should gently caress the site for robots.txt.
 
I'd request Google to remove the links as a first step:
You mean the 3-4 links I found on other forums? If so, I could probably just call the people that own the sites, we are a close knit group. :)

Would this work?
Disallow: xxx.xx.xx.xxx/~
 

Removing the IP from the posts should be fine, but it might take a while for Googlebot to slow down. What I'd do, assuming you're using cPanel, is place a robots.txt in /var/www with content:

User-agent: *
Disallow: /

That'll stop the errors from being produced in the meantime.

Disallow: xxx.xx.xx.xxx/~ is not a valid command read by bots.
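To expand on why: a robots.txt Disallow rule only takes effect inside a User-agent group, and its value is a path, not a host or IP. Python's stdlib parser makes a quick way to check what compliant bots will actually honor (example.com here is just a placeholder):

```python
from urllib import robotparser

# A Disallow line with no preceding User-agent group is discarded
# by the parser, so everything stays allowed.
invalid = robotparser.RobotFileParser()
invalid.parse(["Disallow: /"])
print(invalid.can_fetch("Googlebot", "http://example.com/page"))  # True

# A well-formed group actually blocks crawling.
valid = robotparser.RobotFileParser()
valid.parse(["User-agent: *", "Disallow: /"])
print(valid.can_fetch("Googlebot", "http://example.com/page"))  # False
```

The same logic is why "Disallow: xxx.xx.xx.xxx/~" does nothing useful: the value would be treated as a literal path on the current host, not as a rule about the IP-based URL.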
 
Yes, that will certainly stop all well-behaved bots from crawling my site completely - which is not really what I want.

95% of the crawling is fine, just a few bots are trying to access the IP/~ method. I am not sure they are hurting anything actually.
 