Anyone else getting 404 errors in their apache access log when googlebot tries to read robots.txt?

Shamil · Aug 15, 2011

Dean said:
Yes, that will certainly stop all well behaved bots from searching my site, completely - which is not really what I want.

95% of the crawling is fine, just a few bots are trying to access the IP/~ method. I am not sure they are hurting anything actually.

This was supposed to be in /var/www, which also holds the default cPanel pages. It should not affect your website at domain.com. You can specify the bot

I'd usually advocate "if it ain't broke, don't fix it", though that's one I've known to bite me later.

I assume you just don't want your log to grow?

Dean · Aug 15, 2011

Is it possible that because we are on a vps that the location may be different? Every time IP/~mysite/robots.txt is accessed this shows up in the apache error_log:
File does not exist: /usr/local/apache/htdocs/robots.txt

Actually, the only indication that the IP/~mysite format is being used is that every time that particular googlebot accesses anything else, it does so via that method. Nothing that googlebot does is successful, and none of it is being logged to the cPanel access files.

Dean · Aug 15, 2011

Dean said:
Is it possible that because we are on a vps that the location may be different? Every time IP/~mysite/robots.txt is accessed this shows up in the apache error_log:
File does not exist: /usr/local/apache/htdocs/robots.txt

I'm fairly sure that would be the proper location, the directory structure on my hosting account is strange.. and I put the robots.txt file there with the Disallow *

Let's see what side effects this has...

Shamil · Aug 15, 2011

Dean said:
I'm fairly sure that would be the proper location, the directory structure on my hosting account is strange.. and I put the robots.txt file there with the Disallow *

Let's see what side effects this has...

I assume you meant that the robots.txt file is in /usr/local/apache/htdocs. Sometimes I hate control panels.

But yeah, it should not affect your site, as long as you don't see the disallow * when you go to domain.com/robots.txt (domain being the website in question).

Typing on an iPad with it trying to correct me is more difficult than I thought.
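One way to check this without eyeballing the file is to feed the text of each robots.txt to a parser and ask whether Googlebot may fetch the site root. The sketch below uses Python's standard urllib.robotparser. The helper name blocks_everything and the sample file contents are made up for illustration, and it assumes the blanket rule is written in the standard form "Disallow: /" under a "User-agent: *" line.

```python
from urllib.robotparser import RobotFileParser


def blocks_everything(robots_txt: str, agent: str = "Googlebot") -> bool:
    """Return True if this robots.txt text forbids `agent` from fetching "/"."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, "/")


# Hypothetical contents: the blanket file placed in the Apache error directory
BLANKET_FILE = "User-agent: *\nDisallow: /"
# Hypothetical contents: a permissive file served at domain.com/robots.txt
SITE_FILE = "User-agent: *\nDisallow:"

print(blocks_everything(BLANKET_FILE))
print(blocks_everything(SITE_FILE))
```

If the text served at domain.com/robots.txt comes back as not blocking, the site itself should still be crawlable by well-behaved bots.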

Floris · Aug 15, 2011

As long as the root of your site has robots.txt, it won't 404 for anybody loading it.
If a guest to your site can't load robots.txt, something's up.

If a control panel requires special treatment to get a robots.txt file to work, perhaps it's worth considering a move to a host that doesn't involve that kind of nonsense.

Dean · Aug 15, 2011

Floris said:
As long as the root of your site has robots.txt, it won't 404 for anybody loading it.
If a guest to your site can't load robots.txt, something's up.

If a control panel requires special treatment to get a robots.txt file to work, perhaps it's worth considering a move to a host that doesn't involve that kind of nonsense.

The robots.txt works fine when people (and bots) go to www.mysite.com.
Anyone trying to access my site via xxx.xx.xx.xxx/~mysite/robots.txt gets a 404 error - in fact, accessing anything via IPaddress/~mysite gets a 404 error.
All 404 errors that show up in the apache access_log are re-directed to files located in /usr/local/apache/htdocs/ - though the directory structure is a bit strange because it is a vps.
I have now put up a robots.txt file in that directory.
My best guess is that half the time google webmaster tools will consider one of the robots.txt files, and the other half of the time the other one - I need to wait to confirm.
There is 1 errant googlebot that keeps accessing my site via IP/~mysite. Last time I checked, it was visiting 7 times/hour, while 2 other googlebots are visiting www.mysite.com correctly about 350 times/hour.
In addition to that 1 errant googlebot, there are a few more search bots from msn, and thousands of visits from people trying to scrape content and script kiddies accessing via IPaddress/~mysite - all that visit via IP/~mysite are getting 404 errors.

It has nothing at all to do with any cPanel; it is set above the cPanel level. For all cPanel accounts, access via IPaddress/~____ is disabled.
I like my host a lot.

The only remaining question is *why*, while we were using vb, we had 150,000 accesses via IP/~mysite every month. Many were legitimate and good members of my forum. They cannot all be attributed to a few links that may have been posted on another forum.
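To put numbers on who is coming in through the IP/~account form, the access log can be tallied by user agent. This is only a rough sketch that assumes Apache's combined log format; the function name, the sample lines and the 192.0.2.x addresses are placeholders, not taken from the real log.

```python
import re
from collections import Counter

# Combined format: host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]+\] '
    r'"\S+ (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def tilde_hits_by_agent(lines):
    """Count requests whose path starts with /~ (the IP/~account form), per user agent."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("path").startswith("/~"):
            counts[match.group("agent")] += 1
    return counts


# Hypothetical sample lines for illustration only
SAMPLE_LOG = [
    '192.0.2.1 - - [15/Aug/2011:10:00:00 -0500] "GET /~mysite/robots.txt HTTP/1.1" 404 345 "-" "Googlebot/2.1"',
    '192.0.2.1 - - [15/Aug/2011:10:00:05 -0500] "GET /~mysite/index.php HTTP/1.1" 404 345 "-" "Googlebot/2.1"',
    '192.0.2.7 - - [15/Aug/2011:10:01:00 -0500] "GET /~mysite/forum/ HTTP/1.1" 404 345 "-" "msnbot/2.0"',
    '192.0.2.9 - - [15/Aug/2011:10:02:00 -0500] "GET /robots.txt HTTP/1.1" 200 120 "-" "Googlebot/2.1"',
]

for agent, hits in tilde_hits_by_agent(SAMPLE_LOG).most_common():
    print(hits, agent)
```

On a real server the lines would come from the apache access_log rather than a list, for example by passing an open file object to the function.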

Floris · Aug 15, 2011

I better understand it now, thanks for explaining!

Dean · Aug 17, 2011

Dean said:
I'm fairly sure that would be the proper location, the directory structure on my hosting account is strange.. and I put the robots.txt file there with the Disallow *

Let's see what side effects this has...

And apparently it did have an effect. The number of googlebot visits has increased 10x; in fact it has gone from 1 errant rogue googlebot to 3 of them since I added the robots.txt file where the apache 404 errors are re-directed to.

My wild guess would be that since it was getting something using the IPaddress/~mysite format, it is crawling even more. The good news is that so far, google webmaster tools does not show 'Disallow *'... but I am not going to take the chance of leaving that up. I will remove that file from the apache error directory (but still leave the proper robots.txt file at www.mysite.com).

Shamil · Aug 17, 2011

Dean said:
And apparently it did have an effect. The number of googlebot visits has increased 10x; in fact it has gone from 1 errant rogue googlebot to 3 of them since I added the robots.txt file where the apache 404 errors are re-directed to.

My wild guess would be that since it was getting something using the IPaddress/~mysite format, it is crawling even more. The good news is that so far, google webmaster tools does not show 'Disallow *'... but I am not going to take the chance of leaving that up. I will remove that file from the apache error directory (but still leave the proper robots.txt file at www.mysite.com).

Have you registered the website and the IP in Google Webmaster Tools?

Dean · Aug 17, 2011

Shamil said:
Have you registered the website and the IP in Google Webmaster Tools?

I have registered www.website.com with google. http://website.com is re-directed to www, so that cannot be registered unless I disable the re-direct for a short time.

I have not registered the IP. Based on what I am seeing, that would be counterproductive, i.e. googlebots tend to take any information and crawl it. Had the number of googlebot visits gone down, or stayed the same, I might do that.. but there is a 10x increase - which at the moment I attribute to them finding the robots.txt via IP.

They seem as bad as small children. "Don't do that" usually results in them "doing that"..

Make sense?

Shamil · Aug 17, 2011

Dean said:
I have registered www.website.com with google. http://website.com is re-directed to www, so that cannot be registered unless I disable the re-direct for a short time.

I have not registered the IP. Based on what I am seeing, that would be counterproductive, i.e. googlebots tend to take any information and crawl it. Had the number of googlebot visits gone down, or stayed the same, I might do that.. but there is a 10x increase - which at the moment I attribute to them finding the robots.txt via IP.

They seem as bad as small children. "Don't do that" usually results in them "doing that"..

Make sense?

Ah ok - Yep makes sense ... still trawling through logs, and making notes.

Dean · Aug 22, 2011

Terrific.

The number of site links has gone from 6 to 2. And one of those is the 'Cpanel' error page - because the errant googlebot is getting 404 errors.

Question is, which is better?

blocking that errant googlebot
submitting a site map and see if that helps
something else?

EDIT: I just checked the apache access_log and there are now 8 different googlebots accessing via the IPaddress/~mysite method... so blocking that many googlebots would not be good (I believe).

To summarize:

I changed my hosting account to no longer allow 301 re-directs when it is accessed via the IPaddress/~mysite method.
That created a *huge* number of 404 errors - for some reason there were many bots & scraping programs accessing via that method.
Up until that point there was only 1 googlebot IP accessing that way; the rest were accessing via www.site.com.
404 errors were going into a specific directory on the server, and I uploaded a robots.txt file with "Disallow *" in it, trying to dissuade the googlebot.
After it had successfully downloaded that robots.txt file with Disallow * in it, the number of googlebots increased from 1 to now 8 (the real robots.txt file was still at my www.___.com/robots.txt).
Those 8 googlebots now think my apache 404 error page is worthy of a site link, and demoted the other site links which really were pertinent.
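The "1 googlebot, now 8" figure is a count of distinct client IPs, which can be pulled out of the log as well. Here is a minimal sketch, again assuming a combined-format access log in which the user agent is the last quoted field; googlebot_ips, the sample lines and the 192.0.2.x addresses are hypothetical. Note that it trusts the user-agent string, which scrapers can fake.

```python
import re

# Captures: client host, request path, and the last quoted field (user agent)
LINE = re.compile(r'^(\S+) .*?"\S+ (\S+)[^"]*" .*"([^"]*)"\s*$')


def googlebot_ips(lines):
    """Return (IPs using the /~account form, IPs using normal paths) for Googlebot agents."""
    via_tilde, via_domain = set(), set()
    for line in lines:
        match = LINE.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        host, path, agent = match.groups()
        if "Googlebot" not in agent:
            continue
        (via_tilde if path.startswith("/~") else via_domain).add(host)
    return via_tilde, via_domain


# Hypothetical sample lines for illustration only
SAMPLE = [
    '192.0.2.1 - - [22/Aug/2011:09:00:00 -0500] "GET /~mysite/robots.txt HTTP/1.1" 404 345 "-" "Googlebot/2.1"',
    '192.0.2.2 - - [22/Aug/2011:09:00:10 -0500] "GET /~mysite/forum/ HTTP/1.1" 404 345 "-" "Googlebot/2.1"',
    '192.0.2.3 - - [22/Aug/2011:09:00:20 -0500] "GET /index.php HTTP/1.1" 200 5120 "-" "Googlebot/2.1"',
    '192.0.2.4 - - [22/Aug/2011:09:00:30 -0500] "GET /~mysite/index.php HTTP/1.1" 404 345 "-" "SomeScraper/1.0"',
]

tilde, normal = googlebot_ips(SAMPLE)
print(len(tilde), "googlebot IPs via IP/~mysite;", len(normal), "via the domain")
```

Comparing the two sets over time would show whether the number of googlebots using the IPaddress/~mysite method is really growing.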

For those who may not know, if I type 'xenforo' in google, the xenforo site links show up:
