
XF 1.5 Googlebot eating bandwidth crawling /index.php?find-new/

#1
Guys,

I had my website account suspended yesterday due to excessive resource usage. I've just checked my cPanel account today: CPU usage is constantly over 80%, and bandwidth consumed is 22.6 GB in 6 days (yes, GB not MB).

When I've looked into it, it's all been consumed by Googlebot.

I've been into my "Recent Visitors" log to see which IP addresses have been accessing which pages on my server, and I'd say 95% of the hits look like this one: "/index.php?find-new/1915589/posts&page=6"

Can I use robots.txt to disallow /index.php?find-new/ ???

I've had to block the IPs of all the Googlebots for now, to stop them consuming my bandwidth or earning me another account suspension, but I don't want to keep Google blocked forever.

If the above won't work, what can I do to stop it crawling the "new posts" URLs, as these are obviously going to be constantly changing?
 

Mouth

Well-known member
#2
Can I use robots.txt to disallow /index.php?find-new/ ???
Yes.
Code:
User-agent: *
Disallow: /account/
Disallow: /admin.php
Disallow: /attachments/
Disallow: /conversations*
Disallow: /cron.php
Disallow: /find-new/
Disallow: /goto/
Disallow: /login*
Disallow: /logout*
Disallow: /lost-password*
Disallow: /members/
Disallow: /online/
Disallow: /posts/
Disallow: /proxy.php*
Disallow: /resources/*/download
Disallow: /search*
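One caveat: robots.txt rules are matched against the URL as requested, so the path-style rules above only cover friendly URLs. Since your links are in the /index.php?find-new/ form (friendly URLs disabled), you'd also want the query-string variant, which Googlebot's pattern matching supports. A minimal addition to the same group, covering the URL from your log:
Code:
Disallow: /index.php?find-new/
The same idea applies to the other rules (e.g. /index.php?posts/) if you want to block those in the non-friendly form as well.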
 
#4
Make sure you confirm that those IPs are actually Google and not another bot pretending to be Google.
I checked and double-checked, and they are definitely Googlebots. I was logged into my Google Search Console account and I could see that the server connections stopped when I blocked the IP addresses.
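For anyone else wanting to check this: Google's documented way to verify a crawler is a reverse DNS lookup on the IP followed by a forward lookup to confirm it; the hostname should end in googlebot.com or google.com and resolve back to the same IP. A minimal sketch in Python (the IP below is just an illustrative example from Googlebot's published crawl range):
Code:
import socket

def is_real_googlebot(ip: str) -> bool:
    """Verify a crawler IP via reverse DNS plus forward confirmation."""
    try:
        host = socket.gethostbyaddr(ip)[0]  # reverse DNS: IP -> hostname
    except socket.herror:
        return False  # no PTR record, so not Googlebot
    if not host.endswith((".googlebot.com", ".google.com")):
        return False  # wrong domain: a bot spoofing the user agent
    try:
        # forward-confirm: the claimed hostname must resolve back to the same IP
        addrs = {info[4][0] for info in socket.getaddrinfo(host, None)}
    except socket.gaierror:
        return False
    return ip in addrs

print(is_real_googlebot("66.249.66.1"))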
 
#5
I just wanted to add: we went from 5-7K pages crawled by Googlebot per day to 1.9M yesterday, on a forum with ~36K indexed pages. That was around 20 GB worth of data transfer.

The other major issue is that the nginx logs were growing at about 20 MB/hour. Unless you are on a low-end SSD, that write rate itself is not too much; the problem is that the writes will fill up the disk if you are not rotating logs or shipping them to a remote syslog.

Once the disk fills, your forum will go down.
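On a typical Linux box, a logrotate rule prevents that. A minimal sketch, assuming nginx logs live in /var/log/nginx/ and the PID file is at /var/run/nginx.pid (both paths and the 14-day retention are illustrative):
Code:
/var/log/nginx/*.log {
        daily
        rotate 14
        compress
        delaycompress
        missingok
        notifempty
        sharedscripts
        postrotate
                [ -f /var/run/nginx.pid ] && kill -USR1 `cat /var/run/nginx.pid`
        endscript
}
Sending USR1 makes nginx reopen its log files, so it keeps writing to the new, empty log after each rotation.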

It seems like this is new Googlebot behavior as of March/April 2017.