Baidu crawling my site like crazy

Have you tried a robots.txt file with this inside?

Code:
User-agent: Baiduspider
Disallow: /
 
Seeing how they're not following robots.txt at all, and after some sneaky testing for a week, I have no problem blocking them 100%. I don't know who they are, nobody I know uses their services, and if Facebook has an interest in them, that's reason enough for me to block.

Glad to find out they're not the only ones hammering my server. They're in the top three of non-default user agents that constantly hammer every possible combination of every link, even stuff that has nothing to do with search engines.
 
I had no idea there were so many of them online at once. I was getting all of this from the Apache logs.
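If anyone else wants to check their own logs, this is roughly how to see which user agents are hitting you hardest (assuming a standard combined-format log at /var/log/apache2/access.log; adjust the path for your setup):

Code:
# Count requests per user agent, most active first
awk -F'"' '{print $6}' /var/log/apache2/access.log | sort | uniq -c | sort -rn | head

# Count requests per IP for a specific bot, e.g. Baiduspider
grep Baiduspider /var/log/apache2/access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head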

I firewall-blocked two full /24 ranges (xxx.xxx.xx.0/24) that they were on, and the server load went down a *huge* amount.
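For anyone wanting to do the same, this is roughly what it looks like with iptables (the /24s below are placeholder documentation ranges; substitute the actual ranges from your logs):

Code:
# Drop all traffic from the offending /24 ranges (placeholder addresses)
iptables -A INPUT -s 203.0.113.0/24 -j DROP
iptables -A INPUT -s 198.51.100.0/24 -j DROP

# Verify the rules took effect
iptables -L INPUT -n | grep DROP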
 
I've filed a complaint with the abuse contact for the data center that owns the IP ranges and am waiting for their review. Though I'm sure they pay so much money that the data center doesn't care. Politics.
 
Why would you want to block Baidu? Yes, really, I'm asking that question.

Mark Zuckerberg paid a visit to Baidu's headquarters a short while ago. That tells me Facebook is interested in acquiring Baidu down the line, which in turn tells me that Facebook's search engine of choice is Baidu... So, if you want more traffic to your site, Baidu is interested in seeing what you've got, and if they're successful, they'll direct more human traffic to your site, just like Google before them.

It's more about their present actions. If they're being annoying or suspicious right now, I'll block them. Google's bots never act suspiciously. If Baidu becomes the mainstream search engine everyone uses in the future, then I'll consider unblocking them.

Anyway, I'm checking my guest IPs right now. Fortunately I haven't found any Baidu bots yet, but I do have quite a few MSN search bots (5 of them) looking at random threads.
 
I'm having the same issue here too - I remember reading the thread. Currently I have about 30 Baidu spiders on my site, and now I'm getting crawled by some other Chinese-based sites despite the fact I have robots.txt blocking them - so many that at one point this week my server load spiked hugely. I guess I'm going to have to block the IPs at the firewall level - what a pain in the arse!
 
I guess you could do it via IP addresses in your .htaccess file.
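Something like this should do it, at least on Apache 2.2 (using the 119.63.196.x range mentioned below as an example; swap in whatever ranges you actually see):

Code:
# .htaccess - block a whole /24 at the web server level (Apache 2.2 syntax)
Order Allow,Deny
Allow from all
Deny from 119.63.196.0/24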

Guys, what is the deal with blocking IP ranges? For example, I see the Baidu range being 119.63.196.xxx, but if I ban 119.63.196. will that cause non-Baidu computers to be blocked? The reason I ask is that the Baidu spider is in Japan, and I also live in Japan and my server hosts some sites based in Japan, so I don't want to block them. If 119.63.196. is the organisation's range, though, then I'm happy to block it.
 
There are two versions, Baidu Japan and Baidu China. Whether they share results or not, I have no idea.

In my experience, Baidu honors robots.txt but suffers from bad actors riding its coattails (those you should absolutely nuke by IP, imo). I start my robots.txt with the bots that are allowed and end with:

Code:
# Disallow all others
User-agent: *
Disallow: /
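For anyone unfamiliar with the pattern, the full file ends up looking something like this (Googlebot and bingbot are just examples of bots you might allow; an empty Disallow: means that bot may crawl everything):

Code:
# Allow the bots you trust
User-agent: Googlebot
Disallow:

User-agent: bingbot
Disallow:

# Disallow all others
User-agent: *
Disallow: /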

This has been pretty successful for me. In fact, even a large number of Russian bots seem to comply.
 