Baidu crawling my site like crazy

Luke B

Hello all,

Over the last couple of days I've noticed a crazy number of hits from Baidu. It's been consistent pretty much all day, even though hardly any content has been added.

Is this typical?
 
My site is constantly being crawled by Baidu. I've no idea what they are indexing, or not indexing.

That seems to be the only spider that also constantly checks to see if my site is running without asking for content.
 
Exactly what I'm experiencing. They do this all the time. Maybe they aren't as efficient as Googlebot?
 
Possibly someone with more experience could give some input. I've only just started looking at these types of issues.
 
Baidu is apparently running amok :) I started to notice it a couple of days ago.

Code:
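<Directory /...path/to/..>
  # Apache 2.2 syntax: Allow is evaluated first, then Deny, so a matching Deny wins.
  Order allow,deny
  Allow from all
  # A partial address with a trailing dot matches the whole 119.63.196.x range.
  Deny from 119.63.196.
</Directory>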

And gone are the buggers...
 
I firewalled most of the IPs at the server level, but wanted a few of them to continue to visit until I could understand what they are doing. Things like 'do they respect robots.txt?' and 'why are they doing status checks so often?'
 
Why would you want to block Baidu? Yes, really, I'm asking that question.

Mark Zuckerberg visited the Baidu headquarters a short while ago. This tells me that Facebook is interested in acquiring Baidu down the line. This in turn tells me that Facebook's search engine of choice is Baidu... So, if you want more traffic to your site, Baidu is interested in seeing what you've got. And if successful, they'll direct more 'human' visitors to your site, just like Google before them.
 
Because I can.
And this has exactly what to do with the OP's request?
 
I firewalled most of the IPs at the server level, but wanted a few of them to continue to visit until I could understand what they are doing. Things like 'do they respect robots.txt?' and 'why are they doing status checks so often?'
Afaik, it does respect robots.txt, but it will still hit your site like crazy, generating lots of unneeded traffic. Blocking it before it even sees the page makes sense.
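
For anyone who wants to test the robots.txt side before reaching for the firewall, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content and the thread path are made up for illustration, and the sketch only shows what a well-behaved crawler should conclude from the rules. Whether Baiduspider actually obeys them is something only your access log can confirm.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the rules and paths are illustrative only.
ROBOTS_TXT = """\
User-agent: Baiduspider
Disallow: /

User-agent: *
Disallow: /misc/
"""


def is_allowed(user_agent: str, url: str) -> bool:
    """Return True if the rules above let `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Pass the crawler's short token, not the full "Mozilla/5.0 (...)" header,
    # because the parser matches on the text before the first "/".
    print(is_allowed("Baiduspider", "/threads/example.1/"))
    print(is_allowed("Googlebot", "/threads/example.1/"))
    print(is_allowed("Googlebot", "/misc/style?style_id=4"))
```

Even if the crawler honours a `Disallow: /`, it still has to request robots.txt itself from time to time, so a few hits will keep showing up in the log.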
 
And this has exactly what to do with the OP's request?
...Why the question? It has nothing to do with the request. I'm just stating that it's not really a good idea to block them, especially when the future looks better with Baidu as your search "king," because you're actually getting visitors FROM other countries - especially China.

China's business prospects are so strong right now that if you're an online business dealing with a worldwide community, all it takes is one visitor who likes your site so much that he or she tells friends about it - BOOM. Your site then starts bringing in more visitors like wildfire.

Oh, and FYI: Baidu is 6th in the Alexa rankings, according to Wikipedia.

But hey, it's not my site. I'm just offering an idea here.
 
What if your community is about a small city within one of the states... how will Baidu help my forum?
 
I've been asking myself the same thing about an XF site (recently converted from vB) that I am running for my wife's church. How they are even finding it, let alone why they are indexing it, puzzles me.
 
Have you gone through your cpanel access log?

Baidu is constantly trying to access the content of my site like this:
Code:
119.63.196.41 - - [24/Jul/2011:05:13:55 -0700] "GET /misc/style?style_id=4&redirect=%2Fmisc%2Fstyle%3Fredirect%3D%252Fthreads%random-thread-title252.16076%252Fpage-4 HTTP/1.1" 303 - "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
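
If you'd rather count than scroll, a rough sketch along these lines can summarise what a crawler is doing. It assumes the log is in Apache's combined format, like the line above. The sample lines are shortened or hypothetical (the `HEAD` request and the Googlebot entry are invented for illustration), and `summarize_crawler` is just a name picked for this example.

```python
import re
from collections import Counter

# Combined Log Format:
# ip ident user [time] "METHOD path PROTOCOL" status size "referer" "agent"
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def summarize_crawler(lines, agent_token="Baiduspider"):
    """Count hits per IP, HTTP method and status for one crawler."""
    by_ip, by_method, by_status = Counter(), Counter(), Counter()
    for line in lines:
        match = LOG_PATTERN.match(line.strip())
        if match is None:
            continue  # skip lines that are not in the expected format
        if agent_token.lower() not in match["agent"].lower():
            continue  # some other visitor
        by_ip[match["ip"]] += 1
        by_method[match["method"]] += 1
        by_status[match["status"]] += 1
    return by_ip, by_method, by_status


BAIDU_UA = ("Mozilla/5.0 (compatible; Baiduspider/2.0; "
            "+http://www.baidu.com/search/spider.html)")

# Illustrative sample; in practice, iterate over the lines of your access log.
SAMPLE_LINES = [
    '119.63.196.41 - - [24/Jul/2011:05:13:55 -0700] '
    '"GET /misc/style?style_id=4 HTTP/1.1" 303 - "-" "%s"' % BAIDU_UA,
    '119.63.196.41 - - [24/Jul/2011:05:14:02 -0700] '
    '"HEAD / HTTP/1.1" 200 - "-" "%s"' % BAIDU_UA,
    '203.0.113.9 - - [24/Jul/2011:05:14:10 -0700] '
    '"GET /threads/example.1/ HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]

if __name__ == "__main__":
    ips, methods, statuses = summarize_crawler(SAMPLE_LINES)
    print(ips.most_common(5))
    print(dict(methods))
    print(dict(statuses))
```

A high share of `HEAD` requests would fit the "is the site up?" checks mentioned earlier in the thread, and a pile of 303s on `/misc/style` would suggest the spider is following the style-chooser redirect links rather than reading content.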
 
I have, and it looks like they are hitting every link off of the main URL page. Thinking out loud... they are likely picking up the URL from the church's Twitter account and are then doing a blind crawl of the site regardless of the site's locale or content.
 
No, I doubt that, because they used to crawl my site like mad and I don't think there is a single link on Twitter pointing to my site.

Baidu is apparently running amok :) I started to notice it a couple of days ago.

Code:
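<Directory /...path/to/..>
  # Apache 2.2 syntax: Allow is evaluated first, then Deny, so a matching Deny wins.
  Order allow,deny
  Allow from all
  # A partial address with a trailing dot matches the whole 119.63.196.x range.
  Deny from 119.63.196.
</Directory>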

And gone are the buggers...
This is the first thing I do when an IP range pisses me off.
Code:
ip route add blackhole 119.63.196.0/24
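
To sanity-check that a /24 route like this covers the same addresses as the `Deny from 119.63.196.` prefix, a small sketch with Python's standard `ipaddress` module can help. The `is_blocked` helper is a name made up for this example, and the second test address is arbitrary.

```python
import ipaddress

# The range from the blackhole route above: 119.63.196.0 through 119.63.196.255.
BLOCKED_RANGE = ipaddress.ip_network("119.63.196.0/24")


def is_blocked(ip: str) -> bool:
    """Return True if `ip` falls inside the blackholed range."""
    return ipaddress.ip_address(ip) in BLOCKED_RANGE


if __name__ == "__main__":
    print(BLOCKED_RANGE.num_addresses)   # size of the blocked range
    print(is_blocked("119.63.196.41"))   # the address seen in the access log
    print(is_blocked("119.63.197.1"))    # a neighbouring range, not covered
```

If the spider turns up from addresses outside this range, they would each need their own route or Deny line.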
 
I have, and it looks like they are hitting every link off of the main URL page. Thinking out loud... they are likely picking up the URL from the church's Twitter account and are then doing a blind crawl of the site regardless of the site's locale or content.
I've actually thought about this exact thing. It could very well be a coincidence but it seems like every time I post something on Twitter, Baidu come a crawl'n. But then again, I didn't post anything yesterday and Baidu was all over SLRuser.

I'm not really concerned, only curious as to why there is so much of it.
 