Higher than normal server load due to increased HTTP requests - how to discover what/who?

CTXMedia

Debian server with 8GB RAM - normal server load around 0.90

In recent weeks my server load has been averaging 2.80 (sometimes going quite a bit higher). I'm seeing many more Apache processes in top than before, and a higher number of tasks - usually around 130 with 1 running, but recently up around 140-180 with as many as 6 running.

I've done a "tail -f" on the access_log files for each of my sites; the smaller sites aren't showing any great amount of traffic, so I've narrowed the actual target down to my largest site, CycleChat.
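Something like this gives a quicker comparison than watching each tail by eye - it counts the current hour's requests in each log (a rough sketch: it assumes Apache's standard English timestamps, and the log paths are illustrative):

Code:
# Requests logged during the current hour, per site
for f in /var/log/apache2/*access*.log; do
    printf '%s: ' "$f"
    grep -c "$(LC_ALL=C date +%d/%b/%Y:%H)" "$f"
done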

There are too many requests to gain anything useful from watching the tail. I have a command to summarise the connections [ netstat -alntp | grep :80 | wc -l ], but when I try to establish which IP addresses have the most requests to port 80 [ netstat -plan | grep :80 | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -nk1 ] the figures don't add up - there are far more connections in total than are accounted for by the per-IP list.
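I suspect part of the mismatch is that grep :80 matches ":80" anywhere on the line - including :8080, the LISTEN entry, and connections where 80 is the remote rather than the local port - and the total includes states like TIME_WAIT. A sketch that keys on the local port only (assuming the usual netstat columns - $4 local address, $5 foreign address, $6 state - and IPv4 addresses):

Code:
# Established connections to local port 80, counted per remote IP
netstat -ant | awk '$4 ~ /:80$/ && $6 == "ESTABLISHED" { split($5, a, ":"); print a[1] }' | sort | uniq -c | sort -rn | head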

I'm assuming it's a bot or scraper, but how do I find out which IP(s) are responsible so I can add them to my firewall?

Any help appreciated. (y)

Thanks,
Shaun :D
 
Do you have any / many requests from this IP range: 180.76.0.0/16?

That range belongs to Baidu, the Chinese search bot. When I blocked them from crawling my site the average load decreased, as they are an aggressive spider ... and I don't have any members in China.
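For reference, one way to block the range at the firewall (a sketch assuming plain iptables; adjust if you manage rules through a frontend):

Code:
# Drop all traffic from the 180.76.0.0/16 range mentioned above
iptables -I INPUT -s 180.76.0.0/16 -j DROP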
 
Thanks, but I've already got that range blocked (along with the 220. range I see for their 2.0 spider). I don't feel the need to give a portion of my bandwidth to a Chinese search engine that my sites don't benefit from. ISTR blocking the Russian one too; Yandex.

The thing I'm trying to do is discover who/what is making these additional requests, and the things I've tried so far haven't let me pin it down (I don't fancy spending all my evenings this week trawling through GBs of server logs trying to work it out manually ;) ).
 
Can you not do a grep on the website access log and look for the top 10 IP addresses appearing in there, and work backwards from that?
 
Something like this:

Create a bash script (I use vim):
vim ipadd.sh

Add this, giving the FILE variable the full path to the access log you want to check:
Code:
#!/bin/bash
# Report any IP with more than 500 hits in the access log.
FILE=/home/z22se/access-logs/z22se.co.uk
for ip in $(cut -d ' ' -f 1 "$FILE" | sort -u); do
    COUNT=$(grep -c "^$ip " "$FILE")   # trailing space: 1.2.3.4 won't also match 1.2.3.40
    if [[ "$COUNT" -gt 500 ]]; then echo "$COUNT:  $ip"; fi
done

Make it executable and then run it. I then did a whois on the result(s) to see who it is:
Code:
z22se@z22se.co.uk [~/scripts]# chmod +x ipadd.sh
z22se@z22se.co.uk [~/scripts]# ./ipadd.sh
1266:  66.249.71.15
z22se@z22se.co.uk [~/scripts]# whois 66.249.71.15
[Querying whois.arin.net]
[whois.arin.net]
#
# Query terms are ambiguous.  The query is assumed to be:
#    "n 66.249.71.15"
#
# Use "?" to get help.
#
 
#
# The following results may also be obtained via:
# http://whois.arin.net/rest/nets;q=66.249.71.15?showDetails=true&showARIN=false&ext=netref2
#
 
NetRange:      66.249.64.0 - 66.249.95.255
CIDR:          66.249.64.0/19
OriginAS:   
NetName:        GOOGLE
NetHandle:      NET-66-249-64-0-1
Parent:        NET-66-0-0-0-0
NetType:        Direct Allocation
RegDate:        2004-03-05
Updated:        2012-02-24
Ref:            http://whois.arin.net/rest/net/NET-66-249-64-0-1
 
 
OrgName:        Google Inc.
OrgId:          GOGL
Address:        1600 Amphitheatre Parkway
City:          Mountain View
StateProv:      CA
PostalCode:    94043
Country:        US
RegDate:        2000-03-30
Updated:        2011-09-24
Ref:            http://whois.arin.net/rest/org/GOGL
 
OrgAbuseHandle: ZG39-ARIN
OrgAbuseName:  Google Inc
OrgAbusePhone:  +1-650-253-0000
OrgAbuseEmail:  arin-contact@google.com
OrgAbuseRef:    http://whois.arin.net/rest/poc/ZG39-ARIN
 
OrgTechHandle: ZG39-ARIN
OrgTechName:  Google Inc
OrgTechPhone:  +1-650-253-0000
OrgTechEmail:  arin-contact@google.com
OrgTechRef:    http://whois.arin.net/rest/poc/ZG39-ARIN
 
#
# ARIN WHOIS data and services are subject to the Terms of Use
# available at: https://www.arin.net/whois_tou.html
#
 
z22se@z22se.co.uk [~/scripts]#
 
I've fixed my previous post, as I was trying to do it on the iPad, and copy / paste from "prompt" isn't the best :confused:
 
Does Google Analytics have any of this info?

I have httpd logging turned off on my server - for speed's sake.
 
If the user agent is loading the JavaScript then yes, it should show up in Analytics ... but I think most bots don't execute JavaScript. I know I never saw Baidu in my Analytics report.
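Since most bots never show up in Analytics, the access log is the place to look for them. A sketch for listing the top user agents - assuming the combined log format (where the user agent is the last quoted field) and an illustrative log path:

Code:
# Top user agents; splitting on " makes the UA field 6 in combined format
awk -F'"' '{print $6}' /path/to/access_log | sort | uniq -c | sort -rn | head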
 
No, not using a CDN. I've come across a handy command to count requests per IP in your log file:

Code:
awk '{print $1}' /path/to/access_log | sort | uniq -c | sort -n | tail

Just about to run it on a copy of the CC log file but at 6GB it might take a while ... ;)
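If the full sort is too slow on a file that size, a variation that only samples the most recent slice of the log might do (the 500,000-line figure is an arbitrary assumption):

Code:
# Top talkers in roughly the latest chunk of the log
tail -n 500000 /path/to/access_log | awk '{print $1}' | sort | uniq -c | sort -rn | head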
 
Are you rotating your logs? Apache appending to very large access_log files will increase the load as well. :)
 
Hmmm ... it's looking like Googlebot - with 10x more requests than the next-highest requesting IP address - but that's over a 20-day period.

I'm tailing the access_log to a separate text file for the next half hour to capture current requests and see if the IPs match up.
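It's worth confirming those requests really come from Google and not a scraper spoofing the user agent. Google's documented check is a reverse DNS lookup followed by a forward lookup of the returned name (the hostname below is illustrative of the pattern):

Code:
# Genuine Googlebot IPs reverse-resolve to *.googlebot.com or *.google.com
host 66.249.71.15
# Forward-confirm the returned name - it should resolve back to the same IP
host crawl-66-249-71-15.googlebot.com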
 
Yeah, logs are being rotated - however CycleChat is very busy so I perhaps need to rotate them more often.

TBH I've also considered turning them off entirely, as I rarely reference them for historical data (I use Analytics for that) and would gain back the CPU/IO.
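If you do drop access logging, a minimal sketch of what that looks like in a vhost config (paths illustrative; keep the error log either way):

Code:
# Comment out or remove the access log directive:
# CustomLog /var/log/apache2/cyclechat-access.log combined
ErrorLog /var/log/apache2/cyclechat-error.log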
 
Just checked - logs are set to rotate monthly. Would you recommend changing them to weekly or even daily?

[Edit: The logging hasn't been altered on my server in about 5 years ... lol - I used to use it all the time to check traffic etc. when I first got the server but never reference it at all nowadays. In that time CycleChat has gone from a few requests per day to over 15,000 per day (and that's just the ones Analytics catches ... ;) )]
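Daily rotation with compression is a reasonable default for a site that busy. A minimal sketch, assuming Debian's stock logrotate and Apache layout (adjust paths to your setup):

Code:
# /etc/logrotate.d/apache2 - rotate daily, keep two weeks, compress older logs
/var/log/apache2/*.log {
    daily
    rotate 14
    compress
    delaycompress
    missingok
    notifempty
    sharedscripts
    postrotate
        /etc/init.d/apache2 reload > /dev/null 2>&1
    endscript
}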
 
If you've got Google Webmaster Tools set up, have a look at the crawl rate in there to see what they think they are doing.

Did you get anything different only looking at the 30-minute snapshot?
 