Higher than normal server load due to increased HTTP requests - how to discover what/who?

CTXMedia

Debian server with 8GB RAM - normal server load around 0.90

In recent weeks my server load has been averaging 2.80 (sometimes going quite a bit higher). I'm seeing many more Apache processes in top than before, and a higher number of tasks - usually around 130 with 1 running, but recently up around 140-180 with as many as 6 running.

I've done a "tail -f" on the access_log files for each of my sites; the smaller sites aren't showing any great amount of traffic, so I've narrowed the actual target down to my largest site, CycleChat.
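Something like this gives a quicker comparison than watching each tail by eye - it counts the current hour's requests in each log (a rough sketch: it assumes Apache's standard English timestamps, and the log paths are illustrative):

Code:
# Requests logged during the current hour, per site
for f in /var/log/apache2/*access*.log; do
    printf '%s: ' "$f"
    grep -c "$(LC_ALL=C date +%d/%b/%Y:%H)" "$f"
done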

There are too many requests to gain anything useful from watching the tail. I have a command to summarise the connections [ netstat -alntp | grep :80 | wc -l ], but when I try to establish which IP addresses have the most requests to port 80 [ netstat -plan | grep :80 | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -nk1 ] the figures don't add up - there are far more connections in total than are accounted for by the per-IP list.
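I suspect part of the mismatch is that grep :80 matches ":80" anywhere on the line - including :8080, the LISTEN entry, and connections where 80 is the remote rather than the local port - and the total includes states like TIME_WAIT. A sketch that keys on the local port only (assuming the usual netstat columns - $4 local address, $5 foreign address, $6 state - and IPv4 addresses):

Code:
# Established connections to local port 80, counted per remote IP
netstat -ant | awk '$4 ~ /:80$/ && $6 == "ESTABLISHED" { split($5, a, ":"); print a[1] }' | sort | uniq -c | sort -rn | head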

I'm assuming it's a bot or scraper, but how do I find out which IP(s) are responsible so I can add them to my firewall?

Any help appreciated. (y)

Thanks,
Shaun :D
 
Do you have any / many requests from this IP range: 180.76.0.0/16?

That range belongs to Baidu, the Chinese search bot. When I blocked them from crawling my site the average load decreased, as they are an aggressive spider ... and I don't have any members in China.
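For reference, one way to block the range at the firewall (a sketch assuming plain iptables; adjust if you manage rules through a frontend):

Code:
# Drop all traffic from the 180.76.0.0/16 range mentioned above
iptables -I INPUT -s 180.76.0.0/16 -j DROP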
 
Thanks, but I've already got that range blocked (along with the 220. range I see for their 2.0 spider). I don't feel the need to give a portion of my bandwidth to a Chinese search engine that my sites don't benefit from. ISTR blocking the Russian one too; Yandex.

The thing I'm trying to do is discover who/what is making these additional requests, and the things I've tried so far haven't let me pin it down (I don't fancy spending all my evenings this week trawling through GBs of server logs trying to work it out manually ;) ).
 
Can you not do a grep on the website access log and look for the top 10 IP addresses appearing in there, and work backwards from that?
 
Something like this:

Create a bash script (I use vim):
vim ipadd.sh

Add this, giving the FILE variable the full path to the access log you want to check:
Code:
#!/bin/bash
# Report any IP with more than 500 hits in the access log.
FILE=/home/z22se/access-logs/z22se.co.uk
for ip in $(cut -d ' ' -f 1 "$FILE" | sort -u); do
    COUNT=$(grep -c "^$ip " "$FILE")   # trailing space: 1.2.3.4 won't also match 1.2.3.40
    if [[ "$COUNT" -gt 500 ]]; then echo "$COUNT:  $ip"; fi
done

Make it executable and then run it. I then did a whois on the result(s) to see who it is:
Code:
z22se@z22se.co.uk [~/scripts]# chmod +x ipadd.sh
z22se@z22se.co.uk [~/scripts]# ./ipadd.sh
1266:  66.249.71.15
z22se@z22se.co.uk [~/scripts]# whois 66.249.71.15
[Querying whois.arin.net]
[whois.arin.net]
#
# Query terms are ambiguous.  The query is assumed to be:
#    "n 66.249.71.15"
#
# Use "?" to get help.
#
 
#
# The following results may also be obtained via:
# http://whois.arin.net/rest/nets;q=66.249.71.15?showDetails=true&showARIN=false&ext=netref2
#
 
NetRange:      66.249.64.0 - 66.249.95.255
CIDR:          66.249.64.0/19
OriginAS:   
NetName:        GOOGLE
NetHandle:      NET-66-249-64-0-1
Parent:        NET-66-0-0-0-0
NetType:        Direct Allocation
RegDate:        2004-03-05
Updated:        2012-02-24
Ref:            http://whois.arin.net/rest/net/NET-66-249-64-0-1
 
 
OrgName:        Google Inc.
OrgId:          GOGL
Address:        1600 Amphitheatre Parkway
City:          Mountain View
StateProv:      CA
PostalCode:    94043
Country:        US
RegDate:        2000-03-30
Updated:        2011-09-24
Ref:            http://whois.arin.net/rest/org/GOGL
 
OrgAbuseHandle: ZG39-ARIN
OrgAbuseName:  Google Inc
OrgAbusePhone:  +1-650-253-0000
OrgAbuseEmail:  arin-contact@google.com
OrgAbuseRef:    http://whois.arin.net/rest/poc/ZG39-ARIN
 
OrgTechHandle: ZG39-ARIN
OrgTechName:  Google Inc
OrgTechPhone:  +1-650-253-0000
OrgTechEmail:  arin-contact@google.com
OrgTechRef:    http://whois.arin.net/rest/poc/ZG39-ARIN
 
#
# ARIN WHOIS data and services are subject to the Terms of Use
# available at: https://www.arin.net/whois_tou.html
#
 
z22se@z22se.co.uk [~/scripts]#
 
I've fixed my previous post, as I was trying to do it on the iPad, and copy / paste from "prompt" isn't the best :confused:
 
Does Google Analytics have any of this info?

I have httpd logging turned off on my server - for speed's sake.
 
If the user agent is loading the JavaScript then yes, it should show up in Analytics ... but I think most bots don't execute JavaScript. I know I never saw Baidu in my Analytics report.
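Since most bots never show up in Analytics, the access log is the place to look for them. A sketch for listing the top user agents - assuming the combined log format (where the user agent is the last quoted field) and an illustrative log path:

Code:
# Top user agents; splitting on " makes the UA field 6 in combined format
awk -F'"' '{print $6}' /path/to/access_log | sort | uniq -c | sort -rn | head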
 
No, not using a CDN. I've come across a handy command to count requests per IP in your log file:

Code:
awk '{print $1}' /path/to/access_log | sort | uniq -c | sort -n | tail

Just about to run it on a copy of the CC log file but at 6GB it might take a while ... ;)
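If the full sort is too slow on a file that size, a variation that only samples the most recent slice of the log might do (the 500,000-line figure is an arbitrary assumption):

Code:
# Top talkers in roughly the latest chunk of the log
tail -n 500000 /path/to/access_log | awk '{print $1}' | sort | uniq -c | sort -rn | head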
 
Are you rotating your logs? Apache appending to very large access_log files will increase the load as well. :)
 
Hmmm ... it's looking like Googlebot - with 10x more requests than the next-highest requesting IP address - but that's over a 20-day period.

I'm tailing the access_log to a separate text file for the next half hour to capture current requests and see if the IPs match up.
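It's worth confirming those requests really come from Google and not a scraper spoofing the user agent. Google's documented check is a reverse DNS lookup followed by a forward lookup of the returned name (the hostname below is illustrative of the pattern):

Code:
# Genuine Googlebot IPs reverse-resolve to *.googlebot.com or *.google.com
host 66.249.71.15
# Forward-confirm the returned name - it should resolve back to the same IP
host crawl-66-249-71-15.googlebot.com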
 
Yeah, logs are being rotated - however CycleChat is very busy so I perhaps need to rotate them more often.

TBH I've also considered turning them off entirely, as I rarely reference them for historical data (I use Analytics for that) and would gain back the CPU/IO.
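If you do drop access logging, a minimal sketch of what that looks like in a vhost config (paths illustrative; keep the error log either way):

Code:
# Comment out or remove the access log directive:
# CustomLog /var/log/apache2/cyclechat-access.log combined
ErrorLog /var/log/apache2/cyclechat-error.log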
 
Just checked - logs are set to rotate monthly. Would you recommend changing them to weekly or even daily?

[Edit: The logging hasn't been altered on my server in about 5 years ... lol - I used to use it all the time to check traffic etc. when I first got the server but never reference it at all nowadays. In that time CycleChat has gone from a few requests per day to over 15,000 per day (and that's just the ones Analytics catches ... ;) )]
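Daily rotation with compression is a reasonable default for a site that busy. A minimal sketch, assuming Debian's stock logrotate and Apache layout (adjust paths to your setup):

Code:
# /etc/logrotate.d/apache2 - rotate daily, keep two weeks, compress older logs
/var/log/apache2/*.log {
    daily
    rotate 14
    compress
    delaycompress
    missingok
    notifempty
    sharedscripts
    postrotate
        /etc/init.d/apache2 reload > /dev/null 2>&1
    endscript
}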
 
If you've got Google Webmaster Tools set up, have a look at the crawl rate in there to see what they think they are doing.

Did you get anything different only looking at the 30-minute snapshot?
 