
Higher than normal server load due to increased http requests - how to discover what/who?

Discussion in 'Server Configuration and Hosting' started by CyclingTribe, Jun 20, 2012.

  1. CyclingTribe

    CyclingTribe Well-Known Member

    Debian server with 8GB - normal server load around 0.90

    In recent weeks my server load has been averaging 2.80 (sometimes going quite a bit higher) and I have many more apache processes showing in #top than before and a higher number of tasks (usually around 130 with 1 running but recently increased to around 140-180 with up to 6 running).

    I've done a "tail -f" on the access_log files for each of my sites and the smaller sites are not showing any great amount of traffic so I've narrowed the actual target site down to my largest one, CycleChat.

    There are too many requests to gain anything useful from watching the tail. I have a command to summarise the connections [ netstat -alntp | grep :80 | wc -l ], but when I try to establish which IP addresses have the most requests to port 80 [ netstat -plan|grep :80|awk {'print $5'}|cut -d: -f 1|sort|uniq -c|sort -nk 1 ] the figures don't add up: there are far more connections than the IP shortlist accounts for.

    I'm assuming it's a bot or scraper but how do I find out which IP/s are responsible so I can add them to my firewall?

    Any help appreciated. (y)

    Shaun :D
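    One possible reason the two figures disagree is that grep :80 also matches ports such as :8080 and the listening socket itself, and cut -d: -f 1 may mangle IPv6-style addresses. A minimal sketch of a stricter count, assuming Linux netstat -ant column order (local address in field 4, remote address in field 5, state in field 6); count_port80_peers is a hypothetical helper name:

```shell
# Count connections per remote address where the LOCAL port is exactly 80.
# Reads `netstat -ant` style output on stdin (assumed Linux column layout).
count_port80_peers() {
    awk '$1 ~ /^tcp/ && $4 ~ /:80$/ && $6 != "LISTEN" {
             ip = $5
             sub(/:[0-9]+$/, "", ip)    # strip the trailing :port from the remote address
             count[ip]++
         }
         END { for (ip in count) print count[ip], ip }' | sort -rn
}
```

    Typical use would be: netstat -ant | count_port80_peers | head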
  2. Digital Doctor

    Digital Doctor Well-Known Member

    Are there any *SERVER* tools that block scraping ?
  3. MattW

    MattW Well-Known Member

    Do you have any / many requests from this range?

    That is the Chinese bot Baidu. When I blocked them from crawling my site the average load decreased, as they are an aggressive spider, and I don't have any members in China.
  4. CyclingTribe

    CyclingTribe Well-Known Member

    Thanks, but I've already got that range blocked (along with the 220. range I see for their 2.0 spider). I don't feel the need to give a portion of my bandwidth to a Chinese search engine that my sites don't benefit from. ISTR blocking the Russian one too; Yandex.

    The thing I'm trying to do is discover who/what is making these additional requests, and the things I've tried so far haven't enabled me to pin it down well enough (I don't fancy spending all my evenings this week trawling through GBs of server logs trying to work it out manually ;) ).
  5. MattW

    MattW Well-Known Member

    Can you not do a grep on the website access log and look for the top 10 ip addresses that are appearing in there, and work backwards on that?
  6. CyclingTribe

    CyclingTribe Well-Known Member

    I could if I knew how to get grep to give me the top 10 addresses? ;)
  7. MattW

    MattW Well-Known Member

    Something like this:

    create a bash script (I use vim)
    vim ipadd.sh

    Add this, and give the FILE variable the full path to the access log you want to check
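    #!/bin/bash
    FILE=/path/to/access_log   # set this to the full path of the access log
    # Print every client IP that appears more than 500 times in the log
    for ip in $(cut -d ' ' -f 1 "$FILE" | sort -u); do
        COUNT=$(grep -c "^$ip " "$FILE")   # trailing space stops 1.2.3.4 matching 1.2.3.40
        if [[ "$COUNT" -gt 500 ]]; then echo "$COUNT:  $ip"; fi
    done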
    make it executable, and then run it. I then did a whois on the result(s) to see who it is.
    z22se@z22se.co.uk [~/scripts]# chmod +x ipadd.sh
    z22se@z22se.co.uk [~/scripts]# ./ipadd.sh
    z22se@z22se.co.uk [~/scripts]# whois
    [Querying whois.arin.net]
    # Query terms are ambiguous.  The query is assumed to be:
    #    "n"
    # Use "?" to get help.
    # The following results may also be obtained via:
    # http://whois.arin.net/rest/nets;q=
    NetRange: -
    NetName:        GOOGLE
    NetHandle:      NET-66-249-64-0-1
    Parent:        NET-66-0-0-0-0
    NetType:        Direct Allocation
    RegDate:        2004-03-05
    Updated:        2012-02-24
    Ref:            http://whois.arin.net/rest/net/NET-66-249-64-0-1
    OrgName:        Google Inc.
    OrgId:          GOGL
    Address:        1600 Amphitheatre Parkway
    City:          Mountain View
    StateProv:      CA
    PostalCode:    94043
    Country:        US
    RegDate:        2000-03-30
    Updated:        2011-09-24
    Ref:            http://whois.arin.net/rest/org/GOGL
    OrgAbuseHandle: ZG39-ARIN
    OrgAbuseName:  Google Inc
    OrgAbusePhone:  +1-650-253-0000
    OrgAbuseEmail:  arin-contact@google.com
    OrgAbuseRef:    http://whois.arin.net/rest/poc/ZG39-ARIN
    OrgTechHandle: ZG39-ARIN
    OrgTechName:  Google Inc
    OrgTechPhone:  +1-650-253-0000
    OrgTechEmail:  arin-contact@google.com
    OrgTechRef:    http://whois.arin.net/rest/poc/ZG39-ARIN
    # ARIN WHOIS data and services are subject to the Terms of Use
    # available at: https://www.arin.net/whois_tou.html
    z22se@z22se.co.uk [~/scripts]#
  8. Digital Doctor

    Digital Doctor Well-Known Member

  9. MattW

    MattW Well-Known Member

    I've fixed my previous post, as I was trying to do it on the iPad, and copy / paste from "prompt" isn't the best :confused:
  10. craigiri

    craigiri Well-Known Member

    Does Google Analytics have any of this info?

    I have httpd logging turned off on my server, for speed's sake.
  11. CyclingTribe

    CyclingTribe Well-Known Member

    Hmmm ... I never thought about looking. I'll have a dig around later tonight. (y)
  12. MattW

    MattW Well-Known Member

    If the user agent is loading the JavaScript then yes, it should be in Analytics, but I think most bots don't load JavaScript. I know I never saw Baidu in my Analytics report.
  13. RobParker

    RobParker Well-Known Member

    Did you get anywhere with this? We've seen something similar in the past few days...
  14. BamaStangGuy

    BamaStangGuy Well-Known Member

    Are you using a CDN? It will help drop some Apache requests.
  15. CyclingTribe

    CyclingTribe Well-Known Member

    No, not using a CDN. I've come across a handy command to count unique IPs in your log file:

    cat /path/to/access_log | awk '{print $1}' | sort | uniq -c | sort -n | tail
    Just about to run it on a copy of the CC log file but at 6GB it might take a while ... ;)
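    On a log of that size, sorting every line can be the slow part. A rough single-pass alternative is to let awk keep a running count per IP and only sort the totals. This is a sketch that assumes the client IP is the first whitespace-separated field on each line; top_ips is a hypothetical helper name:

```shell
# Print the N busiest client IPs from an access log (default N = 10).
# Usage: top_ips /path/to/access_log [N]
top_ips() {
    # One pass over the file: tally requests per first field (the client IP),
    # then sort only the per-IP totals, highest first.
    awk '{ count[$1]++ } END { for (ip in count) print count[ip], ip }' "$1" |
        sort -rn | head -n "${2:-10}"
}
```

    The output has the same "count IP" shape as the uniq -c pipeline, just with the highest counts first.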
  16. BamaStangGuy

    BamaStangGuy Well-Known Member

    Are you rotating your logs? Apache accessing large access_log files will increase load as well. :)
  17. CyclingTribe

    CyclingTribe Well-Known Member

    Hmmm ... it's looking like Googlebot - with 10x more requests than the next highest requesting IP address - but that's over a 20 day period.

    I'm tailing the access_log to a separate text file for the next half hour to check current requests and see if the IPs match up.
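    Once a suspect IP stands out, it can help to see which user agents it claims to be. A minimal sketch, assuming the log is in Apache "combined" format (User-Agent as the last quoted field); agents_for_ip is a hypothetical helper name and the file can be the full log or the half-hour sample:

```shell
# For one client IP, list its most common User-Agent strings.
# Usage: agents_for_ip IP /path/to/log [N]   (default N = 5)
# Assumes Apache "combined" format: splitting on double quotes,
# field 2 is the request, field 4 the referer, field 6 the User-Agent.
agents_for_ip() {
    awk -F'"' -v ip="$1" '
        index($0, ip " ") == 1 { count[$6]++ }   # only lines that start with "<ip> "
        END { for (ua in count) print count[ua] "\t" ua }
    ' "$2" | sort -rn | head -n "${3:-5}"
}
```

    A single agent dominating the list would point to one crawler; many different agents from one IP could suggest a proxy or a scraper rotating its identity.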
  18. CyclingTribe

    CyclingTribe Well-Known Member

    Yeah, logs are being rotated - however CycleChat is very busy so I perhaps need to rotate them more often.

    TBH I've also considered turning them off as I rarely reference them for any historical data (I use Analytics for that) and would gain the CPU/IO by turning them off.
  19. CyclingTribe

    CyclingTribe Well-Known Member

    Just checked - logs are set to rotate monthly. Would you recommend changing them to weekly or even daily?

    [Edit: The logging hasn't been altered on my server in about 5 years ... lol - I used to use it all the time to check traffic etc. when I first got the server but never reference it at all nowadays. In that time CycleChat has gone from a few requests per day to over 15,000 per day (and that's just the ones Analytics catches ... ;) )]
  20. MattW

    MattW Well-Known Member

    If you've got Google Webmaster Tools set up, have a look at the crawl rate in there to see what they think they are doing.

    Did you get anything different when looking only at a 30-minute snapshot?
