In this article I'm going to show you how to identify and then block bad robots that may be using up system resources on your server.

What is a bad robot?

There are many reasons an automated robot might crawl through your website. The most common is so large search engines such as Google or Bing can find all the content on your website and serve it up to their users in response to the search queries they run on those services.

These robots are supposed to follow rules that you place in a robots.txt file, such as how frequently they are allowed to request pages and which directories they are allowed to crawl. They should also supply a consistent, valid User-Agent string that identifies their requests as bot requests.
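As an illustration, a minimal robots.txt might look like this (the BadBot name and the paths here are hypothetical examples):

```
User-agent: *
Crawl-delay: 10
Disallow: /cgi-bin/
Disallow: /tmp/

User-agent: BadBot
Disallow: /
```

Keep in mind that Crawl-delay is a non-standard directive that not every crawler honors, and a bad robot will typically ignore the whole file anyway.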

A bad robot usually ignores robots.txt rules, requests pages too quickly, re-visits your site too frequently, attempts to harvest email addresses, or in general simply provides no value back to your website. When a good robot crawls your site, it is typically so other people can find your content and be directed to it from a search engine. When a bad robot crawls through your site, it could be for malicious intentions, such as copying your content to pass off as its own.

Identify a bad robot

Using the steps below, you can verify whether a robot is a good or a bad one.

Please note that to follow these steps you will need to be on either a VPS (Virtual Private Server) or a dedicated server with SSH access. If you're on a shared server, you can read our guide on enabling raw access log archiving in cPanel to view the same data, but you would need to analyze the logs on your own local computer.

  1. Log in to your server via SSH.
  2. Navigate to your user's home directory where the Apache access logs are stored. In this case our username is userna5, so we'll use the following command:

    cd ~userna5/access-logs/

  3. We can now use the following command to see all User-Agents that have requests to our example.com website:

    cat example.com | awk -F\" '{print $6}' | sort | uniq -c | sort -n

    This gives us back the output:

    638 Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
    1015 msnbot-UDiscovery/2.0b (+http://search.msn.com/msnbot.htm)
    1344 Mozilla/5.0 (en-US) AppleWebKit/537.4 (KHTML, like Gecko) Chrome/22.0.1229 Safari/537.4 pss-webkit-request
    21937 -

    So in this case, there have been 21,937 requests that didn't supply a User-Agent string at all. This is an immediate red flag, as any human visitor requesting a page from your website will typically send the User-Agent string of their web browser with each request.

    We can also see that the next highest number of requests came from something calling itself pss-webkit-request, followed by msnbot-UDiscovery/2.0b, and then Googlebot/2.1.
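To see why field 6 is the User-Agent, here is what that awk split does to a single hypothetical combined-format log line: splitting on double quotes, the sixth field is the quoted User-Agent string.

```shell
# Hypothetical log line in Apache combined format; with -F\" awk splits
# on the double-quote character, so field 6 is the User-Agent string.
echo '203.0.113.5 - - [10/Oct/2023:13:55:36 -0700] "GET / HTTP/1.1" 200 2326 "-" "ExampleBot/1.0"' \
  | awk -F\" '{print $6}'
# prints: ExampleBot/1.0
```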

  4. Now we can isolate the requests that didn't provide a User-Agent string, and then list the unique IP addresses that sent those requests in, with the following command:

    cat example.com | awk -F\" '$6 == "-"' | awk '{print $1}' | sort -n | uniq -c | sort -n

    In this example here are the top IPs that had requests without a User-Agent string:

    421 74.125.176.94
    434 74.125.176.85
    463 74.125.176.95
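To illustrate how that filter works, here it is run against two hypothetical log lines; only the request whose User-Agent field is a literal - makes it through.

```shell
# Two hypothetical combined-format log lines: the first sends no
# User-Agent ("-"), the second identifies itself as GoodBot/1.0.
log='198.51.100.7 - - [10/Oct/2023:13:55:36 -0700] "GET / HTTP/1.1" 200 512 "-" "-"
203.0.113.5 - - [10/Oct/2023:13:55:37 -0700] "GET / HTTP/1.1" 200 2326 "-" "GoodBot/1.0"'

# Keep only lines whose User-Agent field is exactly "-", then print the IP.
echo "$log" | awk -F\" '$6 == "-"' | awk '{print $1}'
# prints: 198.51.100.7
```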

  5. We can now search for these IP addresses in our access log to see what might be going on with their requests. The following command looks for the User-Agent strings coming from 74.125.176.95, the IP address with the most requests:

    grep 74.125.176.95 example.com | awk -F\" '{print $6}' | sort | uniq -c | sort -n

    This gives us back:

    7 Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17
    11 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17
    29 Mozilla/5.0 (en-US) AppleWebKit/537.4 (KHTML, like Gecko) Chrome/22.0.1229 Safari/537.4 pss-webkit-request
    434 -

    Again, this is a red flag: typically, requests coming from a single IP address will use the same User-Agent string for each request. We can go a step further and run the following command to find out more information about the IP address:

    whois 74.125.176.95

    Some of the pertinent information that command gives us back for that IP is:

    NetRange: 74.125.0.0 - 74.125.255.255
    NetName: GOOGLE
    OrgName: Google Inc.
    OrgId: GOGL
    Address: 1600 Amphitheatre Parkway
    City: Mountain View
    StateProv: CA
    PostalCode: 94043
    Country: US
    RegDate: 2000-03-30
    Updated: 2011-09-24
    Ref: http://whois.arin.net/rest/org/GOGL

    So this is an IP address that belongs to Google, as are all the other IPs we saw coming from the 74.125 range. However, this traffic isn't coming from the official Googlebot crawler, which would identify itself as such in the User-Agent string; instead, these requests appear to come from Google's App Engine/cloud service.

    These requests are more than likely from custom crawlers that other users have built, and in some cases they could simply be copying your content for their own purposes instead of providing links back to you, which falls under our definition of a bad robot.
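One way to double-check a crawler that claims to be Googlebot is the reverse-then-forward DNS test Google recommends: look up the IP's PTR record with host, confirm the hostname ends in googlebot.com or google.com, then confirm a forward lookup of that hostname returns the original IP. The hostname-check half of that test can be sketched as a small shell function (the hostnames below are illustrative):

```shell
# Return success only if a reverse-DNS hostname ends in googlebot.com or
# google.com, per Google's verification guidance. This is only half of
# the test: you should still forward-resolve the hostname (host <name>)
# and confirm it maps back to the original IP.
is_google_host() {
  case "$1" in
    *.googlebot.com|*.google.com) return 0 ;;
    *) return 1 ;;
  esac
}

is_google_host "crawl-66-249-66-1.googlebot.com" && echo "hostname checks out"
is_google_host "74-125-176-95.cloud.example.com" || echo "not a Googlebot hostname"
```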

Block a bad robot

Now that you understand a bit about how to identify a possible bad robot, the next step is to block that robot if it has been causing usage problems on your website.

Using the steps below, I'll show you how to block the entire 74.125 IP range we were seeing from accessing the example.com website, while still allowing requests through if they happen to mention Google in their User-Agent string.

  1. Edit the .htaccess file for your website with the following command:

    vim ~userna5/public_html/.htaccess

    Once the vim text editor has loaded the file, press i to enter Insert mode, then enter the following code (in most SSH clients you can also right-click to paste text from the clipboard):

    ErrorDocument 503 "Site disabled for crawling"
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} !^.*(Google).*$
    RewriteCond %{REMOTE_ADDR} ^74\.125\.
    RewriteRule .* - [R=503,L]

    Once you've entered that information in, hit Esc to exit Insert mode, then hold down Shift and type in ZZ to save the file.

    These rules first check whether the User-Agent string contains the word Google anywhere; if it does, the request is allowed through. If it does not, the next condition checks whether the IP address begins with 74.125. If it does, the server answers with a 503 response of Site disabled for crawling, which uses very minimal server resources, instead of letting the bot hit your various website pages and drive up usage.
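The decision those rules make can be sketched in shell, assuming the same two checks (a User-Agent lacking the word Google, and a client IP starting with 74.125):

```shell
# Mimic the .htaccess logic above: block only when the User-Agent does
# NOT contain "Google" AND the remote IP starts with 74.125.
should_block() {  # $1 = User-Agent, $2 = remote IP
  case "$1" in
    *Google*) return 1 ;;   # any mention of Google is let through
  esac
  case "$2" in
    74.125.*) return 0 ;;   # blocked: matches the bad-bot IP range
  esac
  return 1                  # everyone else is let through
}

should_block "-" "74.125.176.95" && echo "503 Site disabled for crawling"
should_block "Mozilla/5.0 (compatible; Googlebot/2.1)" "74.125.176.95" || echo "allowed"
```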

  2. As a bonus, you can use the following command to see how many requests per hour your rule has saved your server from having to serve to these bad bots:

    cat ~userna5/access-logs/example.com | grep "^74\.125\." | awk '$9 == 503' | cut -d[ -f2 | cut -d] -f1 | awk -F: '{print $2":00"}' | sort -n | uniq -c | sed 's/^[ ]*//'

    This gives you back the number of requests blocked per hour by your rule:

    2637 14:00
    2823 15:00
    2185 16:00
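The cut and awk stages of that pipeline just pull the hour out of each log timestamp and label it as an hourly bucket; here is that extraction run on a single hypothetical timestamp field:

```shell
# Extract the hour from an Apache-style timestamp field and turn it into
# an hourly bucket, exactly as the reporting pipeline above does.
echo '[10/Oct/2023:14:55:36 -0700]' | cut -d[ -f2 | cut -d] -f1 | awk -F: '{print $2":00"}'
# prints: 14:00
```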

You should now understand how to identify and then block bad robots from causing usage problems on your website with their excessive requests.
