For other people to find your website, search engine crawlers, sometimes called bots or spiders, crawl it looking for new text and links to update their search indexes with.

Control search engine crawlers with a robots.txt file

Website owners can tell search engines how they should crawl a website by using a robots.txt file.

When a search engine crawls a website, it requests the robots.txt file first and then follows the rules within.

It's important to know that robots.txt rules are only a guideline, and bots don't have to follow them.

For instance, Googlebot does not follow the Crawl-delay rule; to set a crawl rate for Google, you must do it in Google Webmaster Tools.

For bad bots that abuse your site, you should look at how to block bad users by User-agent in .htaccess.
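As a minimal sketch of the check a well-behaved crawler performs before fetching a page, the Python snippet below parses a set of rules with the standard library's urllib.robotparser module and asks whether a URL may be fetched. The rules text, the helper name is_allowed, and the bot name ExampleBot are made up for illustration, and real crawlers may interpret rules somewhat differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, for illustration only.
RULES = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(rules_text, user_agent, url):
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(RULES, "ExampleBot", "http://example.com/index.htm"))          # True
print(is_allowed(RULES, "ExampleBot", "http://example.com/private/notes.htm"))  # False
```

Nothing forces a bot to run a check like this, which is why the rules are only a guideline.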

Edit or create robots.txt file

The robots.txt file needs to be at the root of your site. If your domain is example.com, it should be found at:

On your website:

http://example.com/robots.txt

On your server:

/home/userna5/public_html/robots.txt

If you don't already have one, you can create a new plain-text file and name it robots.txt.
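Because the file always lives at the root of the site, its address can be worked out from any page URL. The sketch below shows one way to do that in Python; the function name robots_url and the sample page URL are assumptions for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Build the robots.txt URL for the site that page_url belongs to."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, and replace everything else with /robots.txt.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://example.com/store/item.htm?id=5"))
# http://example.com/robots.txt
```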

Search engine User-agents

The most common rule you'd use in a robots.txt file is based on the User-agent of the search engine crawler.

Search engine crawlers use a User-agent to identify themselves when crawling. Here are some common examples:

Top 3 US search engine User-agents:

Googlebot
Yahoo! Slurp
bingbot

Commonly blocked search engine User-agents:

AhrefsBot
Baiduspider
Ezooms
MJ12bot
YandexBot

Search engine crawler access via robots.txt file

There are quite a few options when it comes to controlling how your site is crawled with the robots.txt file.

The User-agent: rule specifies which User-agent the rule applies to, and * is a wildcard matching any User-agent.

Disallow: sets the files or folders that are not allowed to be crawled.

Set a crawl delay for all search engines:

If you had 1,000 pages on your website, a search engine could potentially crawl your entire site in a few minutes.

However, this could cause high system resource usage, with all of those pages loaded in a short time period.

A Crawl-delay: of 30 seconds would allow crawlers to crawl your entire 1,000-page website in about 8.3 hours.

A Crawl-delay: of 500 seconds would allow crawlers to crawl your entire 1,000-page website in about 5.8 days.
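The figures above come from multiplying the number of pages by the delay. Here is a small Python sketch of that arithmetic; the function name crawl_duration_seconds is made up for illustration, and it assumes the crawler requests one page per delay interval.

```python
def crawl_duration_seconds(pages, crawl_delay):
    """Total crawl time if a bot waits crawl_delay seconds between pages."""
    return pages * crawl_delay


hours = crawl_duration_seconds(1000, 30) / 3600    # 30,000 seconds
days = crawl_duration_seconds(1000, 500) / 86400   # 500,000 seconds

print(round(hours, 1))  # 8.3
print(round(days, 1))   # 5.8
```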

You can set the Crawl-delay: for all search engines at once with:

User-agent: *
Crawl-delay: 30
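To see how a crawler might read this rule, the sketch below parses it with Python's standard urllib.robotparser module, whose crawl_delay method is available in Python 3.6 and later. The bot name ExampleBot is hypothetical, and this only shows how one parser reads the value, not how any particular search engine behaves.

```python
from urllib.robotparser import RobotFileParser

# The crawl delay rule from above, as a string.
DELAY_RULES = """\
User-agent: *
Crawl-delay: 30
"""

delay_parser = RobotFileParser()
delay_parser.parse(DELAY_RULES.splitlines())

# ExampleBot has no group of its own, so the * group applies to it.
delay = delay_parser.crawl_delay("ExampleBot")
print(delay)  # 30
```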

Allow all search engines to crawl website:

By default, search engines should be able to crawl your website, but you can also specify that they are allowed with:

User-agent: *
Disallow: 

Disallow all search engines from crawling website:

You can disallow all search engines from crawling your website with these rules:

User-agent: *
Disallow: /
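The difference between an empty Disallow: and Disallow: / is easy to miss, so here is a short Python sketch that compares the two rule sets with the standard urllib.robotparser module. The helper name can_crawl, the bot name ExampleBot, and the sample URL are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

ALLOW_ALL = "User-agent: *\nDisallow:\n"
DISALLOW_ALL = "User-agent: *\nDisallow: /\n"


def can_crawl(rules_text, url):
    """Check url against rules_text for a hypothetical bot named ExampleBot."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser.can_fetch("ExampleBot", url)


page = "http://example.com/index.htm"
print(can_crawl(ALLOW_ALL, page))     # True: an empty Disallow blocks nothing
print(can_crawl(DISALLOW_ALL, page))  # False: / matches every path on the site
```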

Disallow one particular search engine from crawling website:

You can disallow just one specific search engine from crawling your website with these rules:

User-agent: Baiduspider
Disallow: /
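As a quick check that only the named crawler is affected, the sketch below runs these rules through Python's standard urllib.robotparser module. The helper name blocked_for and the sample URL are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

ONE_BOT_RULES = """\
User-agent: Baiduspider
Disallow: /
"""


def blocked_for(user_agent, url="http://example.com/index.htm"):
    """Return True if ONE_BOT_RULES stop user_agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(ONE_BOT_RULES.splitlines())
    return not parser.can_fetch(user_agent, url)


print(blocked_for("Baiduspider"))  # True
print(blocked_for("Googlebot"))    # False: no group applies, so crawling is allowed
```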

Disallow all search engines from particular folders:

If we had a few directories like /cgi-bin/, /private/, and /tmp/ that we didn't want bots to crawl, we could use this:

User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
Disallow: /tmp/
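A Disallow: path is matched against the beginning of each requested path, so everything inside a listed folder is covered. The Python sketch below illustrates this with the standard urllib.robotparser module; the helper name folder_allowed, the bot name ExampleBot, and the sample file names are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

FOLDER_RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
Disallow: /tmp/
"""

folder_parser = RobotFileParser()
folder_parser.parse(FOLDER_RULES.splitlines())


def folder_allowed(path):
    """Return True if a hypothetical ExampleBot may fetch path on example.com."""
    return folder_parser.can_fetch("ExampleBot", "http://example.com" + path)


print(folder_allowed("/private/reports/2014.htm"))  # False: inside /private/
print(folder_allowed("/tmp/"))                      # False: the folder itself
print(folder_allowed("/public/index.htm"))          # True: not listed
```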

Disallow all search engines from particular files:

If we had files like contactus.htm, index.htm, and store.htm that we didn't want bots to crawl, we could use this:

User-agent: *
Disallow: /contactus.htm
Disallow: /index.htm
Disallow: /store.htm
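One detail worth checking with file rules is that they match the path as requested. In the sketch below, which uses Python's standard urllib.robotparser module, /index.htm is blocked but the bare / address is not, even though a server may well serve the same page for both. The helper name file_allowed and the bot name ExampleBot are hypothetical, and other parsers may behave differently.

```python
from urllib.robotparser import RobotFileParser

FILE_RULES = """\
User-agent: *
Disallow: /contactus.htm
Disallow: /index.htm
Disallow: /store.htm
"""

file_parser = RobotFileParser()
file_parser.parse(FILE_RULES.splitlines())


def file_allowed(path):
    """Return True if a hypothetical ExampleBot may fetch path on example.com."""
    return file_parser.can_fetch("ExampleBot", "http://example.com" + path)


print(file_allowed("/contactus.htm"))  # False
print(file_allowed("/about.htm"))      # True
print(file_allowed("/"))               # True: / does not start with /index.htm
```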

Disallow all search engines but one:

If we wanted to allow only Googlebot access to our /private/ directory and disallow all other bots, we could use:

User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow:

When Googlebot reads our robots.txt file, it follows the group that names it specifically instead of the * group, so it will see that it is not disallowed from crawling any directories.
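The effect of the two groups can be checked with a short Python sketch using the standard urllib.robotparser module. The helper name private_allowed, the bot name OtherBot, and the sample file name are made up for illustration; this shows how one parser resolves the groups, not a guarantee about every crawler.

```python
from urllib.robotparser import RobotFileParser

EXCEPTION_RULES = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow:
"""

exception_parser = RobotFileParser()
exception_parser.parse(EXCEPTION_RULES.splitlines())


def private_allowed(user_agent):
    """Return True if user_agent may fetch a page inside /private/."""
    url = "http://example.com/private/page.htm"
    return exception_parser.can_fetch(user_agent, url)


print(private_allowed("Googlebot"))  # True: its own group disallows nothing
print(private_allowed("OtherBot"))   # False: falls back to the * group
```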

n/a Points
2014-04-17 6:46 pm

That idea of blocking search engines worked perfect on my site.

Thanks for the precise example you have in lower half.

Abhi from http://howzthatim.com

n/a Points
2014-06-02 5:29 am

I have two websites pointed to same folder. How can I disallow one website.

Staff
8,517 Points
2014-06-02 8:32 am
As the robots.txt file only determines what files are able to be accessed, unfortunately you would not be able to block a specific domain if it uses the same files as another site that you do want to be accessed.
n/a Points
2014-06-16 12:13 pm

I have looked for info about robot.txt on the web numerous times and this is the only one that made sense. thank you so much!!!

n/a Points
2014-07-18 6:28 am

Thanks for a detailed explanation on this all important topic. God bless you.

 

n/a Points
2014-07-24 10:35 am

I would like to disallow semalt and semalt-semalt crawlers from wreaking havoc on my bounce rate. If I use the code to disallow one particular search engine, do I need to write this code twice? Once for each individual crawler? Or maybe a comma between them? Thank you

Staff
9,037 Points
2014-07-24 10:54 am
Hello Mark,

Thank you for your question. It seems to be a common problem, judging by the amount of search results.

I found the following solution via online search, where it is blocked by referrer:

# block visitors referred from semalt.com
RewriteEngine on
RewriteCond %{HTTP_REFERER} semalt\.com [NC]
RewriteRule .* - [F]


If you have any further questions, feel free to post them below.
Thank you,

-John-Paul
n/a Points
2014-07-26 3:06 am

With regards to the crawl delay, so do i understand this correctly, if you introduce a longer delay for a bot to crawl your site, it doesn't reduce the cpu load, merely spreads it out over a longer period ?

Staff
9,968 Points
2014-07-26 10:48 am
Hello Andy,

Yes, you understand the crawl delay for robots correctly: it just causes the robot's requests to be spread out over a longer time period. Much like a highway dealing with traffic jams, high amounts of usage during short intervals of time can cause backups and delays. If the usage is spread out over the course of a day, it's not as noticeable on the highway or the server, and that's typically what you're trying to achieve with a crawl delay.

Please let us know if you had any further questions at all.

- Jacob
n/a Points
2014-08-11 2:53 pm

Good information, thanks.

n/a Points
2014-08-29 9:49 am

Hello!

I am currently developing a larger website, and while it is still in development I'd prefer that search engines do not crawl through it until I am finished. This way I can post the site so that multiple developers can code and test without the world knowing the site exists on Google and such. It seems to me that the code above would do that. Am I correct in my assessment?

Thanks,

Jay

Staff
18,784 Points
2014-08-29 11:30 am
Hello Jay,

Unfortunately, the robots.txt file is only a request to crawlers. It does not stop anyone from visiting the site, and not every bot follows it. The best way to prevent anyone else from seeing the site, or having the search engines index it until you are ready, is to password protect the site via cPanel.

Kindest Regards,
Scott M
n/a Points
2014-08-29 12:14 pm

Thanks Scott for a great tip!

n/a Points
2014-09-30 5:34 am

Hello guys, I want to stop the Yahoo, Google, and Bing search engines from crawling my site. How can this be done?

Staff
9,037 Points
2014-09-30 9:09 am
Hello Ankit,

This article above is about just that. You can Disallow all search engines from crawling website, or just block the specific user-agents for yahoo, google, and bing (the user agents are listed above).

Are you having trouble with a specific step?

Thank you,
John-Paul
n/a Points
2014-10-01 3:31 am

Guys, I am having more problems related to SEO. My website is built in ASP.NET with the 3.5 framework, and I want a solution to the www and home.aspx 301 redirection problem. What exactly should the code be for my website (www.rasavgems.com), and in which file should I use it? Please explain it in detail with steps.

 

Thanks

Ankit

Staff
18,784 Points
2014-10-01 8:54 am
Hello Ankit,

I am not sure exactly what it is you are asking. Please try to be a bit more detailed and give us some steps if you can. Also, as this does not seem to be related to the robots.txt file, please reply with a new question.

Kindest Regards,
Scott M
n/a Points
2014-10-01 3:15 am

Hey Johnpaulb

i used following kind of the methods :

# robots.txt generated for google
User-agent: Googlebot
Disallow: /
User-agent: *
Disallow: / 


# robots.txt generated for yahoo
User-agent: Slurp
Disallow: /
User-agent: *
Disallow: /


# robots.txt generated for Msn
User-agent: MSNBot
Disallow: /
User-agent: *
Disallow: /


# robots.txt generated for ask
User-agent: Teoma
Disallow: /
User-agent: *
Disallow: /


# robots.txt generated for bingbot
User-agent: bingbot
Disallow: /
User-agent: *
Disallow: /

Please tell me, is this okay for stopping the search engines from crawling my site? I uploaded a robots.txt file using all of the methods above together in one robots.txt file.
Staff
18,784 Points
2014-10-01 8:56 am
Hello Ankit,

It is fine if you do not want a search engine to crawl your site. If it does not, however, it means those pages may not get updated in the search engine or even show at all. If you want to show up in your favorite search engines, allow them to crawl your site at a reasonable delay. You can certainly set the file to block the others.

Kindest Regards,
Scott M
