For your website to be found by other people, search engine crawlers, also sometimes referred to as bots or spiders, crawl your website looking for updated text and links with which to update their search indexes.

Control search engine crawlers with a robots.txt file

Website owners can instruct search engines on how they should crawl a website by using a robots.txt file.

When a search engine crawls a website, it requests the robots.txt file first and then follows the rules within.
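
The check a well-behaved crawler performs can be sketched in Python with the standard library's urllib.robotparser module. The rules, the ExampleBot name, and the may_crawl helper are illustrative assumptions, not part of any real crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content a crawler might receive from a site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# A polite crawler would run this check before requesting each page.
print(may_crawl(ROBOTS_TXT, "ExampleBot", "http://example.com/index.htm"))
print(may_crawl(ROBOTS_TXT, "ExampleBot", "http://example.com/private/a"))
```

Nothing forces a crawler to run a check like this; it is a courtesy the crawler's author chooses to implement.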

It's important to know that robots.txt rules are only a guideline, and bots don't have to follow them.

For instance, Googlebot does not honor a Crawl-delay rule in robots.txt; to set a crawl rate for Google, this must be done in Google Webmaster Tools.

For bad bots that abuse your site, you should look at how to block bad users by User-agent in .htaccess.

Edit or create robots.txt file

The robots.txt file needs to be at the root of your site. If your domain is example.com, it should be found:

On your website:

http://example.com/robots.txt

On your server:

/home/userna5/public_html/robots.txt
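
Because the file always sits at the root of the site, its address can be derived from any page URL. A minimal sketch, assuming a hypothetical helper named robots_url_for:

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url_for(page_url: str) -> str:
    """Build the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host; replace the path and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url_for("http://example.com/store/item.htm?id=1"))
```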

If you don't already have one, you can create a new plain-text file and name it robots.txt.

Search engine User-agents

The most common rule you'd use in a robots.txt file is based on the User-agent of the search engine crawler.

Search engine crawlers use a User-agent to identify themselves when crawling. Here are some common examples:

Top 3 US search engine User-agents:

Googlebot
Yahoo! Slurp
bingbot

Commonly blocked search engine User-agents:

AhrefsBot
Baiduspider
Ezooms
MJ12bot
YandexBot

Search engine crawler access via robots.txt file

There are quite a few options when it comes to controlling how your site is crawled with the robots.txt file.

The User-agent: rule specifies which User-agent the rule applies to, and * is a wildcard matching any User-agent.

Disallow: sets the files or folders that are not allowed to be crawled.

Set a crawl delay for all search engines:

If you had 1,000 pages on your website, a search engine could potentially crawl your entire site in a few minutes.

However, this could cause high system resource usage, with all of those pages loaded in a short time period.

A Crawl-delay: of 30 seconds would allow crawlers to crawl your entire 1,000-page website in about 8.3 hours.

A Crawl-delay: of 500 seconds would allow crawlers to crawl your entire 1,000-page website in about 5.8 days.
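
The arithmetic behind these figures is simply pages multiplied by delay. A small sketch, assuming the crawler requests one page per delay interval and that fetch time itself is negligible (crawl_time_seconds is a hypothetical helper):

```python
def crawl_time_seconds(pages: int, crawl_delay: int) -> int:
    """Total seconds to crawl `pages` pages at one request per `crawl_delay` seconds."""
    return pages * crawl_delay

for delay in (30, 500):
    seconds = crawl_time_seconds(1000, delay)
    # 3,600 seconds per hour; 86,400 seconds per day.
    print(f"{delay}s delay: {seconds / 3600:.1f} hours ({seconds / 86400:.1f} days)")
```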

You can set the Crawl-delay: for all search engines at once with:

User-agent: *
Crawl-delay: 30
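
To see how a crawler might read this value, here is a minimal sketch using Python's standard urllib.robotparser module. ExampleBot is a made-up crawler name, and whether a real crawler honors the delay is up to that crawler.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Crawl-delay: 30
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# The wildcard entry applies to any crawler that has no entry of its own.
delay = parser.crawl_delay("ExampleBot")
print(delay)
```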

Allow all search engines to crawl website:

By default, search engines should be able to crawl your website, but you can also specify that they are allowed with:

User-agent: *
Disallow: 

Disallow all search engines from crawling website:

You can disallow all search engines from crawling your website with these rules:

User-agent: *
Disallow: /
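
The only difference between allowing everything and blocking everything is the single / after Disallow:, which is easy to miss. The sketch below compares the two rule sets with Python's standard-library parser; the can_fetch helper and the sample path are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

ALLOW_ALL = "User-agent: *\nDisallow:\n"    # empty value: nothing is disallowed
BLOCK_ALL = "User-agent: *\nDisallow: /\n"  # "/" matches every path on the site

def can_fetch(rules: str, user_agent: str, path: str) -> bool:
    """Parse `rules` and report whether `user_agent` may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, path)

print(can_fetch(ALLOW_ALL, "Googlebot", "/any/page.htm"))
print(can_fetch(BLOCK_ALL, "Googlebot", "/any/page.htm"))
```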

Disallow one particular search engine from crawling website:

You can disallow just one specific search engine from crawling your website with these rules:

User-agent: Baiduspider
Disallow: /
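
As a rough check that these rules single out one crawler, the sketch below asks Python's standard-library parser about two different User-agents. The requested path is an arbitrary example.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: Baiduspider
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Only the named crawler is blocked; crawlers with no matching entry are allowed.
baidu_allowed = parser.can_fetch("Baiduspider", "/index.htm")
google_allowed = parser.can_fetch("Googlebot", "/index.htm")
print(baidu_allowed, google_allowed)
```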

Disallow all search engines from particular folders:

If we had a few directories like /cgi-bin/, /private/, and /tmp/ that we didn't want bots to crawl, we could use this:

User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
Disallow: /tmp/
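
A crawler that respects these rules would skip anything under the three directories. The sketch below filters a list of made-up paths the way such a crawler might, using Python's standard-library parser.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Hypothetical paths on the site; only those outside the blocked folders remain.
paths = ["/index.htm", "/cgi-bin/form.cgi", "/private/notes.htm",
         "/tmp/cache.dat", "/blog/post.htm"]
crawlable = [p for p in paths if parser.can_fetch("ExampleBot", p)]
print(crawlable)
```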

Disallow all search engines from particular files:

If we had files like contactus.htm, index.htm, and store.htm that we didn't want bots to crawl, we could use this:

User-agent: *
Disallow: /contactus.htm
Disallow: /index.htm
Disallow: /store.htm
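
One detail worth knowing is that Disallow values are matched as path prefixes, so a rule for /index.htm also covers /index.html. The sketch below shows this with Python's standard-library parser; the extra paths are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /contactus.htm
Disallow: /index.htm
Disallow: /store.htm
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

exact_match = parser.can_fetch("ExampleBot", "/index.htm")     # listed file
prefix_match = parser.can_fetch("ExampleBot", "/index.html")   # starts with /index.htm
other_file = parser.can_fetch("ExampleBot", "/about.htm")      # not listed
print(exact_match, prefix_match, other_file)
```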

Disallow all search engines but one:

If we only wanted to allow Googlebot access to our /private/ directory, and disallow all other bots, we could use:

User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow:

When the Googlebot reads our robots.txt file, it will see it is not disallowed from crawling any directories.
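
The same outcome can be checked with Python's standard-library parser, which prefers an entry naming the crawler over the wildcard entry. The paths and the choice of bingbot as the "other" crawler are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow:
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Googlebot matches its own entry, which disallows nothing.
google_private = parser.can_fetch("Googlebot", "/private/report.htm")
# Any other crawler falls back to the wildcard entry.
bing_private = parser.can_fetch("bingbot", "/private/report.htm")
bing_public = parser.can_fetch("bingbot", "/index.htm")
print(google_private, bing_private, bing_public)
```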

n/a Points
2014-04-17 6:46 pm

That idea of blocking search engines worked perfect on my site.

Thanks for the precise example you have in lower half.

Abhi from http://howzthatim.com

n/a Points
2014-11-27 12:17 pm

Thanks for your post. I also applied it to my site and it works perfectly.

n/a Points
2014-06-02 5:29 am

I have two websites pointed to same folder. How can I disallow one website.

Staff
10,022 Points
2014-06-02 8:32 am
As the robots.txt file only determines what files are able to be accessed, unfortunately you would not be able to block a specific domain if it uses the same files as another site that you do want to be accessed.
n/a Points
2014-06-16 12:13 pm

I have looked for info about robot.txt on the web numerous times and this is the only one that made sense. thank you so much!!!

n/a Points
2014-07-18 6:28 am

Thanks for a detailed explanation on this all important topic. God bless you.

 

n/a Points
2014-07-24 10:35 am

I would like to disallow semalt and semalt-semalt crawlers from wreaking havoc on my bounce rate. If I use the code to disallow one particular search engine, do I need to write this code twice? Once for each individual crawler? Or maybe a comma between them? Thank you

Staff
11,080 Points
2014-07-24 10:54 am
Hello Mark,

Thank you for your question. It seems to be a common problem, judging by the amount of search results.

I found the following solution via online search, where it is blocked by referrer:

# block visitors referred from semalt.com
RewriteEngine on
RewriteCond %{HTTP_REFERER} semalt\.com [NC]
RewriteRule .* - [F]


If you have any further questions, feel free to post them below.
Thank you,

-John-Paul
n/a Points
2014-07-26 3:06 am

With regards to the crawl delay, so do i understand this correctly, if you introduce a longer delay for a bot to crawl your site, it doesn't reduce the cpu load, merely spreads it out over a longer period ?

Staff
9,968 Points
2014-07-26 10:48 am
Hello Andy,

Yes, you understand the crawl delay for robots correctly: it just causes the robot's requests to be spread out over a longer time period. Much like a highway dealing with traffic jams, high usage during short intervals can cause backups and delays, but if the usage is spread out over the course of a day, it's not as noticeable on the highway or the server. That's typically what you're trying to achieve with a crawl delay.

Please let us know if you have any further questions at all.

- Jacob
n/a Points
2014-08-11 2:53 pm

Good information, thanks.

n/a Points
2014-08-29 9:49 am

Hello!

I am currently developing a larger website, and while it is still in development I'd prefer that search engines do not crawl through it until I am finished. This way I can post the site so that multiple developers can code and test without the world knowing the site exists on Google and such. It seems to me that the code above would do that. Am I correct in my assessment?

Thanks,

Jay

Staff
20,924 Points
2014-08-29 11:30 am
Hello Jay,

Unfortunately, robots.txt rules are only a guideline: they ask crawlers to stay away but do not stop anyone from viewing the site. The best way to prevent anyone else from seeing the site, or having the search engines index it until you are ready, is to password protect the site via cPanel.

Kindest Regards,
Scott M
n/a Points
2014-08-29 12:14 pm

Thanks Scott for a great tip!

n/a Points
2014-09-30 5:34 am

Hello guys, I want to stop the Yahoo, Google, and Bing search engines from crawling my site. How can this be done?

Staff
11,080 Points
2014-09-30 9:09 am
Hello Ankit,

This article above is about just that. You can Disallow all search engines from crawling website, or just block the specific user-agents for yahoo, google, and bing (the user agents are listed above).

Are you having trouble with a specific step?

Thank you,
John-Paul
n/a Points
2014-10-01 3:31 am

Guys, I am having more problems related to SEO: my website is built in ASP.NET with the 3.5 framework, and I want a solution to the www and home.aspx 301 redirection problem. What exact code should be used for my website (www.rasavgems.com), and in which file should it go? Please explain in detail, with steps.

 

Thanks

Ankit

Staff
20,924 Points
2014-10-01 8:54 am
Hello Ankit,

I am not sure exactly what it is you are asking. Please try to be a bit more detailed and give us some steps if you can. Also, as this does not seem to be related to the robots.txt file, please reply with a new question.

Kindest Regards,
Scott M
n/a Points
2014-10-01 3:15 am

Hey Johnpaulb

i used following kind of the methods :

# robots.txt generated for google
User-agent: Googlebot
Disallow: /
User-agent: *
Disallow: / 


# robots.txt generated for yahoo
User-agent: Slurp
Disallow: /
User-agent: *
Disallow: /


# robots.txt generated for Msn
User-agent: MSNBot
Disallow: /
User-agent: *
Disallow: /


# robots.txt generated for ask
User-agent: Teoma
Disallow: /
User-agent: *
Disallow: /


# robots.txt generated for bingbot
User-agent: bingbot
Disallow: /
User-agent: *
Disallow: /

Please tell me whether this is okay for stopping the search engines from crawling my site. I uploaded a robots.txt file using all of the above methods together in one robots.txt file.
Staff
20,924 Points
2014-10-01 8:56 am
Hello Ankit,

It is fine if you do not want a search engine to crawl your site. If it does not, however, it means those pages may not get updated in the search engine or even show at all. If you wish, allow your favorite search engines to crawl your site at a reasonable delay if you want to show up in them. You can certainly set the file to block the others.

Kindest Regards,
Scott M
n/a Points
2014-10-03 12:57 am

Thanks Scott M :)

Have a great day!!

n/a Points
2014-10-07 2:21 pm

Google is including my shopping cart pages in its searches.  They are not in a folder that I can block like

User-agent: *
Disallow: /cgi-bin/

 

Is there a way to block files that all begin with:

/addtocart.sc?productld=13&quantity=1

/addtocart.sc?productld=14&quantity=1

/addtocart.sc?productld=23&quantity=1

etc.?

Thank you

Staff
10,022 Points
2014-10-07 2:40 pm
To do so, you could do something like this:

User-agent: *
Disallow: /addtocart.sc

n/a Points
2014-11-26 8:49 am

Hi !

How can I block the folder /2014/11/ ? Here is where my site is currently located: http://dbmakemoney.com/2014/11/other-advertising-networks-besides-google-adsense/

I want to 

http://dbmakemoney.com/other-advertising-networks-besides-google-adsense/

 

Thanks in advance!

Staff
20,924 Points
2014-11-26 9:19 am
Hello Lybear,

What exactly are you asking? Do you want to block access to 2014/11 folder? Or are you looking to set up a redirect of some sort?

Kindest Regards,
Scott M
n/a Points
2014-11-26 8:14 pm

Hello Scott ,

I want all my pages under 2014/11/mysites to show as just mysites, without the 2014/11 folder. Actually, I don't know the difference between blocking access to the 2014/11 folder and a redirect of some sort. If possible, could you show me both ways?

Best Regards,

Staff
18,513 Points
2014-11-26 9:04 pm
Hello Lybear,

Thanks for the question. If you are trying to prevent search engines from accessing the directory you're indicating, then you can use the ROBOTS.TXT tutorial above for this purpose. A re-direct is used to change the path of a URL from one location to another. If you have other things that rely on that URL and the files at that location, then you may not want to do the re-direct. If you want more information on creating a re-direct, try reviewing Setting a 301 Redirect in your HTACCESS.

I hope this helps to provide the answer that you seek. If you require further assistance, please let us know.

Regards,
Arnel C.
n/a Points
2014-10-07 2:44 pm

Thanks for the quick reply.  I'll give that a shot.

n/a Points
2014-10-17 9:30 am

How do I stop robots with an IP range that they are coming from with the robot.txt

Staff
10,022 Points
2014-10-17 9:35 am
Nearly all bots that are not reputable search engines will completely ignore the robots.txt file and continue to crawl. Your best solution would be to block the IP range using .htaccess.
n/a Points
2014-11-17 5:48 am

Hi there,

We are faced with a situation where we have to rebuild and replace a client's existing website with a new site. Going from static html to Wordpress...

What is the best way to completely block the new site while in development?

Should we use a password protect method?

Regards

greg 

 

Staff
11,080 Points
2014-11-17 10:43 am
Hello Greg,

Thank you for your question. You can easily block access to your new site by using the Password Protect tool in cPanel.

That tool adds the .htaccess rules for you.

If you have any further questions, feel free to post them below.

Thank you,
John-Paul
n/a Points
2014-11-17 11:28 am

Thanks Jean-Paul,

Just a couple of further questions:

I setup a subdomain to build the new site which I want to block from the search engines.

So what is a bit confusing is - at what level do you set the password protect?

Should it be at the /public_html/abcdirectory/ which is the document root?

Also, how do you test to see that the password is actually working? I set the password as above and then was immediately able to log in to the WP dashboard without having to enter a username and password....

Am I missing something?

Appreciate your help..

Regards

Greg

Staff
20,924 Points
2014-11-17 1:15 pm
Hello Greg,

If you have the WordPress site in a subfolder, say example.com/test, then you would set the password at the folder level for 'test'. This way no one would see the site while you were developing. You may be interested in our articles on password protecting a folder within the cPanel. You can also ask your questions about passwords on that article since it is relevant.

As for checking to see if it is working, use a browser in incognito mode so it appears to be a new visitor. You should see it ask for a username and password then. Once you have logged in with a browser in normal mode, it remembers you for a time.

Kindest Regards,
Scott M
n/a Points
2014-11-24 12:01 pm

Hi there,

I have about 40 WordPress websites on one hosting account and every evening around the same time, my hosting gets sluggish and goes down for about 20 to 30 minutes. I have looked at the server logs and it looks like that's when sites are getting crawled by Google. Previously, I haven't had any specific robots.txt files on each site (shame on me, yes). I have added robots.txt files for all the sites with fairly restrictive disallow settings that really only give access to the wp-content folder (minus the theme and plugins). Will reducing the access to the bots significantly reduce the impact on my server when the sites are being crawled or do I also need to set a crawl delay?

Also, only a couple of the sites are blogs and those are the only ones with a significant amount of pages. The rest are small, static sites. Would you recommend just setting a crawl delay on the large blogs that have 1,000+ pages and posts?

Thanks!

Staff
11,080 Points
2014-11-24 12:31 pm
Hello Neil,

Thank you for your question. While setting a crawl delay may help, we would need to see the nature of the requests to provide a detailed answer.

This is because you may be getting crawled by bots that are not following your robots.txt rules. In this case a robots.txt file will not help. Instead, identify and block the specific bots from your site.

Thank you,
John-Paul
n/a Points
2014-11-24 12:43 pm

Thanks John-Paul. Until a few hours ago I did not have any robots.txt rules. A few hours ago I created the robots.txt file for each site with more restrictive disallow rules instructing bots to not crawl the wp-includes folder, the theme and plugin folders and wp-admin. I'm hoping this reduces the scope and impact of the bots on the server each evening. If not, then perhaps a crawl delay would at least spread the impact out and not take down the server...

Staff
20,924 Points
2014-11-24 12:47 pm
Hello Neil,

While using robots.txt and setting delays may help, not every search engine honors every rule in the file. Googlebot, for example, ignores Crawl-delay; you can set your crawl rate preferences for it from within Google's webmaster tools. For other search engines, setting the delays and requests not to crawl in robots.txt is done with the expectation and hope that they will listen.

Kindest Regards,
Scott M
n/a Points
2014-11-26 10:05 pm

Hello Arn,

Thanks for your reply .

Anyway, in the ROBOTS.TXT I have already done it this way:

Disallow: /2014/11/

but it still shows in the browser as site.com/2014/11/mysties

I only want it to show site.com/mysties

Best regards,

Staff
18,513 Points
2014-11-26 10:40 pm
Hello Lybear,

The ROBOTS.TXT file will NOT block you from accessing that folder. It only prevents search bots from going into the folder. In order to prevent your browser from using /2014/11, then you will need to create a rewrite rule in your .htaccess file. Try reviewing this forum for a rewrite rule that may help in your case.

Kindest regards,
Arnel C.
n/a Points
2014-11-27 12:12 am

Hello Arn !

 

I have follow this step and got error 

Here is my error 

 

# BEGIN WordPress
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteRule ^2014/11/(.*)$ $1 [L,QSA]
RewriteBase /
RewriteRule ^index\.php$ - [L]
RewriteCond %{QUERY_STRING} !lp-variation-id
# This condition was met
RewriteRule ^go/([^/]*)? /wp-content/plugins/landing-pages/modules/module.redirect-ab-testing.php?permalink_name=$1 [QSA,L]
RewriteRule ^landing-page=([^/]*)? /wp-content/plugins/landing-pages/modules/module.redirect-ab-testing.php?permalink_name=$1 [QSA,L]
RewriteCond %{REQUEST_FILENAME} !-f
# This variable is not supported: %{REQUEST_FILENAME}
RewriteCond %{REQUEST_FILENAME} !-d
# This variable is not supported: %{REQUEST_FILENAME}
RewriteRule . /index.php [L]
# This rule was not met because one of the conditions was not met
</IfModule>
# END WordPress

This rule was met, the new url is http://dbmakemoney.com/other-advertising-networks-besides-google-adsense/

Staff
18,513 Points
2014-11-28 10:10 pm
Hello Lybear,

Sorry that you're having issues with the re-direct. You may want to use the official Apache documentation on mod_rewrite to determine how best to write your rule. We can't write the rule for you, unfortunately. It does appear that you have modified the original WordPress .htaccess rule. You may want to remove the rule. Check out the rewrite rules in the articles listed for .htaccess files.

Apologies that we cannot give you a direct answer on the issue. Hopefully, this will help direct you to a more appropriate answer.

Regards,
Arnel C.
n/a Points
2014-12-17 12:41 pm

Is there an easy way to implement crawl delays serverwide for all domains?

Staff
18,513 Points
2014-12-17 1:30 pm
Hello D,

Thanks for the question. Each robots.txt file applies to each domain. If you want to apply a crawl delay for each domain, then simply use the instructions above, then copy the file to each domain where you need the crawl delay to apply. You can't do it from one location.

I hope this helps to answer your question, please let us know if you require any further assistance.

Regards,
Arnel C.
n/a Points
2014-12-17 1:31 pm

thanks!

n/a Points
2014-12-18 9:05 am

my website not showing any page in google

n/a Points
2014-12-18 9:06 am

My website is not showing any pages in the Google search listing: directory.expertwebworld.com. I don't know why. I checked all the robots rules and none of them disallow Google's main robot. Even the meta tags are set to index,follow.

Staff
20,924 Points
2014-12-18 10:22 am
Hello ExpertWebWorld,

I can see your website is indexed in Google. From here you can just focus on SEO for specific keywords for your pages. But it is definitely visible in the index, so Google has noticed it.

Kindest Regards,
Scott M
n/a Points
2014-12-19 12:56 am

yes you are right , but me little bit surprise to that none of my directory listing page is showing in google search http://directory.expertwebworld.com/search.php?cn=Computer+leasing+-+rental other pages like about us, blog, portfolio etc are showing but the record which submit by the visitor in different category. hope you understand what i mean

Staff
20,924 Points
2014-12-19 9:00 am
Hello ExpertWebWorld,

Thanks for getting back with us. Google has its own policies and algorithms for indexing and will never disclose them, so that they cannot be manipulated. The best anyone can do is to work on SEO, likely with Google Webmaster Tools, to help themselves in the ranking. SEO is a relationship between web pages and Google.

Kindest Regards,
Scott M
