How to Stop Search Engines from Crawling your Website

Search engine crawlers, sometimes referred to as bots or spiders, crawl your website looking for updated text and links so they can update their search indexes. This is how other people find your website.

How to control search engine crawlers with a robots.txt file

Website owners can instruct search engines on how they should crawl a website by using a robots.txt file.

When a search engine crawls a website, it requests the robots.txt file first and then follows the rules within.

It’s important to know that robots.txt rules are only a guideline; bots are not required to follow them.

For instance, Googlebot ignores the Crawl-delay rule entirely; to set a crawl rate for Google, this must be done in Google Webmaster Tools.

For bad bots that abuse your site, you should look at how to block bad users by User-agent in .htaccess, as in the sketch below.
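
As a minimal .htaccess sketch of such a User-agent block (“BadBot” is a placeholder for the actual User-agent string you want to block):

<IfModule mod_rewrite.c>
RewriteEngine On
# "BadBot" is a placeholder User-agent string; replace it with the bot abusing your site
RewriteCond %{HTTP_USER_AGENT} BadBot [NC]
# Return 403 Forbidden for any request from that User-agent
RewriteRule .* - [F,L]
</IfModule>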

Edit or create a robots.txt file

The robots.txt file needs to be at the root of your site. If your domain is example.com, it should be found at:

On your website:

https://example.com/robots.txt

On your server:

/home/userna5/public_html/robots.txt

If you don’t already have one, you can create a new plain-text file named robots.txt.

Search engine User-agents

The most common rule you’d use in a robots.txt file is based on the User-agent of the search engine crawler.

Search engine crawlers use a User-agent to identify themselves when crawling. Here are some common examples:

Top 3 US search engine User-agents:

Googlebot

Yahoo! Slurp

bingbot

Commonly blocked search engine User-agents:

AhrefsBot 

Baiduspider 

Ezooms 

MJ12bot 

YandexBot

Search engine crawler access via robots.txt file

There are quite a few options when it comes to controlling how your site is crawled with the robots.txt file.

The User-agent: rule specifies which User-agent the rule applies to, and * is a wildcard matching any User-agent.

Disallow: sets the files or folders that are not allowed to be crawled.


Set a crawl delay for all search engines:

If you had 1,000 pages on your website, a search engine could potentially index your entire site in a few minutes.

However, this could cause high system resource usage from all of those pages being loaded in a short time period.

A Crawl-delay: of 30 seconds would allow crawlers to index your entire 1,000-page website in just 8.3 hours.

A Crawl-delay: of 500 seconds would allow crawlers to index your entire 1,000-page website in 5.8 days.

You can set the Crawl-delay: for all search engines at once with:

User-agent: * 
Crawl-delay: 30


Allow all search engines to crawl the website:

By default, search engines should be able to crawl your website, but you can also explicitly specify that they are allowed with:

User-agent: *
Disallow:


Disallow all search engines from crawling the website:

You can disallow any search engine from crawling your website with these rules:

User-agent: *
Disallow: /


Disallow one particular search engine from crawling the website:

You can disallow just one specific search engine from crawling your website with these rules:

User-agent: Baiduspider 
Disallow: /


Disallow all search engines from particular folders:

If we had a few directories like /cgi-bin/, /private/, and /tmp/ that we didn’t want bots to crawl, we could use this:

User-agent: * 
Disallow: /cgi-bin/ 
Disallow: /private/ 
Disallow: /tmp/


Disallow all search engines from particular files:

If we had files like contactus.htm, index.htm, and store.htm that we didn’t want bots to crawl, we could use this:

User-agent: *
Disallow: /contactus.htm
Disallow: /index.htm 
Disallow: /store.htm


Disallow all search engines but one:

If we only wanted to allow Googlebot access to our /private/ directory and disallow all other bots, we could use:

User-agent: * 
Disallow: /private/  
User-agent: Googlebot 
Disallow:

When the Googlebot reads our robots.txt file, it will see it is not disallowed from crawling any directories.


168 thoughts on “How to Stop Search Engines from Crawling your Website”

  1. How can I assist Google to revisit the following page every day?
    sehdevpackers.com/packers-movers-gurgaon

    1. Hello Sehdev Packers and Movers – Realistically, you can’t force Google to re-crawl/visit your page every day. You can keep creating content that they will update or note when your site is changing; that is probably the most realistic way to get Google to review your site. But when your site is new, it won’t happen immediately. Check out this article for more information on forcing Google to recrawl your site: https://www.searchenginewatch.com/2018/04/20/how-to-force-google-to-recrawl-your-website/

  2. Hi,

    I am seeing my admin work in my Google Analytics results. The /admin/ location is counting pageviews, and I don’t want Google to crawl my admin area. What exactly do I have to write in the robots.txt file to stop all of the admin back end from being crawled? Can anyone help me out with this problem?

    1. Hello Sam,

      Unfortunately, it is sometimes impossible to keep Google from indexing certain pages, even with robots.txt blocks in place. You may want to contact the developer of the site to see if there is a way to avoid the indexing of that page.

    1. Hello and thanks for contacting us. I recommend you contact Google directly and ensure all website metadata is updated accordingly.
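
      As a rough sketch based on the article above, a robots.txt rule that asks compliant crawlers to skip an /admin/ path (the exact path depends on your site) would be:

      User-agent: *
      Disallow: /admin/

      Keep in mind this only asks well-behaved bots not to crawl those pages; it does not guarantee removal from the index.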

  3. According to an e-mail I received from Google, blocking Google bots will be penalized as an error. Since most crawlers do not care what the robots.txt suggests, this article is practically obsolete. Use X-Robots-Tag instead, or better, move your files below the public_html.

    1. Hi, Luis — Thank you so much for your comment. We’ll certainly review the article and make the appropriate changes.
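
      In the meantime, here is a rough sketch of the X-Robots-Tag approach Luis mentions, assuming an Apache server with mod_headers enabled:

      <IfModule mod_headers.c>
      # Ask search engines not to index, or follow links on, anything served from this directory
      Header set X-Robots-Tag "noindex, nofollow"
      </IfModule>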

  4. Hi, thanks for this detailed description. I’m not sure you can answer this question but here goes.

    I’m trying to monitor an offline marketing campaign and want to get results as precise as possible. What I was thinking was to make a copy of my website, publish it, and not allow any bots to crawl it.

    The idea is to have a website that can only be found by people who have been reached by the offline marketing and, at the same time, to avoid my original website (which is crawled and ranks well) being punished for duplicate content, as I would simply be copying my site.

    Would the above method accomplish this goal?

    Once again, thanks for your help

    1. Hello Janus,

      Thank you for your comment. Yes, you can set up a cloned version for this purpose and block bots from crawling it; however, you will likely need to use a subdomain such as dev.example.com, as you cannot host two versions of a live site on the same domain name.

      Best Regards,
      Alyssa K.

  5. here is my robots.txt file after edited and updated:
    ” User-agent: *
    Disallow: /wp-admin/
    Allow: /wp-admin/admin-ajax.php ”

    and here is it when i check https://domain/robots.txt
    ” User-agent: *
    Disallow: / ”

    It seems the website can’t update the change, so Google can’t index my website. Please help me.

  6. Thanks for sharing your knowledge and information, this must help us, appreciate your post. Saily from TechRecur

  7. Thanks for the detailed guide on how to block search engines from indexing the site and the surrounding values.

  8. It may affect how your website shows in search engine results, but it shouldn’t affect your users’ experience negatively. It may make the website faster.

  9. I have added a robots.txt file with certain guidelines to my web app. Now I want to serve my new robots.txt file. How can I do so? Help, urgent!

    1. Hello – sorry for the issue with your site not being crawled by Google. You can go to Webmaster Tools (from Google) and make sure that your site is being searched. Make sure that you do NOT have a robots.txt file that is blocking their crawler, as per the instructions in this article.

  10. The article above provides information on how to stop bots from crawling your site. If you are unable to use the information above, then I recommend speaking with a website developer for further assistance.

  11. In my robots.txt file I have written the following code:

    User-agent: *
    Disallow: /

    But this is not working. I am still seeing my website in the search engine.

    1. If your website was already in the search engine, this rule does not remove it. The robots.txt file suggests that the search engine not use it. Google supposedly does listen to this file, but remember that it is only a recommendation, not a requirement for search engines to follow the robots.txt. If you want the search result removed, you will need to contact the search engine directly. They (the search engines) typically have a procedure to have the search results removed.

  12. For Crawl-delay, is the value taken in seconds or milliseconds? I got some conflicting answers from the internet; can you make it clear?

  13. When I see User-agent: *, does this mean Googlebot is automatically there, or do I have to type in Googlebot?

    Also, if I see Disallow: /, could I remove the line and make it ‘allow’? If so, where do I go to do this? I’m using the WordPress platform.

    1. You should specify Googlebot as shown in the example above. We are happy to help with a disallow rule but will need more information on what you are attempting to accomplish.

      Thank you,
      John-Paul

  14. Hi. I want to block all crawlers on my site (forum).

    But for some reason, the command in my “robots.txt” file doesn’t take any effect.

    Actually, everything is pretty much the same with or without it.

    I constantly have at least 10 crawlers (bots) on my forum…

    Yes, I used the right command. I made sure that nothing is wrong; it’s pretty simple.

    User-agent: *

    Disallow: /

    And still, on my forum, I have at least 10 bots (as guests) and they keep visiting my site. I tried banning some IPs (which are very similar to each other). They are banned, but they still keep coming… And I’m receiving notifications in my admin panel because of them.

    Example: https://prntscr.com/hptzz3 ;

    I even tried writing to the hosting provider of that IP address about the abuse. They replied that “that” is only a crawler… Now… Any recommendations? 🙂 Thanks.

  15. Hello, 

    My robots.txt is
    User-agent: *
    Disallow: /profile/*

     

    because I don’t want any bot to crawl the users’ profiles. Why? Because it was bringing a lot of unusual traffic to the website, and a high bounce rate.

    After I uploaded the robots.txt, I noticed a steep drop in the traffic to my website, and I am not getting relevant traffic as well. Please advise: what should I do?
    I have done an audit process as well and can’t find the reason why it’s being held back.

    1. If the only change you made was to the robots.txt file then there should be no reason for the sudden drop-off in traffic. My suggestion is that you remove the robots.txt entry and then analyze the traffic that you are receiving. If it continues to be an issue, then you should speak with an experienced web developer/analyst in order to help you determine what could be affecting the traffic on your site.

  16. I want to block my main domain name from being crawled, but allow the addon domains to be crawled. The main domain is just a blank site that I have with my hosting plan. If I put a robots.txt in public_html to prevent crawlers, will it affect my clients’ addon domains hosted inside subfolders of public_html? So, the main domain is at public_html and the addon domains are at public_html/clients/abc.com

    Any response will be appreciated.

  17. I have to block my website for only Google Australia. I have 2 domains, one for India (.com) and one for Australia (.com.au), but I still found my Indian domain on google.com.au, so let me know the best solution to block only google.com.au for my website.

    1. Using the robots.txt file remains one of the better ways to block a domain from being crawled by search engines, including Google. However, if you’re still having trouble with it, then, paradoxically, the best way to not have your website show in Google is to have the page indexed by Google and then use a metatag to let Google know not to display your page(s) in their search engine. You can find a good article on this topic here.
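
      The metatag approach is a single line placed in the <head> of each page you want kept out of search results, for example:

      <meta name="robots" content="noindex">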

  18. Google blocked my site, but I never put any robots.txt file in place to disallow Google. I’m confused. Why would Google not be tracking my page if I didn’t use a robots file?

    1. You may want to double-check your analytics tracking code. Make sure that Google’s tracking code is visible on your site for each page you want to track.

  19. Hello Scott,

    Can you explain: if my domain and subdomain are in the same root folder, how can I block the particular subdomain with robots.txt or something similar?

  20. How can I block my site in the Google search engine?

    I want my site indexed by other search engines, just not Google.

    Which code do I paste in the robots.txt file?

    Thanks in advance.
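
    Per the “Disallow one particular search engine” example in the article above, a sketch for this would be:

    User-agent: Googlebot
    Disallow: /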

  21. Hey, can you help me? I want to remove this link from Google search: www.complaintboard.in/complaints-reviews/capital-cow-l427964.html

    When I search Google for “capital cow”, this URL shows in the 2nd position, but I want to remove it or shift it to the next page on Google. What should I do? Please suggest… thanks

  22. user agent: *

    disallow: /

    Does it mean it stops all bots from crawling our site?

    Please update me, because I got confused between

    disallow: /abc.com/ and disallow: /

    1. Yes, the code:
      User-agent: *
      Disallow: /

      is a request for the search engine to not crawl your site. They may ignore it if they choose.

    1. No, the robots.txt file is there to limit bots on the site. It prevents them from crawling. It does not block traffic. Traffic can be blocked by the .htaccess file.
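
      As a minimal sketch of such a traffic block, assuming Apache 2.4 (203.0.113.0/24 is a placeholder range):

      <RequireAll>
      # Allow everyone except the placeholder range below; replace with the offending IPs
      Require all granted
      Require not ip 203.0.113.0/24
      </RequireAll>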

  23. I have a website with pages that are restricted with a username/password. On some of these restricted pages I call up PDF files. However, Google etc. finds and displays the contents of the files that were intended to be restricted.

    Question: If I make a robots.txt file to block the PDF directory, will Google forget the old index after a while? Or do I have to recreate the file with another name?

    1. If a folder is password protected correctly, it should not be accessible to be crawled by Google, so the robots.txt file shouldn’t make a difference. Even if the files are listed in search results, they should not be accessible as long as they are password protected.

      After Google re-crawls your site, it should update the links and no longer list the PDFs. If they are not crawling your site, you can request that they reconsider crawling your site.

      Thank you,
      John-Paul

  24. Hello everyone, I have read all the above but am still not able to get it, so please reply to me:

    How can I disallow spiders, crawlers, and robots of search engines like Google and Bing from seeing my web page, while also making sure they don’t block me or assume that I am malware or something? I want to run a PPC campaign on Google and also want to redirect my link from www.example.com to www.example.com/test

    or, if I can change the whole URL, from www.example.com to www.xyz.com

    The catch is that I don’t want the bots to see my redirected domain.

    Any help will be appreciated, as I have seen above that you people have resolved almost everyone’s issue. I hope mine will be resolved too.

    1. Hello Nilesh,

      The robots.txt files are merely GUIDES for the search engine bots. They are not required to follow the robots.txt file. That being said, you can use the directions above to direct typical bots (e.g. Google, Bing) not to scan parts (or all) of your website. So, if you don’t want them to go through a redirected site, then you simply have to create a robots.txt file FOR that site. If that site is not under your control, then you will not have a way to do that.

      If you have any further questions or comments, please let us know.

      Regards,
      Arnel C.

    2. I get a lot of spam mails. I tried adding a captcha, but I still get spam mails. Now I have tried editing my robots.txt and disallowed access to the contact-us page. I guess this might happen as my email ID is still there in clickable format. Did I do it right? Would this affect the SEO? Please suggest a solution.

      How should I get rid of spam mails in future?!

    3. Bots do not have to comply with the robots.txt directives. Legitimate bots typically will, but spam bots do not. So, is the spam coming from the form on the contact page, or is it just coming to your email address? If it’s the form getting filled out, a captcha should help. If it’s simply email spam coming through, not from the form directly, you should look at changing the code so your email address is not exposed.

  25. Web crawlers crawl your site to allow potential customers to find your website. Blocking search engine spiders from accessing your website makes your website less visible. Am I right? Why are people trying to block search engine spiders? What am I missing?

    1. Hello Elias,

      Yes, you are correct. However, sometimes, there are many files that you do NOT want a search engine to index (e.g. library of internal files). Spiders can also cause a load on the site. So, you can use a ROBOTS file to help control the search indexing of your site.

      I hope that helps to answer your question! If you require further assistance, please let us know!

      Regards,
      Arnel C.

  26. Hi, I am new to robots.txt. I would like to build a web crawler that only crawls a local site. Is it a rule that crawlers should crawl only through the allowed domains? What if my crawler ignores the robots.txt file? Will there be any legal issues in doing so? Any help would be appreciated. Thanks!

    1. Hello Sunil,

      The Robots.txt file’s purpose was to allow website owners to lessen the impact of search crawlers on their sites. If you were to ignore it, then they may consider putting something else up to block you or consider your crawler malware.

      If you have any further questions, please let us know.

      Kindest regards,
      Arnel C.

    2. Hello Marnix,

      Thank you for contacting us. Here is a link to our guide on how to Block a country from your site using htaccess.

      To remove your site from the specific search engines, I recommend setting up accounts with them (such as Webmaster Tools from Google), and requesting that they do not crawl your sites.

      Thank you,
      John-Paul

  27. I’m wanting to block a website from being listed in only the UK search engines, or from being listed in the UK, e.g. google.co.uk, google.com, bing.co.uk, and bing.com should not show the website when someone searches for it in the UK.

    How can this be done please?

    Best Regards,

    Marnix

  28. Apologies if this has been answered already. I couldn’t locate an answer…

    Greetings – I have a WordPress site, and will redevelop it in a separate folder and then move the redeveloped site to the root directory. I want to block the www.example.com/dev/ folder from being crawled until the new site is completed.

     

    Should the robots.txt file look like this, and will the live site www.example.com NOT be blocked while the /dev/ folder is blocked?

    User-agent: *

    Disallow: /example.com/dev/

    1. You only have to include the folder name, like below:

      User-agent: *
      Disallow: /dev/

      This will keep the ‘dev’ folder from being crawled.

    1. Hello Sharey,

      You have a Disallow: line in your robots.txt that has nothing past it. I would suggest to fix that part but other than that it looks great.

      Best Regards,
      TJ Edens

    1. Websites do not have dynamic IPs, but maybe I’m not understanding your question. Are you asking if your website should have a static IP address to be crawled?

  29. Thanks for the reply, Arn.

    I didn’t see anything about wildcards in that thread about .htaccess. Anyhow, .htaccess files are way too complicated for me.

    What I want to do is tell spiders to not look at .asp and .exe files.  Can *.asp and *.exe be used in a robots.txt file?

    1. To block specific file extensions, use the format below:

      User-agent: *
      Disallow: /*.gif$

      So in your case, you could have:
      User-agent: *
      Disallow: /*.asp$

      User-agent: *
      Disallow: /*.exe$

    1. Place a robots.txt file in your public_html or www directory and place the following code in the robots.txt file:

      User-agent: Yahoo! Slurp
      Disallow: /

  30. Can wildcards be used to specify files to disallow?  Like all .asp and .exe files?

    Disallow: /*.asp

    Disallow: /*.exe

    If the above would work, would it apply only to files in the root folder?

    Thanks

    1. Hello Brian,

      The robots.txt file is specifically used for controlling what robots can or cannot see. You would need to access the .htaccess file in order to add rules about certain files. Check out this forum post about the subject.

      If you have any further questions or comments, please let us know.

      Regards,
      Arnel C.

  31. This is very old but I cannot resist responding 🙂

    You can probably write a rewrite rule that detects the HTTP_HOST host header and returns a 404 response for robots.txt on the site where you want to allow search engines.
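
    A minimal sketch of that idea, where allowed-example.com is a placeholder for the domain that should be crawled:

    <IfModule mod_rewrite.c>
    RewriteEngine On
    # Serve a 404 for robots.txt only on the domain you want search engines to crawl
    RewriteCond %{HTTP_HOST} ^(www\.)?allowed-example\.com$ [NC]
    RewriteRule ^robots\.txt$ - [R=404,L]
    </IfModule>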

    1. We do not have a robots.txt file at the root level that would be conflicting with yours. It’s possibly a path error. Nevertheless, I recommend doing your development with a hosts file modification, so you can use the proper domain name.

  32. Sir, how can I stop this?

    I have a problem: after a Google search, one website (ekikrat.in) leads into another site which is being crawled by Google. I want to stop or hide this.

    1. Hello,

      You can follow the guide above which will prevent googlebot from crawling your website.

      Best Regards,
      TJ Edens

  33. For an e-commerce B2B site, prices are different for different users, so I want search engines not to index the price of the product. Is it possible?

    Waiting for the response.

     

    Thanking you,

    San

    1. Hello Sanjay,

      If you kept that information on a separate page, then you could use robots.txt to ignore that page. You could also encode that information on the page, but it would probably be best to simply not publish it and ask your customers to contact you for that information. Here’s a good post on keeping your content hidden.

      If you have any further questions or comments, please let us know.

      Regards,
      Arnel C.

  34. Hi,

    Is there a way to have a page indexed but not have one aspect of it crawled? We’d like to add an intro box to the top of a page, but we don’t want the intro box crawled.

    Thanks,

    Pat

  35. Great post. Thanks a lot.

    I have one question, please. I have the domain www.test.com, right? And at the same time, I have this URL: https://mail.test.com

    How can I, through robots.txt, block https://mail.test.com from appearing in search results, i.e. keep it from being crawled?

     

    thanks in advance

     

     

    1. You just need to create a robots.txt file in the root folder of the subdomain and enter the following code:

      User-agent: *
      Disallow: /

      This will block the entire subdomain from being crawled.

  36. I noticed that, on my server — ecres161 — when you’re developing a site and working with temp URLs like this:

    https://ecres161.servconfig.com/~username/welcome

    … if you try to do anything that needs robots.txt, it won’t work.

    For example, Google’s various testing tools or sitemap software that looks at robots.txt. Both of those things fail for me, citing being prevented by robots.txt, even if I do not have a robots.txt file in my public_html dir.

    However, once I launch a site and the url is like: https://www.mydomain.com/welcome, then it *does* find the local robots.txt file and works fine.

    So, I suspect servconfig.com has its own robots.txt and is disallowing everything, which I understand may be good. But it makes it tough to do any pre-testing work prior to launching a site. So, is this done on purpose, or is it something that can be changed on InMotion’s server to allow us to do testing prior to launching a site?

  37. Hi, I have created the appropriate robots.txt and it has stopped indexing. The website in question is go.xxxxx.com. It is an internal CRM that we do not want visible. All indexing has stopped except when I google “go company name” or “company name go”; then the site link pops up with no description, because it says robots.txt will not allow the crawler. Is there a way to keep even the link to the page from being indexed when searching that specific word? I assume it is finding it because it is in the URL?

    1. Hello Will,

      Robots.txt is basically a request for robots to not crawl the site. All search engines, Google included, will basically do what they want. Google listens to your options in Webmaster tools more than it will in robots.txt, so you may want to check that out as well.

      Kindest Regards,
      Scott M

    2. Hello.
      I had a similar problem. Because I receive a high number of crawlers and spiders on my website, I decided to redirect them to another domain name. Right now I see an improvement, but not all of them are gone. I see some Chinese spiders that are still crawling my website.
      What can I do to stop them, and how are they avoiding the redirection?
      Thank you!

    3. Hello Andru,

      Robots.txt is a request, but only good bots will listen to it. Bad bots will not listen to the robots.txt. Chinese bots are very often among those that do not listen to the file. You may need to set up specific redirects or blocks for the ones that are more persistent.

      Kindest Regards,
      Scott M

  38. Wow, very nice article!

    I wanted to block my forum like www.site.com/forum

    so I am using this:

    User-agent: *
    Disallow: /forum

    Thanks :X
    1. Hi Monica,

      The html files are the individual pages, so yes, you would be blocking those particular pages from being crawled by the search engines that honor the request.

      Kindest Regards,
      Scott M

  39. We are using a program called Rapid Weaver, a Mac program.

    How do I create a Robot.txt file for just certain pages that we do not want to have crawled?

    I understand it needs to be in the root directory? 

    If possible tell me if I am understanding correctly:

    Create a page, for example: https://www.amrtax.com/robot.Txt (or robots.txt with an S?)

    On that page before header:

    User-agent: *

    Dissallow:/findrefund.html

    Disallow:/whattobring.html

    Dissallow:/worksheets.htm

    Dissallow:/services.html

    Dissallow:/Staff.html

    Dissallow/enrolledagent.html

     

    Do I have the hang of it? If I upload that page, although not added to the menu, would this work?

     

    Trying to work it out in my head!

     

    1. Hello Monica,

      You’re blocking individual files from being searched with the rules above. And, yes, it’s robots.txt. One note: the directive is spelled Disallow:, not Dissallow:, so correct that spelling. Just follow the directions in the article above to complete the file properly.

      I hope this helps to answer your question, please let us know if you require any further assistance.

      Regards,
      Arnel C.

  40. I am getting lots of HTTP requests on my website for a particular page, which consumes a lot of my CPU and memory. I want to block access to that page and drop HTTP requests for that page.

    Kindly suggest.

    1. Hello Suraj,

      If they’re hitting a particular page on your website, you do have the option of removing that page if it’s not necessary. Otherwise, you can use the .htaccess file to create a redirect for that specific page. Check out the list of things you can do with the .htaccess file here.

      Regards,
      Arnel C.
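
      As a rough sketch of the blocking option (busy-page.html is a placeholder for the page being hammered), an .htaccess rule that simply refuses requests for one page would be:

      <IfModule mod_rewrite.c>
      RewriteEngine On
      # Return 403 Forbidden for the problem page; replace busy-page.html with the real filename
      RewriteRule ^busy-page\.html$ - [F,L]
      </IfModule>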

  41. They are on different lines; somehow they were bunched together when I posted the comment here.

    <IfModule mod_rewrite.c>

    RewriteCond %{HTTP_USER_AGENT} ^Yandex [NC,OR]

    RewriteCond %{HTTP_USER_AGENT} ^Baiduspider [NC,OR]

    RewriteCond %{HTTP_USER_AGENT} ^Baidu [NC]

    RewriteRule ^.* – [F,L]

    </IfModule>

  42. These are the new entries from Baidu spider after all the entries made to block them.

     

    80.76.6.233 – – [18/Feb/2015:10:05:22 +1100] “GET /link/id/zzzz5448e5b9546e4300/page.html HTTP/1.1” 403 505 “-” “Mozilla/5.0 (compatible; Baiduspider/2.0; +https://www.baidu.com/search/spider.html)”
    180.76.5.151 – – [18/Feb/2015:10:05:30 +1100] “GET /link/id/b57de3ecb30f9dc35741P8c23b17d6c9e0d8b4d5a/page.html HTTP/1.1” 403 521 “-” “Mozilla/5.0 (compatible; Baiduspider/2.0; +https://www.baidu.com/search/spider.html)”
    123.125.71.109 – – [18/Feb/2015:10:05:34 +1100] “GET /media/dynamic/id/57264034bd6461d9b091zzzz52312bad5cc09124/interface.gif HTTP/1.1” 403 529 “-” “Baiduspider-image+(+https://www.baidu.com/search/spider.htm)\\nReferer: https://image.baidu.com/i?ct=503316480&z=0&tn=baiduimagedetail”

    1. Hello Manny,

      Which .htaccess file did you put these in? Please be sure to put them in the one located in your domain’s document root. Also, these should all be separated line by line and not bunched together.

      Best Regards,
      TJ Edens

  43. Hello John,

     

    Thanks for your response. I have added the Rewrite rules as mentioned, but I still see the Baidu spider entries in the access.log:

    180.76.5.64 – – [18/Feb/2015:08:17:31 +1100] “GET /link/id/zzzz547fe1b77394d419/page.html HTTP/1.1” 403 505 “-” “Mozilla/5.0 (compatible; Baiduspider/2.0; +https://www.baidu.com/search/spider.html)”

     

    I have the following entries in the .htaccess file.

     

    <IfModule mod_rewrite.c>
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} ^.*(baidu|Baiduspider|HTTrack|Yandex|Majestic).*$ [NC]
    RewriteRule .* – [F,L]
    </IfModule>

    BrowserMatchNoCase baidu banned
    Deny from env=banned

    BrowserMatchNoCase “Baiduspider” bots
    BrowserMatchNoCase “HTTrack” bots
    BrowserMatchNoCase “Yandex” bots
    BrowserMatchNoCase “Baidu” bots

    Order Allow,Deny
    Allow from ALL
    Deny from env=bots

     

    Then I found that the Baidu spider requests are mostly from the 180.76.5.x and 180.76.6.x IP addresses, and I blocked these IP ranges in .htaccess.

     

    Order Allow,Deny
    Allow from ALL
    Deny from env=bots

    order allow,deny
    allow from all
    # Block access to Baiduspider
    deny from 180.76.5.0/24 180.76.6.0/24

    But still I see the baidu spider entries in the access.log.

     

    Please help me to get rid of this asap. Thank you.

  44. Hi, this is a really useful post. I have pasted my robots.txt file below. But I still see crawling from Yandex and Baiduspider. Please help me to fix this.

    User-agent: Googlebot

    Disallow: 
    User-agent: Adsbot-Google
    Disallow: 
    User-agent: Googlebot-Image
    Disallow: 
    User-agent: Googlebot-Mobile
    Disallow: 
    User-agent: MSNBot
    Disallow: 
    User-agent: bingbot
    Disallow: 
    User-agent: Slurp
    Disallow: 
    User-Agent: Yahoo! Slurp
    Disallow: 
    User-agent: MJ12bot
    Disallow: /
    User-agent: moget
    Disallow: /
    User-agent: ichiro
    Disallow: /
    User-agent: Yeti
    Disallow: /
    User-agent: NaverBot
    Disallow: /
    User-agent: sogou spider
    Disallow: /
    User-agent: YoudaoBot
    Disallow: /
    User-agent: Baiduspider
    Disallow: /
    User-agent: Baiduspider-video
    Disallow: /
    User-agent: Baiduspider-image
    Disallow: /
    User-agent: Yandex
    Disallow: /


    180.76.6.135 - - [15/Feb/2015:13:12:15 +1100] "GET / HTTP/1.1" 403 984 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +https://www.baidu.com/search/spider.html)"

    I see that the crawling from Yandex.com refers to the robots.txt file, and it seems it was not allowed to crawl my website. The crawling from Yandex.ru looks like it was allowed.

    2.93.117.172 – – [16/Feb/2015:03:54:17 +1100] “GET / HTTP/1.1” 200 11289 “https://yandex.ru/yandsearch?text=e.bom.gov.au&lr=213” “Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; WOW64; Trident/6.0; MASPJS)”

    100.43.91.14 – – [16/Feb/2015:04:04:35 +1100] “GET /robots.txt HTTP/1.1” 200 1071 “-” “Mozilla/5.0 (compatible; YandexBot/3.0; +https://yandex.com/bots)”

    100.43.91.14 – – [16/Feb/2015:04:07:09 +1100] “GET /robots.txt HTTP/1.1” 200 1071 “-” “Mozilla/5.0 (compatible; YandexBot/3.0; +https://yandex.com/bots)”

    95.221.127.107 – – [16/Feb/2015:04:08:28 +1100] “GET / HTTP/1.1” 200 9908 “https://yandex.ru/yandsearch?text=asa.i-events.info&lr=213” “Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.74 Safari/537.36 MRCHROME”

     

  45. A really technical but useful post. My site is built using WordPress, so I initially tried the method introduced at https://wpmatter.com/how-to-prevent-search-engine-index-page/. However, what that post shares was too basic for me, so I had been searching for days for something more advanced about how to prevent search engines from indexing some of my low-quality posts. What you share here is really helpful. Thanks a lot.

  46. Hello,

    domain is domain.com and subdomain is sub.domain.com

    I want to deindex sub.domain.com

    Any solutions?

     

    Thanks.

    1. Hello Darshita,

      To our knowledge, the only way to get a URL delisted from Google is to request it via Webmaster tools.

      Kindest Regards,
      Scott M

  47. Happy New Year Scott,

    Consider the Marketing Advantage of user defined servers. The buy-in is a simple form <– aka which regions and what level of outside access do you wish to allow.

    No one is doing this and its an Obvious advantage for clients!

    Please email me if inMotion catches a clue in the future.

    Best Regards!

  48. Thank You for this great article!

    My current host/website is getting pounded by crawlers, spam bots, and spiders. I’m seeing hits from wankers in Asia, France, Egypt, and morons in the US.

    It occurs to me, all of this nonsense can be rejected at the hosting server/router level before it hits a specific website user account on the host server.

    Does inmotionhosting.com offer a hosting option which denies access to all but a whitelist, for those of us who couldn’t care less about a global audience and simply seek a testbench?

    Thanks for the help!

    1. Hello John,

      We do not normally block things on a level prior to reaching an account, though we do block bots that have been identified as being malicious. Most of the bots are other search engines such as Yandex, Baidu, etc and many of our customers do not mind being in those engines as well. We also cannot tell what bots one account wants and another doesn’t so we leave it up to each account to decide who they want to visit or not.

      Kindest Regards,
      Scott M

  49. It’s a very helpful article for me; one of my accounts was suspended last night due to heavy traffic ( https://***********.in ). Now I have applied this to robots.txt, and it is working for me…

    thanks

  50. Yes, you are right, but I am a little bit surprised that none of my directory listing pages are showing in Google search ( https://directory.expertwebworld.com/search.php?cn=Computer+leasing+-+rental ). Other pages like about us, blog, portfolio, etc. are showing, but not the records submitted by visitors in the different categories. Hope you understand what I mean.

    1. Hello ExpertWebWorld,

      Thanks for getting back with us. Google has its own policies and algorithms for indexing, and will never let anyone know them so they cannot be manipulated. The best anyone can do is to work on SEO, likely with Google Webmaster Tools, to help themselves in the ranking. SEO is a relationship between web pages and Google.

      Kindest Regards,
      Scott M

  51. My website is not showing any pages in the Google listing ( directory.expertwebworld.com ). I don’t know why; I checked all my robots rules and did not disallow the Google robot. Even the meta tags say index,follow.

    1. Hello ExpertWebWorld,

      I can see your website is indexed in Google. From here you can just focus on SEO for specific keywords for your pages. But it is definitely visible in the index, so Google has noticed it.

      Kindest Regards,
      Scott M

    1. Hello D,

      Thanks for the question. Each robots.txt file applies to each domain. If you want to apply a crawl delay for each domain, then simply use the instructions above, then copy the file to each domain where you need the crawl delay to apply. You can’t do it from one location.

      I hope this helps to answer your question, please let us know if you require any further assistance.

      Regards,
      Arnel C.

    1. Hello Lybear,

      Sorry that you’re having issues with the redirect. You may want to use the official Apache documentation on mod_rewrite to determine how best to write your rule. We can’t write the rule for you, unfortunately. It does appear that you have modified the original WordPress .htaccess rule. You may want to remove the rule. Check out the rewrite rules in the articles listed for .htaccess files.

      Apologies that we cannot give you a direct answer on the issue. Hopefully, this will help direct you to a more appropriate answer.

      Regards,
      Arnel C.

  52. Hello Arn !

     

    I have followed these steps and got an error.

    Here is my error:

     

    0 # BEGIN WordPress

    1 <IfModule mod_rewrite.c>

    2 RewriteEngine On

    3 RewriteRule ^2014/11/(.*)$ $1 [L,QSA]

    4 RewriteBase /

    5 RewriteRule ^index\.php$ – [L]

    6 RewriteCond %{QUERY_STRING} !lp-variation-idThis condition was met

    7 RewriteRule ^go/([^/]*)? /wp-content/plugins/landing-pages/modules/module.redirect-ab-testing.php?permalink_name=$1 [QSA,L]

    8 RewriteRule ^landing-page=([^/]*)? /wp-content/plugins/landing-pages/modules/module.redirect-ab-testing.php?permalink_name=$1 [QSA,L]

    9 RewriteCond %{REQUEST_FILENAME} !-fThis variable is not supported: %{REQUEST_FILENAME}

    10 RewriteCond %{REQUEST_FILENAME} !-dThis variable is not supported: %{REQUEST_FILENAME}

    11 RewriteRule . /index.php [L]This rule was not met because one of the conditions was not met

    12 </IfModule>

    13 # END WordPress

     This rule was met, the new url is https://dbmakemoney.com/other-advertising-networks-besides-google-adsense/

  53. Hello Arn,

    Thanks for your reply .

    Anyway, in the robots.txt I have already done it this way:

    Disallow: /2014/11/
    but it still shows in the browser as site.com/2014/11/mysites
    I only want to show site.com/mysites

    Best regards,

    1. Hello Lybear,

      The ROBOTS.TXT file will NOT block you from accessing that folder. It only prevents search bots from going into the folder. In order to prevent your browser from using /2014/11, then you will need to create a rewrite rule in your .htaccess file. Try reviewing this forum for a rewrite rule that may help in your case.

      Kindest regards,
      Arnel C.

  54. Hello Scott ,

    I want all my sites under 2014/11/mysites to show only mysites, without the 2014/11 folder.
    Actually, I don’t know what the difference is between blocking access to the 2014/11 folder and a redirect of some sort.
    If possible, could you show me both ways?


    Best Regards,

    1. Hello Lybear,

      Thanks for the question. If you are trying to prevent search engines from accessing the directory you’re indicating, then you can use the robots.txt tutorial above for this purpose. A re-direct is used to change the path of a URL from one location to another. If you have other things that rely on that URL and the files at that location, then you may not want to do the re-direct. If you want more information on creating a re-direct, try reviewing Setting a 301 Redirect in your HTACCESS.

      I hope this helps to provide the answer that you seek. If you require further assistance, please let us know.

      Regards,
      Arnel C.
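
      For the redirect option, a minimal .htaccess sketch for dropping the /2014/11/ prefix would be something like the following; note it assumes the content is actually reachable at the new path (in WordPress, changing the permalink settings is usually the cleaner fix):

      RewriteEngine On
      # Permanently redirect /2014/11/anything to /anything
      RewriteRule ^2014/11/(.*)$ /$1 [R=301,L]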

  55. Hi !

    how can I block the folder /2014/11/?
     
    Here is where my current site is located:
    https://dbmakemoney.com/2014/11/other-advertising-networks-besides-google-adsense/

    I want to 

    https://dbmakemoney.com/other-advertising-networks-besides-google-adsense/

     

    Thanks in advance!

    1. Hello Lybear,

      What exactly are you asking? Do you want to block access to 2014/11 folder? Or are you looking to set up a redirect of some sort?

      Kindest Regards,
      Scott M

  56. Thanks John-Paul. Until a few hours ago I did not have any robots.txt rules. A few hours ago I created the robots.txt file for each site with more restrictive disallow rules instructing bots to not crawl the wp-includes folder, the theme and plugin folders and wp-admin. I’m hoping this reduces the scope and impact of the bots on the server each evening. If not, then perhaps a crawl delay would at least spread the impact out and not take down the server…

    1. Hello Neil,

      While using robots.txt and setting delays may help, search engines do not always honor the file, and some ignore parts of it outright; Google, for example, ignores the Crawl-delay rule. You can set your crawl preferences for Google from within Google’s Webmaster Tools. For other search engines, setting the delays and requests not to crawl in robots.txt is done with the expectation and hope that they will listen.

      Kindest Regards,
      Scott M

  57. Hi there,

    I have about 40 WordPress websites on one hosting account and every evening around the same time, my hosting gets sluggish and goes down for about 20 to 30 minutes. I have looked at the server logs and it looks like that’s when sites are getting crawled by Google. Previously, I haven’t had any specific robots.txt files on each site (shame on me, yes). I have added robots.txt files for all the sites with fairly restrictive disallow settings that really only give access to the wp-content folder (minus the theme and plugins). Will reducing the access to the bots significantly reduce the impact on my server when the sites are being crawled or do I also need to set a crawl delay?

    Also, only a couple of the sites are blogs and those are the only ones with a significant amount of pages. The rest are small, static sites. Would you recommend just setting a crawl delay on the large blogs that have 1,000+ pages and posts?

    Thanks!

    1. Hello Neil,

      Thank you for your question. While setting a crawl delay may help, we would need to see the nature of the requests to provide a detailed answer.

      This is because you may be getting crawled by bots that are not following your robots.txt rules. In this case a robots.txt file will not help. Instead, identify and block the specific bots from your site.

      Thank you,
      John-Paul

  58. Thanks Jean-Paul,

    Just a couple of further questions:

    I set up a subdomain to build the new site, which I want to block from the search engines.

    So what is a bit confusing is – at what level do you set the password protect?

    Should it be at the /public_html/abcdirectory/ which is the document root?

    Also, how do you test to see that the password is actually working? I set the password as above and then was immediately able to log in to the WP dashboard without having to enter a username and password…

    Am I missing something?

    Appreciate your help..

    Regards

    Greg

    1. Hello Greg,

      If you have the WordPress site in a subfolder, say like example.com/test, then you would set the password at the folder level for ‘test’. This way no one would see the site while you were developing. You may be interested in our articles on password protecting a folder within the cPanel. You can also ask your questions about passwords on that article since it is relevant.

      As for checking to see if it is working, use a browser in incognito mode so it appears to be a new visitor. You should see it ask for a username and password then. Once you have logged in with a browser in normal mode, it remembers you for a time.

      Kindest Regards,
      Scott M

  59. Hi there,

    We are faced with a situation where we have to rebuild and replace a client’s existing website with a new site, going from static HTML to WordPress…

    What is the best way to completely block the new site while in development?

    Should we use a password protect method?

    Regards

    greg 

     

    1. Hello Greg,

      Thank you for your question. You can easily block access to your new site by using the Password Protect tool in cPanel.

      That tool adds the .htaccess rules for you.

      If you have any further questions, feel free to post them below.

      Thank you,
      John-Paul

  60. Google is including my shopping cart pages in its searches. They are not in a folder that I can block, like:

    User-agent: *
    Disallow: /cgi-bin/

     

    Is there a way to block files that all begin with:

    /addtocart.sc?productId=13&quantity=1

    /addtocart.sc?productId=14&quantity=1

    /addtocart.sc?productId=23&quantity=1

    etc.?

    Thank you
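
    A rough sketch based on the article above: since Disallow: matches by prefix, a single rule should cover all of those URLs for compliant bots:

    User-agent: *
    Disallow: /addtocart.sc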

  61. Guys, I am having more problems related to SEO. My website is made in ASP.NET with the 3.5 framework, and I want a solution to the www / home.aspx 301 redirection problem: what exactly should the code for my website (www.rasavgems.com) be, and in which file should I use it? Please explain it in detail with steps.

     

    Thanks

    Ankit

    1. Hello Ankit,

      It is fine if you do not want a search engine to crawl your site. If it does not, however, it means those pages may not get updated in the search engine, or even show at all. If you wish to show up in your favorite search engines, allow them to crawl your site at a reasonable delay. You can certainly set the file to block the others.

      Kindest Regards,
      Scott M

    2. Hello Ankit,

      I am not sure exactly what it is you are asking. Please try to be a bit more detailed and give us some steps if you can. Also, as this does not seem to be related to the robots.txt file, please reply with a new question.

      Kindest Regards,
      Scott M

  62. Hey Johnpaulb

    I used the following kind of method:

    # robots.txt generated for google
    User-agent: Googlebot
    Disallow: /
    User-agent: *
    Disallow: / 
    
    
    # robots.txt generated for yahoo
    User-agent: Slurp
    Disallow: /
    User-agent: *
    Disallow: /
    
    
    # robots.txt generated for Msn
    User-agent: MSNBot
    Disallow: /
    User-agent: *
    Disallow: /
    
    
    # robots.txt generated for ask
    User-agent: Teoma
    Disallow: /
    User-agent: *
    Disallow: /
    
    
    # robots.txt generated for bingbot
    User-agent: bingbot
    Disallow: /
    User-agent: *
    Disallow: /
    
    
    Please suggest to me: is it okay for my site to stop the search engines from crawling it this way? I uploaded a robots.txt file using all of the above methods together in one robots.txt file.

  63. Hello!

    I am currently developing a larger website, and while it is still in development I’d prefer that search engines do not crawl through it, that is, until I am finished. This way I can post the site so that multiple developers can code and test without the world knowing the site exists on Google and such. It seems to me that the code above would do that; am I correct in my assessment?

    Thanks,

    Jay

    1. Hello Jay,

      Unfortunately, most search engines, including Bing and Google, are paying less attention to the robots.txt file. The best way to prevent anyone else from seeing the site, or having the search engines index it until you are ready, is to password protect the site via the cPanel.

      Kindest Regards,
      Scott M

  64. With regards to the crawl delay, do I understand this correctly: if you introduce a longer delay for a bot to crawl your site, it doesn’t reduce the CPU load, it merely spreads it out over a longer period?

    1. Hello Andy,

      Yes you understand the crawl delay for robots correctly, it just causes the robot’s requests to be spread out over a longer time period. But much like a highway dealing with traffic jams, high amounts of usage during short intervals of time can cause back ups and delays, but if the usage is spread out over the course of a day it’s not as noticeable on the highway or server and that’s typically what you’re trying to achieve with a crawl delay.

      Please let us know if you had any further questions at all.

      – Jacob

  65. I would like to disallow semalt and semalt-semalt crawlers from wreaking havoc on my bounce rate. If I use the code to disallow one particular search engine, do I need to write this code twice? Once for each individual crawler? Or maybe a comma between them? Thank you

    1. Hello Mark,

      Thank you for your question. It seems to be a common problem, judging by the amount of search results.

      I found the following solution via online search, where it is blocked by referrer:

      # block visitors referred from semalt.com
      RewriteEngine on
      RewriteCond %{HTTP_REFERER} semalt\.com [NC]
      RewriteRule .* - [F]

      If you have any further questions, feel free to post them below.
      Thank you,

      -John-Paul

  66. I have looked for info about robots.txt on the web numerous times, and this is the only one that made sense. Thank you so much!!!

    1. As the robots.txt file only determines what files are able to be accessed, unfortunately you would not be able to block a specific domain if it uses the same files as another site that you do want to be accessed.

  67. That idea of blocking search engines worked perfectly on my site.

    Thanks for the precise example you have in the lower half.

    Abhi
