Creating robots.txt file for site

  • Answered
To whom it may concern,

I would like assistance with creating an effective robots.txt file for my site. I've reviewed the article about how to block robots, but I'm still not sure what code best fits my needs. Ultimately, I would like the following (if possible) out of a robots.txt file:

- block spammers - i.e. junk mail and others that could compromise site security/integrity
- enable search engines to index pages/posts on my site
- anything else that could assist in SEO (page rank)

Any assistance would be greatly appreciated.

Regards

Site: http://picneewatch.com
Scott
Hello Rockit,

The code you listed should stop the MSNbot. Are you still seeing it after enabling that code? While you can set a delay, I would personally use the block. The format for a delay, should you decide to use it, is below (the value is the number of seconds to wait between requests):
User-agent: *
Crawl-delay:
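For reference, a full block of that kind would typically look like this in robots.txt (a sketch; it assumes the bot identifies itself as msnbot and actually honors the file):

# Deny msnbot all access (only effective if the bot obeys robots.txt)
User-agent: msnbot
Disallow: /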
Kindest Regards, Scott M
JacobIMH
Hello picneewatch, and thank you for your question.

I'm assuming the article you read was our one about stopping search engines from crawling your website. With a robots.txt file you can control the way in which search engine crawlers, bots, and spiders index your website. It's important to note, however, that the crawlers that typically abide by these rules are the good, popular search engines such as Google, Bing, and Yahoo. In your requirements you mention blocking spammers, or bad robots. You can attempt that in a robots.txt file, but bad bots won't follow the rules in all cases: robots.txt rules are not enforced by the server, but by the bot reading the file, and a bot can simply ignore them.

I've written two fairly extensive articles on how to identify and block bad robots from your website, as well as how to block unwanted users from your site using .htaccess. Both of these methods use the server's .htaccess file, which enforces your rules directly on the server side and doesn't let the robots pick and choose what to follow.

In that first article, the steps mentioned for discovering bad bots require either a VPS with SSH access, or downloading your raw access logs in cPanel and sorting them that way. Here is a snapshot of your current User-Agents from today's access logs, sorted by number of hits (a command for producing a listing like this follows below):
1 Mediapartners-Google
1 Mozilla/5.0 (compatible; NetSeer crawler/2.0)
1 Mozilla/5.0 (compatible; proximic; +http://www.proximic.com/info/spider.php)
3 Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)
3 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_7) Safari/534.30
4 Googlebot-Image/1.0
5 Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
6 Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)
16 Mozilla/5.0 (compatible; MJ12bot; http://www.majestic12.co.uk/bot.php?+)
38 Mozilla/5.0 (Windows NT 6.1; rv:22.0) Gecko/20100101 Firefox/22.0
41 Mozilla/5.0 (Windows NT 6.1) Safari/537.36
48 Mozilla/4.0 (compatible; MSIE 8.0)
96 msnbot/2.0b (+http://search.msn.com/msnbot.htm)
623 Mozilla/5.0 (compatible; MSIE 10.0)
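If you have SSH access, a listing like the one above can be generated with a one-liner along these lines (a sketch; it assumes a combined-format Apache access log saved as access.log, where the User-Agent is the sixth quote-delimited field):

# Count requests per User-Agent (6th double-quote-delimited field), lowest counts first
awk -F'"' '{print $6}' access.log | sort | uniq -c | sort -n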
So if you wanted to control all of these robots via your robots.txt file, allowing them all access but restricting them to crawling one page every 30 seconds, you'd use:
User-Agent: *
Crawl-delay: 30
Below that rule you could then add more specific ones. For instance, if you didn't want to allow the MJ12bot robot to crawl your website at all:
User-Agent: MJ12bot
Disallow: /
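Put together, a complete robots.txt covering both of those rules might look like this (just the two examples above combined; adjust to taste):

# Allow everyone, but limit crawling to one page every 30 seconds
User-agent: *
Crawl-delay: 30

# Block MJ12bot entirely (again, only honored by well-behaved bots)
User-agent: MJ12bot
Disallow: /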
A lot of the time you'll notice a URL at the end of the User-Agent string, such as http://www.majestic12.co.uk/bot.php?+. When you go there, in this case, it tells you the exact lines to put in your robots.txt file to block the bot. As I said, not every bot provides this tidbit of information, and in some cases I've seen bots that do, yet still ignore the rules you've placed in robots.txt. This is where using .htaccess rules to restrict access server-side comes into play.

Based on your logs, I'd recommend blocking MJ12bot, AhrefsBot, proximic, and NetSeer, unless you feel a strong need to allow them for some reason. In your .htaccess file you'd use the following code to deny them all access:
RewriteEngine On
# Match any User-Agent that contains one of these bot names (case-insensitive)
RewriteCond %{HTTP_USER_AGENT} ^.*(AhrefsBot|MJ12bot|NetSeer|proximic).*$ [NC]
# Return a 403 Forbidden and stop processing further rewrite rules
RewriteRule .* - [R=403,L]
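Once those lines are in place, a quick way to test the block from a shell is to send a request that pretends to be one of the bots (a hypothetical check, assuming you have curl available):

# Send a HEAD request with a matching User-Agent; expect "403 Forbidden" back
curl -I -A "MJ12bot" http://picneewatch.com/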
This gives any User-Agent that matches those bots a 403 access denied error, so they can't index your site or use up any server resources.

In regards to SEO, as long as you're allowing the major search engines to crawl and index your site, you should be fine. There really isn't anything to do in your robots.txt or .htaccess files to increase your page rank. The one related item is that most CMS applications these days let you use SEF (Search Engine Friendly) URLs, which are handled by the .htaccess file. I see you already have this set up, and search engines prefer URLs like:
how-we-can-help-you/for-individuals/elder-care/
Instead of a more code friendly URL like:
index.php?catid=10&articleid=45
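For reference, SEF URLs like that are typically produced by a front-controller rewrite in .htaccess along these lines (a sketch of the common pattern, not your site's exact rules):

RewriteEngine On
# If the request isn't an existing file or directory...
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
# ...route it to index.php, which maps the friendly URL to the right content
RewriteRule . index.php [L]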
The field of SEO has actually changed drastically this year, and you might be interested in reading about Google Penguin and its impact on SEO. Going forward, sites that focus on unique, quality content, rather than small tricks to reach the front page, are going to be the ones holding the top search positions.

I hope that clearly answers your questions. I'm sure you'll have some follow-ups, as this is a lot of information, so please let us know if you need anything else at all!

- Jacob