Saturday, February 6, 2010

Robots.txt Guidelines

What is a WWW robot?

A robot is a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced.

Note that "recursive" here doesn't limit the definition to any specific traversal algorithm; even if a robot applies some heuristic to the selection and order of documents to visit and spaces out requests over a long period of time, it is still a robot.

Normal Web browsers are not robots, because they are operated by a human, and don't automatically retrieve referenced documents (other than inline images).

Web robots are sometimes referred to as Web Wanderers, Web Crawlers, or Spiders. These names are a bit misleading as they give the impression the software itself moves between sites like a virus; this is not the case: a robot simply visits sites by requesting documents from them.

What is an agent?

The word "agent" has many meanings in computing these days. Specifically:

Autonomous agents

are programs that do travel between sites, deciding for themselves when to move and what to do. These can only travel between special servers, and are currently not widespread on the Internet.

Intelligent agents

are programs that help users with things such as choosing a product, guiding a user through form filling, or even helping users find things. These generally have little to do with networking.

User-agent

is a technical name for programs that perform networking tasks for a user, such as Web User-agents like Netscape Navigator and Microsoft Internet Explorer, and Email User-agents like Qualcomm Eudora.

What is a search engine?

A search engine is a program that searches through some dataset. In the context of the Web, the term "search engine" is most often used for search forms that search through databases of HTML documents gathered by a robot.

What kinds of robots are there?

Robots can be used for a number of purposes:

  • Indexing
  • HTML validation
  • Link validation
  • "What's New" monitoring
  • Mirroring

So what are Robots, Spiders, Web Crawlers, Worms, Ants?

Robots

the generic name, see above.

Spiders

same as robots, but sounds cooler in the press.

Worms

same as robots, although technically a worm is a replicating program, unlike a robot.

Web crawlers

same as robots, but note that WebCrawler is a specific robot.

WebAnts

distributed cooperating robots.

Aren't robots bad for the web?

There are a few reasons people believe robots are bad for the Web:

  • Certain robot implementations can overload networks and servers (and have done so in the past). This happens especially with people who are just starting to write a robot; these days there is sufficient information on robots to prevent some of these mistakes.
  • Robots are operated by humans, who make mistakes in configuration, or simply don't consider the implications of their actions. This means people need to be careful, and robot authors need to make it difficult for people to make mistakes with bad effects.
  • Web-wide indexing robots build a central database of documents, which doesn't scale too well to millions of documents on millions of sites.

But at the same time the majority of robots are well designed, professionally operated, cause no problems, and provide a valuable service in the absence of widely deployed better solutions.

So no, robots aren't inherently bad, nor inherently brilliant, and need careful attention.

How does a robot decide where to visit?

This depends on the robot, each one uses different strategies. In general they start from a historical list of URLs, especially of documents with many links elsewhere, such as server lists, "What's New" pages, and the most popular sites on the Web.

Most indexing services also allow you to submit URLs manually, which will then be queued and visited by the robot.

Sometimes other sources of URLs are used, such as scanning USENET postings, published mailing list archives, etc.

Given those starting points a robot can select URLs to visit and index, and to parse and use as a source for new URLs.

How does an indexing robot decide what to index?

If an indexing robot knows about a document, it may decide to parse it, and insert it into its database. How this is done depends on the robot: Some robots index the HTML Titles, or the first few paragraphs, or parse the entire HTML and index all words, with weightings depending on HTML constructs, etc. Some parse the META tag, or other special hidden tags.
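As a rough illustration of the simplest of those approaches, here is a minimal Python sketch that extracts just the HTML title for indexing, using the standard html.parser module (real indexers are far more elaborate):

from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    # Collects the text inside the <title> element.
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

p = TitleExtractor()
p.feed("<html><head><title>Example Page</title></head><body>Hello</body></html>")
print(p.title)   # prints: Example Page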

We hope that as the Web evolves more facilities become available to efficiently associate metadata such as indexing information with a document. This is being worked on...

About /robots.txt

In a nutshell

Web site owners use the /robots.txt file to give instructions about their site to web robots; this is called The Robots Exclusion Protocol.

It works like this: a robot wants to visit a Web site URL, say http://www.example.com/welcome.html. Before it does so, it first checks for http://www.example.com/robots.txt, and finds:

User-agent: *
Disallow: /

The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.

There are two important considerations when using /robots.txt:

  • robots can ignore your /robots.txt. In particular, malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers, will pay no attention.
  • the /robots.txt file is a publicly available file. Anyone can see what sections of your server you don't want robots to use.

So don't try to use /robots.txt to hide information.

The details

The /robots.txt is a de-facto standard, and is not owned by any standards body. There are two historical descriptions: the original 1994 document, A Standard for Robot Exclusion, and a 1997 Internet Draft specification, A Method for Web Robots Control.

In addition there are external resources, such as the HTML 4.01 specification (Appendix B.4.1) and the Wikipedia article on the Robots Exclusion Standard.

The /robots.txt standard is not actively developed. See What about further development of /robots.txt? for more discussion.

The rest of this page gives an overview of how to use /robots.txt on your server, with some simple recipes. To learn more see also the FAQ.

How to create a /robots.txt file

Where to put it

The short answer: in the top-level directory of your web server.

The longer answer:

When a robot looks for the "/robots.txt" file for a URL, it strips the path component from the URL (everything from the first single slash), and puts "/robots.txt" in its place.

For example, for "http://www.example.com/shop/index.html", it will remove the "/shop/index.html", replace it with "/robots.txt", and end up with "http://www.example.com/robots.txt".
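In Python, that derivation might look like this (a sketch using only the standard urllib.parse module):

from urllib.parse import urlsplit

def robots_txt_url(url):
    # Keep only the scheme and host; everything from the first
    # single slash is replaced by "/robots.txt".
    parts = urlsplit(url)
    return parts.scheme + "://" + parts.netloc + "/robots.txt"

print(robots_txt_url("http://www.example.com/shop/index.html"))
# prints: http://www.example.com/robots.txt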

So, as a web site owner you need to put it in the right place on your web server for that resulting URL to work. Usually that is the same place where you put your web site's main "index.html" welcome page. Where exactly that is, and how to put the file there, depends on your web server software.

Remember to use all lower case for the filename: "robots.txt", not "Robots.TXT".

What to put in it

The "/robots.txt" file is a text file, with one or more records. Usually contains a single record looking like this:

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/

In this example, three directories are excluded.

Note that you need a separate "Disallow" line for every URL prefix you want to exclude -- you cannot say "Disallow: /cgi-bin/ /tmp/" on a single line. Also, you may not have blank lines in a record, as they are used to delimit multiple records.

Note also that globbing and regular expressions are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "User-agent: *bot*", "Disallow: /tmp/*" or "Disallow: *.gif".
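Put differently, each Disallow value is matched as a simple prefix against the URL path. Here is a minimal sketch of that matching rule in Python (a hypothetical helper, not taken from any particular robot):

def disallowed(path, disallow_values):
    # A Disallow value is a plain prefix match against the URL path;
    # an empty value disallows nothing. No wildcards, no regexps.
    return any(value and path.startswith(value) for value in disallow_values)

rules = ["/cgi-bin/", "/tmp/", "/~joe/"]
print(disallowed("/tmp/scratch.html", rules))   # True
print(disallowed("/index.html", rules))         # False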

What you want to exclude depends on your server. Everything not explicitly disallowed is considered fair game to retrieve. Here follow some examples:

To exclude all robots from the entire server
User-agent: *
Disallow: /

To allow all robots complete access
User-agent: *
Disallow:

(or just create an empty "/robots.txt" file, or don't use one at all)

To exclude all robots from part of the server
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/

To exclude a single robot
User-agent: BadBot
Disallow: /

To allow a single robot
User-agent: Google
Disallow:

User-agent: *
Disallow: /

To exclude all files except one

This is currently a bit awkward, as there is no "Allow" field. The easy way is to put all files to be disallowed into a separate directory, say "stuff", and leave the one file in the level above this directory:

User-agent: *
Disallow: /~joe/stuff/

Alternatively you can explicitly list each disallowed page:

User-agent: *
Disallow: /~joe/junk.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html

About the Robots <META> tag

In a nutshell

You can use a special HTML <META> tag to tell robots not to index the content of a page, and/or not scan it for links to follow.

For example:

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

There are two important considerations when using the robots <META> tag:

  • robots can ignore your <META> tag. Especially malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers will pay no attention.
  • the NOFOLLOW directive only applies to links on this page. It's entirely likely that a robot might find the same links on some other page without a NOFOLLOW (perhaps on some other site), and so still arrive at your undesired page.

Don't confuse this NOFOLLOW with the rel="nofollow" link attribute.

The details

Like the /robots.txt, the robots META tag is a de-facto standard. It originated from a "birds of a feather" meeting at a 1996 distributed indexing workshop, and was described in meeting notes.

The META tag is also described in the HTML 4.01 specification, Appendix B.4.1.

The rest of this page gives an overview of how to use the robots <META> tags in your pages, with some simple recipes. To learn more see also the FAQ.

How to write a Robots Meta Tag

Where to put it

Like any <META> tag it should be placed in the HEAD section of an HTML page, as in the example above. You should put it in every page on your site, because a robot can encounter a deep link to any page on your site.

What to put into it

The "NAME" attribute must be "ROBOTS".

Valid values for the "CONTENT" attribute are: "INDEX", "NOINDEX", "FOLLOW", "NOFOLLOW". Multiple comma-separated values are allowed, but obviously only some combinations make sense. If there is no robots <META> tag, the default is "INDEX,FOLLOW", so there's no need to spell that out. That leaves:

<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">
<META NAME="ROBOTS" CONTENT="INDEX, NOFOLLOW">
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

How do I register my page with a robot?

You guessed it, it depends on the service :-) Many services have a link to a URL submission form on their search page, or have more information in their help pages. For example, Google has Information for Webmasters.

How do I get the best listing in search engines?

This is referred to as "SEO" -- Search Engine Optimisation. Many web sites, forums, and companies exist that aim/claim to help with that.

But it basically comes down to this:

  • In your site design, use text rather than images and Flash for important content
  • Make your site work with JavaScript, Java and CSS disabled
  • Organise your site such that you have pages that focus on a particular topic
  • Avoid HTML frames and iframes
  • Use normal URLs, avoiding links that look like form queries (http://www.example.com/engine?id)
  • Market your site by having other relevant sites link to yours
  • Don't try to cheat the system (by stuffing your pages of keywords, or attempting to target specific content at search engines, or using link farms)

Can I use /robots.txt or meta tags to remove offensive content on some other site from a search engine?

No, because those tools can only be used by the person controlling the content on that site.

You will have to contact the site and ask them to remove the offensive content, and ask them to take steps to remove it from the search engine too. That usually involves using /robots.txt, and then using the search engine's tools to request removal of the content. For example, see: How can I prevent content from being indexed or remove content from Google's index.

If that fails, you can try contacting the search engine administrators directly to ask for help, but they are likely to only remove content if it is a legal matter. For example, see: How can I inform Google about a legal matter?

How do I know if I've been visited by a robot?

You can check your server logs for sites that retrieve many documents, especially in a short time.

If your server supports User-agent logging you can check for retrievals with unusual User-agent header values.

Finally, if you notice a site repeatedly checking for the file '/robots.txt' chances are that is a robot too.
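For example, here is a quick Python sketch that counts which clients request '/robots.txt' in an access log. It assumes an Apache-style log format where the client address is the first field and the request line appears in quotes; adjust the parsing for your own server:

from collections import Counter

suspects = Counter()
with open("access.log") as log:          # path is just an example
    for line in log:
        if '"GET /robots.txt' in line:
            suspects[line.split()[0]] += 1   # first field: client address

for client, hits in suspects.most_common(10):
    print(client, hits)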

I've been visited by a robot! Now what?

Well, nothing :-) The whole idea is they are automatic; you don't need to do anything.

If you think you have discovered a new robot (i.e. one that is not listed on the list of active robots), and it does more than sporadic visits, drop me a line so I can make a note of it for future reference. But please don't tell me about every robot that happens to drop by!

A robot is traversing my whole site too fast!

This is called "rapid-fire", and people usually notice it if they're monitoring or analysing an access log file.

First of all, check if it is a problem by checking the load of your server, monitoring your server's error log, and watching concurrent connections if you can. If you have a medium or high performance server, it is quite likely to cope with a load of even several requests per second, especially if the visits are quick.

However you may have problems if you have a low performance site, such as your own desktop PC or Mac you're working on, or you run low performance server software, or if you have many long retrievals (such as CGI scripts or large documents). These problems manifest themselves in refused connections, a high load, performance slowdowns, or in extreme cases a system crash.

If this happens, there are a few things you should do. Most importantly, start logging information: when did you notice, what happened, what do your logs say, what are you doing in response, etc.; this helps when investigating the problem later. Secondly, try to find out where the robot came from, what IP addresses or DNS domains, and see if they are mentioned in the list of active robots. If you can identify a site this way, you can email the person responsible and ask them what's up. If this doesn't help, try their own site for telephone numbers, or mail postmaster at their domain.

If the robot is not on the list, mail me with all the information you have collected, including actions on your part. If I can't help, at least I can make a note of it for others.

Why do I find entries for /robots.txt in my log files?

They are probably from robots trying to see if you have specified any rules for them using the Standard for Robot Exclusion, see also below.

If you don't care about robots and want to prevent the messages in your error logs, simply create an empty file called robots.txt in the root level of your server.

Don't put any HTML or English language "Who the hell are you?" text in it -- it will probably never get read by anyone :-)

How do I prevent robots scanning my site?

The quick way to prevent robots visiting your site is to put these two lines into the /robots.txt file on your server:

User-agent: *
Disallow: /

but this only helps with well-behaved robots.

Where do I find out how /robots.txt files work?

You can read the whole standard specification but the basic concept is simple: by writing a structured text file you can indicate to robots that certain parts of your server are off-limits to some or all robots. It is best explained with an example:

# /robots.txt file for http://webcrawler.com/
# mail webmaster@webcrawler.com for constructive criticism
 
User-agent: webcrawler
Disallow:
 
User-agent: lycra
Disallow: /
 
User-agent: *
Disallow: /tmp
Disallow: /logs

The first two lines, starting with '#', specify a comment.

The first paragraph specifies that the robot called 'webcrawler' has nothing disallowed: it may go anywhere.

The second paragraph indicates that the robot called 'lycra' has all relative URLs starting with '/' disallowed. Because all relative URLs on a server start with '/', this means the entire site is closed off.

The third paragraph indicates that all other robots should not visit URLs starting with /tmp or /logs. Note the '*' is a special token meaning "any other User-agent"; you cannot use wildcard patterns or regular expressions in either User-agent or Disallow lines.
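Choosing which record applies is itself a simple test: the original standard recommends a case-insensitive substring match between a record's User-agent value and the robot's own name, falling back to the '*' record. A hypothetical Python sketch, mirroring the example file above:

def select_disallows(robot_name, records):
    # records: list of (user_agent_value, disallow_values) pairs.
    # Case-insensitive substring match; '*' is the fallback record.
    for ua, disallows in records:
        if ua != "*" and ua.lower() in robot_name.lower():
            return disallows
    for ua, disallows in records:
        if ua == "*":
            return disallows
    return []   # no applicable record: nothing is disallowed

records = [("webcrawler", []), ("lycra", ["/"]), ("*", ["/tmp", "/logs"])]
print(select_disallows("WebCrawler/3.0", records))    # []
print(select_disallows("SomeNewBot/1.0", records))    # ['/tmp', '/logs']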

Two common errors:

  • Wildcards are _not_ supported: instead of 'Disallow: /tmp/*' just say 'Disallow: /tmp/'.
  • You shouldn't put more than one path on a Disallow line (this may change in a future version of the spec)

How do I use /robots.txt on a virtual host?

The term "virtual host" is sometimes used to mean various different things:

A "virtual host" web server uses the HTTP Host Header to distinguish requests to different domain names on the same IP address. In this case the fact that the domain is on a shared host makes no difference to a visitng robot, and you can put a /robots.txt file in the directory dedicated to your domain.

A "virtual server" runs a separate operating system on a virtual machine, like VMWare or Xen. Again, to a robot that's a separate computer.

How do I use /robots.txt on a shared host?

If you share a host with other people, and you have a URL like http://www.example.com/~username/ or http://www.example.com/username, then you can't have your own /robots.txt file. If you want to use /robots.txt you'll have to ask the host administrator to help you.

What about further development of /robots.txt?

There are no efforts on this site to further develop /robots.txt, and I am not aware of technical standards bodies like the IETF or W3C working in this area.

There are some industry efforts to extend robots exclusion mechanisms. See for example the collaborative efforts announced on Yahoo! Search Blog, Google Webmaster Central Blog, and Microsoft Live Search Webmaster Team Blog, which includes wildcard support, sitemaps, extra META tags etc.

It is of course important to realise that other, older robots may not support these newer mechanisms. For example, if you use "Disallow: /*.pdf$", and a robot does not treat '*' and '$' as wildcard and anchor characters, then your PDF files are not excluded.

What if I cannot make a /robots.txt?

Sometimes you cannot make a /robots.txt file, because you don't administer the entire server. All is not lost: there is a new standard for using HTML META tags to keep robots out of your documents.

The basic idea is that if you include a tag like:

<META NAME="ROBOTS" CONTENT="NOINDEX">

in your HTML document, that document won't be indexed.

If you do:

<META NAME="ROBOTS" CONTENT="NOFOLLOW">

the links in that document will not be parsed by the robot.

Can I block just bad robots?

In theory yes; in practice, no. If the bad robot obeys /robots.txt, and you know the name it scans for in the User-Agent field, then you can create a section in your /robots.txt to exclude it specifically. But almost all bad robots ignore /robots.txt, making that pointless.

If the bad robot operates from a single IP address, you can block its access to your web server through server configuration or with a network firewall.

If copies of the robot operate at lots of different IP addresses, such as hijacked PCs that are part of a large botnet, then it becomes more difficult. The best option then is to use advanced firewall rules that automatically block access to IP addresses making many connections; but that can hit good robots as well as the bad ones.

Why did this robot ignore my /robots.txt?

It could be that it was written by an inexperienced software writer. Occasionally schools set their students "write a web robot" assignments. But these days it's more likely that the robot was explicitly written to scan your site for information to abuse: it might be collecting email addresses to send spam to, looking for forms to post links to ("spamdexing"), or probing for security holes to exploit.

Can a /robots.txt be used in a court of law?

There is no law stating that /robots.txt must be obeyed, nor does it constitute a binding contract between site owner and user, but having a /robots.txt can be relevant in legal cases.

Obviously, IANAL, and if you need legal advice, obtain professional services from a qualified lawyer.

Some high-profile legal cases have involved /robots.txt, and you'll find lots of discussion over at Groklaw.

Surely listing sensitive files is asking for trouble?

Some people are concerned that listing pages or directories in the /robots.txt file may invite unintended access. There are two answers to this.

The first answer is a workaround: you could put all the files you don't want robots to visit in a separate subdirectory, make that directory un-listable on the web (by configuring your server), and list only the directory name in the /robots.txt. Now an ill-willed robot can't traverse that directory unless you or someone else puts a direct link on the web to one of your files, and then it's not /robots.txt's fault.

For example, rather than:

User-Agent: *
Disallow: /foo.html
Disallow: /bar.html

do:

User-Agent: *
Disallow: /norobots/

and make a "norobots" directory, put foo.html and bar.html into it, and configure your server to not generate a directory listing for that directory. Now all an attacker would learn is that you have a "norobots" directory, but he won't be able to list the files in there; he'd need to guess their names.

However, in practice this is a bad idea -- it's too fragile. Someone may publish a link to your files on their site. Or it may turn up in a publicly accessible log file, say of your users' proxy server, or maybe it will show up in someone's web server log as a Referer. Or someone may misconfigure your server at some future date, "fixing" it to show a directory listing. Which leads me to the real answer:

The real answer is that /robots.txt is not intended for access control, so don't try to use it as such. Think of it as a "No Entry" sign, not a locked door. If you have files on your web site that you don't want unauthorized people to access, then configure your server to do authentication, and configure appropriate authorization. Basic Authentication has been around since the early days of the web (and in e.g. Apache on UNIX is trivial to configure). Modern content management systems support access controls on individual pages and collections of resources.

What is the rel="nofollow" link attribute?

rel="nofollow" is an attribute you can set on an HTML link tag, invented by Google and adopted by others. Links carrying it get no credit when Google ranks websites in the search results, which removes the main incentive for blog comment spammers' robots.

See Preventing comment spam on the Official Google Blog.

From that description it sounds like it only affects the ranking, and the Google robot may still follow the links and index them. If so, it is different from the robots meta tag NOFOLLOW semantics.