What Is A Robots.txt File?

The robots.txt file lets you instruct search engine spiders which directories and files they are allowed to crawl and index.

When a compliant search engine robot visits a site, it first checks for a "robots.txt" file on the server. If the file exists, the robot reads its contents for instructions on what it may crawl and index. Note that search engine robots are under no obligation to follow the instructions in the robots.txt file; however, most reputable robots do honor them.
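As a rough illustration of that check, here is a minimal sketch using Python's standard urllib.robotparser module to fetch a site's robots.txt and ask whether a given robot may crawl a page (the site URL and page paths are placeholders, not real addresses):

import urllib.robotparser

# Fetch and parse the site's robots.txt file (placeholder URL).
rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.yoursite.com/robots.txt")
rp.read()

# A compliant robot asks before crawling each page.
print(rp.can_fetch("Googlebot", "http://www.yoursite.com/members/login.html"))
print(rp.can_fetch("*", "http://www.yoursite.com/index.html"))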

Why Would You Use A Robots.txt File?

There are a number of reasons why you may want to stop a search engine from crawling and indexing particular sections of your site.

These include:

  • Search engine optimized pages. For example, let's say you optimize separate copies of a webpage specifically for Google, AltaVista and Inktomi. You would not want one engine to index the copies designed for the other engines; otherwise it could view them as duplicates created to spam its index, which could result in a ban.
  • Hiding sensitive content, such as internal reports and content not ready to be published. (Keep in mind that robots.txt only asks compliant robots not to crawl these files; it does not actually protect them, and anyone can read the robots.txt file itself.)

How To Create A Robots.txt File

To create a robots.txt file:

1. Create a blank text file using a text editor that can save plain ASCII .txt files. You can use WordPad or Notepad, which come with Windows; you should find them under Start Menu -> Programs -> Accessories.

2. Insert instructions for each search engine robot, using the following syntax:

User-agent: Robot Name

Disallow: File or Directory Name

User-agent - The user-agent is the name of the search engine robot the record applies to. You may also list more than one robot, each on its own User-agent line, if the same exclusion is to apply to them all (see the example below). Robot names are not case sensitive, so "googlebot" is the same as "GOOGLEBOT." An asterisk "*" matches all robots.

Disallow - Disallow instructs the robot specified in the user-agent which directories or files you do not want crawled or indexed.
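For example, to apply the same exclusion to more than one robot, list each robot on its own User-agent line within the same record (the robot names and the /private/ directory are just for illustration):

User-agent: Googlebot
User-agent: Slurp
Disallow: /private/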

Here are some sample robots.txt file instructions:

Exclude all robots from the entire website (not recommended!):

User-agent: *
Disallow: /

Allow all robots access to the entire website. Because nothing is disallowed, everything is allowed:

User-agent: *
Disallow:

Alternatively, create a blank robots.txt file.

Allow all robots access to all files and directories, except the two directories listed:

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/

Allow Google's robot access to all files and directories, except the cgi-bin directory:

User-agent: Googlebot
Disallow: /cgi-bin/

Allow Google's robot access to all files and directories, except the file listed:

User-agent: Googlebot
Disallow: /members/login.html

A blank line starts a new "record" - a new User-agent section. Block Googlebot from the entire website, while blocking all other robots only from the cgi-bin directory:

User-agent: Googlebot
Disallow: /

User-agent: *
Disallow: /cgi-bin/

Allow Googlebot complete access, but exclude all other robots:

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /

Disallow all robots access to any file or directory whose path begins with a certain value:

User-agent: *
Disallow: /image
Disallow: /image/

The first disallow line blocks any path beginning with /image, including the /image/ and /images/ directories and files such as /image.html and /images.html.

The second disallow line blocks only the /image/ directory, leaving files such as /image.html and /images.html crawlable.

You may add as many Disallow lines as you need to exclude additional directories and pages. Each Disallow line applies to the User-agent line(s) of the record it appears in.

Instruct all robots not to crawl dynamically generated pages. Note that wildcards such as "*" within a path are an extension supported by Google and other major engines rather than part of the original robots.txt standard, so some robots may ignore them:

User-agent: *
Disallow: /*?

Instruct all robots not to crawl files ending in .gif (the trailing "$" anchors the pattern to the end of the URL; like path wildcards, it is an extension supported by Google and other major engines):

User-agent: *
Disallow: /*.gif$

3. Save the file as "robots.txt" (must be all lower case) and upload it to the root or top directory on your server.

For example:

http://www.yoursite.com/robots.txt <-- root directory

Note that there can only be one "robots.txt" file for each site and, strictly speaking, for each hostname, so a subdomain needs its own file.
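If you want to test your rules before uploading the file, one quick way (assuming Python is available) is to feed the same directives to the standard urllib.robotparser module and check a few URLs. The rules and URLs below are only examples, and this parser follows the original prefix-matching rules, so wildcard patterns such as /*.gif$ may not be handled:

import urllib.robotparser

# The same directives you would put in robots.txt, one per line.
rules = [
    "User-agent: *",
    "Disallow: /cgi-bin/",
    "Disallow: /images/",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

# can_fetch() returns True if the named robot may crawl the URL.
print(rp.can_fetch("Googlebot", "http://www.yoursite.com/cgi-bin/search.pl"))  # False
print(rp.can_fetch("Googlebot", "http://www.yoursite.com/about.html"))         # True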