Importance of Robots.txt

Few webmasters and fewer website owners realize that their website is incomplete without a good robots.txt file. It is a little known and often ignored fact that a good robots.txt file is very important if you want to achieve success with search engines. This article will explain the importance of a robots.txt file if you don’t have one, and help you improve it if you already have one.

What is robots.txt?

Every time a search engine visits your website, the first thing it looks for is the robots.txt file. The major search engines respect the robots.txt file and look to it for guidance on which pages to crawl and which pages NOT to visit. Compliance is voluntary, however, so poorly behaved crawlers may ignore it. The robots.txt is a simple yet extremely powerful file and a must-have for every website.

Getting started with robots.txt

As the name suggests, a robots.txt file is a text file — not an HTML page. It is located in the “root folder” of your website. In other words, if your website is “mycompany.com”, then this file should be accessible at: https://www.mycompany.com/robots.txt.
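
As a rough illustration of the rule above, the following Python sketch derives the robots.txt address from any page address on the same site. The function name robots_txt_url and the sample page address are made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt address for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host; replace the path with /robots.txt and
    # drop any query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page on the example site used in this article.
print(robots_txt_url("https://www.mycompany.com/service/index.html"))
```

Whatever page you start from, the result always points at the root folder of the site.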

Creating a robots.txt file

All you need is a text editor like Notepad to create a robots.txt file. Start a new text file with Notepad, save it as “robots.txt” and you are ready to go.

Contents of robots.txt

A robots.txt file can contain a number of lines. These lines are grouped into sets of instructions to search engines called “records”. Each record in the robots.txt file consists of two elements: a) User Agent and b) Instruction. The User Agent line allows you to give instructions to specific search engines or to all search engines. Through the instruction line, you can indicate which content the search engine should ignore, where the sitemaps are located, etc.
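
To make the idea of records concrete, here is a minimal Python sketch that groups the lines of a robots.txt file into records. It is a simplified reading of the format and not a complete parser; parse_records and the dictionary keys are names invented for this example.

```python
def parse_records(text):
    """Group robots.txt lines into records (simplified sketch).

    Returns (records, sitemaps). Each record is a dict holding the
    user agents it addresses and its Allow/Disallow instructions.
    """
    records, sitemaps = [], []
    current = None
    last_was_agent = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # ignore anything after '#'
        if ":" not in line:
            continue  # blank or unrecognised line
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            # Consecutive User-agent lines share one record.
            if not last_was_agent:
                current = {"user_agents": [], "instructions": []}
                records.append(current)
            current["user_agents"].append(value)
            last_was_agent = True
            continue
        last_was_agent = False
        if field in ("allow", "disallow") and current is not None:
            current["instructions"].append((field, value))
        elif field == "sitemap":
            sitemaps.append(value)
    return records, sitemaps


sample = """\
User-agent: googlebot
Disallow: /staging/

User-agent: *
Disallow: /service-center/
Sitemap: https://www.mycompany.com/sitemap.xml
"""
records, sitemaps = parse_records(sample)
print(records)
print(sitemaps)
```

The sample text yields two records, one for googlebot and one for all search engines, plus one sitemap address.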

Adding records to the robots.txt

A simple instruction asking Google to ignore the Staging folder of your website, where new versions or under-development pages are tested, would be:
User-agent: googlebot
Disallow: /staging/
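
One way to check a record like this before publishing it is Python's standard urllib.robotparser module. The sketch below feeds it the two lines above and asks which addresses may be fetched; the page names and the second crawler name are only illustrative, and this parser may not match every search engine's behavior exactly.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: googlebot
Disallow: /staging/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical pages on the example site.
staging_ok = parser.can_fetch("googlebot", "https://www.mycompany.com/staging/new-home.html")
home_ok = parser.can_fetch("googlebot", "https://www.mycompany.com/index.html")
# A crawler the record does not address is not restricted by it.
other_ok = parser.can_fetch("bingbot", "https://www.mycompany.com/staging/new-home.html")

print(staging_ok, home_ok, other_ok)
```

Here googlebot is kept out of /staging/ only; the rest of the site, and any crawler not named in the record, is unaffected.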

Getting the instructions in the robots.txt right

It is important to make sure that you get the instructions in the robots.txt file right or you can end up asking the search engines to ignore the wrong things. Below are a few examples:
Let’s say you have the following files and folders in your website:

/service/ a folder containing multiple pages that explain the services offered by the company
/service-center/ a folder where you test various aspects of the website
/service-international/ a folder where you test various aspects for implementing on your international sites
/services/index.html the first page in the /services/ folder

Now, let’s assume you wanted to give instructions to Google. The first line would specify the user agent you want to address. In this case it would read:
User-agent: googlebot

The line after this would provide instructions to googlebot. The table below shows various examples of this instruction line and its result.

Instruction in robots.txt	Result
Disallow: /service/index.html	Search engines will ignore only the index.html file in the /service/ folder but will visit, crawl and index the contents of all the other files in the folder.
Disallow: /service/	Search engines will ignore all files inside the /service/ folder including the index.html file.
Disallow: /service	Search engines will ignore every folder and file whose path starts with /service. This instruction is the same as /service*. The result will be that /service/, /service-center/, /service-international/, /services/ and all the contents of all these folders will be ignored by search engines.
Disallow: /service-	Search engines will ignore every folder and file whose path starts with /service-. The result will be that /service-center/, /service-international/ and all the contents of both these folders will be ignored by search engines. However, /service/ with its contents and /services/ with its contents will be visited, crawled and indexed.
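
The rows of the table can be reproduced with a short Python sketch built on the standard urllib.robotparser module, which matches Disallow values as path prefixes. The helper blocked_paths and the file names pricing.html and test.html are assumptions added for illustration.

```python
from urllib.robotparser import RobotFileParser

# Sample paths based on the folders listed earlier; the individual
# file names are hypothetical.
PATHS = [
    "/service/index.html",
    "/service/pricing.html",
    "/service-center/test.html",
    "/service-international/test.html",
    "/services/index.html",
]


def blocked_paths(disallow_value):
    """Return the sample paths googlebot may not fetch under one Disallow line."""
    parser = RobotFileParser()
    parser.parse(["User-agent: googlebot", f"Disallow: {disallow_value}"])
    return [path for path in PATHS if not parser.can_fetch("googlebot", path)]


for value in ["/service/index.html", "/service/", "/service", "/service-"]:
    print(f"Disallow: {value} blocks {blocked_paths(value)}")
```

Running it shows how dropping a single trailing slash widens the instruction from one folder to every path that begins with the same characters.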

As you can see from the above, a slight mistake in the instructions can result in search engines visiting or ignoring the wrong things. It is very important that you write the instructions in the robots.txt file with care. If you wanted Google only to ignore the /service-center/ folder and its contents, the lines in your robots.txt file should have been:

User-agent: googlebot
Disallow: /service-center/

If, instead of giving instructions only to Googlebot, you wanted ALL search engines to ignore this folder, the lines in the robots.txt would read as follows:
User-agent: *
Disallow: /service-center/
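
As a quick sanity check, the sketch below uses Python's standard urllib.robotparser module to confirm that a record addressed to * applies to more than one crawler. The crawler names and page names are examples only.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse(["User-agent: *", "Disallow: /service-center/"])

# Hypothetical pages; the same answer is expected for every crawler name.
results = {
    agent: (
        parser.can_fetch(agent, "/service-center/test.html"),
        parser.can_fetch(agent, "/service/index.html"),
    )
    for agent in ("googlebot", "bingbot")
}
print(results)
```

Each crawler gets the same pair of answers: /service-center/ is off limits, while /service/ remains open.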

Can I use the robots.txt to “allow”?

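All search engines assume that the entire website is available for crawling and indexing. By extension, you do not need to “allow” any content or specify which pages search engines should access. The one thing worth pointing them to explicitly is the XML sitemap. The major search engines do recognize an Allow instruction, but it is only needed to open up an exception inside a folder you have disallowed. The main purpose of the robots.txt file is to instruct search engines about which pages or folders they should ignore, such as executable files or work in progress pages. Keep in mind that the robots.txt file is publicly readable, so it should not be relied on to protect sensitive information.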
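
Although nothing has to be allowed by default, the major search engines also recognize an Allow line, which can re-open a single page inside a disallowed folder. The Python sketch below shows that case with the standard urllib.robotparser module. The file name public.html is hypothetical, and the Allow line is placed first because this particular parser applies the first rule that matches.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Allow: /service-center/public.html
Disallow: /service-center/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The single exception is reachable; the rest of the folder is not.
exception_ok = parser.can_fetch("googlebot", "/service-center/public.html")
folder_ok = parser.can_fetch("googlebot", "/service-center/test.html")
print(exception_ok, folder_ok)
```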

XML sitemaps and robots.txt file

Another important use of the robots.txt file is to point search engines to where your XML sitemaps are located. By convention, this instruction is placed AFTER all the Disallow instructions, at the end of the file. The Sitemap line is independent of the User-agent records, so search engines read it wherever it appears, and they apply the Disallow instructions to the pages listed in the sitemap regardless of the order. If your sitemap.xml file is located in your root folder, the instruction pointing to the sitemap at the end of your robots.txt file would read as follows:
Sitemap: https://www.mycompany.com/sitemap.xml
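
To confirm that a Sitemap line is being picked up, a small Python sketch with the standard urllib.robotparser module can list the sitemaps it finds. This assumes Python 3.8 or later, where the site_maps() method is available; the file contents are the examples from this article.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /service-center/
Sitemap: https://www.mycompany.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns the listed sitemap addresses, or None if there are none.
found = parser.site_maps()
print(found)
```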

Do I need a robots.txt?

Strictly speaking, if your website is a static website with just a few pages and you do not want to disallow the crawling of any of its content, not having a robots.txt file would not hurt you. However, you could be missing out on a simple opportunity to send search engines to your sitemap, even if you don’t want to disallow anything.

Things to avoid in the robots.txt

Do not disallow the crawling of the entire site. Do not use the instruction
Disallow: /
because this will result in search engines ignoring the entire contents of your website.
Avoid comments (text following a # character) in the file, because they can sometimes result in wrong instructions.
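
Both points can be checked automatically before a file goes live. The following Python sketch scans robots.txt text and reports the two problems described above; the function name find_risky_lines and the wording of its messages are invented for this example.

```python
def find_risky_lines(text):
    """Return (line number, warning) pairs for the two pitfalls above."""
    warnings = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if "#" in line:
            warnings.append((number, "contains a comment"))
        # Look at the instruction itself, ignoring any trailing comment.
        rule = line.split("#", 1)[0].strip()
        field, _, value = rule.partition(":")
        if field.strip().lower() == "disallow" and value.strip() == "/":
            warnings.append((number, "blocks the entire site"))
    return warnings


sample = """\
User-agent: *
Disallow: /
# staging rules
Disallow: /staging/
"""
problems = find_risky_lines(sample)
print(problems)
```

For the sample text it flags line 2 for blocking the whole site and line 3 for containing a comment.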

While a robots.txt file is not strictly necessary, most professional website design companies recommend using one to restrict certain pages and to point search engines to your XML sitemaps. At Flying Cow Design, we treat this as a web development best practice and build one for all our clients.

Peter Brown

CEO, Flying Cow Design