Few webmasters, and even fewer website owners, realize that their website is incomplete without a good robots.txt file. It is a little-known and often ignored fact that a good robots.txt file plays an important part in how search engines treat your site. This article explains why a robots.txt file matters if you don’t have one, and helps you improve it if you already have one.
What is robots.txt?
Before a search engine crawls your website, it normally looks for the robots.txt file first. The major search engines respect the robots.txt file and use it as a guide to the site: it tells them which pages they may visit and which pages NOT to visit, among other things. The robots.txt is a simple yet extremely powerful file and a must-have for every website.
Getting started with robots.txt
As the name suggests, a robots.txt file is a text file — not an HTML page. It is located in the “root folder” of your website. In other words, if your website is “mycompany.com”, then this file should be accessible at: https://www.mycompany.com/robots.txt.
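As a small illustration of the "root folder" rule, the sketch below derives the robots.txt address from any page address on the same site. The function name robots_url and the example page URL are assumptions made for illustration only.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; robots.txt sits at the root path.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page address on the example site.
print(robots_url("https://www.mycompany.com/services/index.html"))
# prints: https://www.mycompany.com/robots.txt
```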
Creating a robots.txt file
All you need is a text editor like Notepad to create a robots.txt file. Start a new text file with Notepad, save it as “robots.txt” and you are ready to go.
Contents of robots.txt
A robots.txt file can contain a number of lines. These lines are grouped into sets of instructions to search engines, called “records”. Each record consists of two elements: a) a User-agent line and b) one or more instruction lines. The User-agent line lets you address a specific search engine or all search engines. Through the instruction lines, you can indicate which content the search engine should ignore. The file can also state where your sitemaps are located.
Adding records to the robots.txt
Suppose your website has a staging folder where new versions of pages, or pages still under development, are tested. A simple instruction asking Google to ignore that folder would be:
User-agent: googlebot
Disallow: /staging/
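One way to check what a record like this does is to feed it to the robots.txt parser in Python's standard library (urllib.robotparser). This is only a sketch: the page names under /staging/ are made up, and real search engines use their own parsers, which may differ in details.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: googlebot
Disallow: /staging/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Hypothetical page addresses, used only for illustration.
staging_page = "https://www.mycompany.com/staging/new-home.html"
home_page = "https://www.mycompany.com/index.html"

googlebot_staging = parser.can_fetch("Googlebot", staging_page)  # record applies, folder is disallowed
googlebot_home = parser.can_fetch("Googlebot", home_page)        # record applies, page is not disallowed
bingbot_staging = parser.can_fetch("bingbot", staging_page)      # record does not address this crawler

print(googlebot_staging, googlebot_home, bingbot_staging)
# prints: False True True
```

Note that the record only addresses googlebot, so other crawlers are not affected by it.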
Getting the instructions in the robots.txt right
It is important to make sure that you get the instructions in the robots.txt file right or you can end up asking the search engines to ignore the wrong things. Below are a few examples:
Let’s say you have the following files and folders in your website:
- /service/ a folder containing multiple pages that explain the services offered by the company
- /service-center/ a folder where you test various aspects of the website
- /service-international/ a folder where you test various aspects for implementing on your international sites
- /services/index.html the first page in the service folder
Now, let’s assume you wanted to give instructions to Google. The first line would specify the user agent you want to address. In this case it would read:
User-agent: googlebot
The line after this would provide instructions to googlebot. The table below shows various examples of this instruction line and its result.
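The effect of different instruction lines on the example paths can also be checked with a short sketch using Python's standard urllib.robotparser module, which treats a Disallow value as a simple prefix of the path. The three Disallow values tried here are illustrative assumptions, not necessarily the examples the author had in mind, and the helper name blocked_paths is hypothetical.

```python
from urllib.robotparser import RobotFileParser

SITE = "https://www.mycompany.com"
PATHS = [
    "/service/",
    "/service-center/",
    "/service-international/",
    "/services/index.html",
]


def blocked_paths(disallow_value: str) -> list[str]:
    """Return the example paths that googlebot may not fetch for one Disallow value."""
    parser = RobotFileParser()
    parser.parse(["User-agent: googlebot", f"Disallow: {disallow_value}"])
    return [p for p in PATHS if not parser.can_fetch("googlebot", SITE + p)]


for value in ["/service", "/service/", "/service-center/"]:
    print(f"Disallow: {value:<18} blocks {blocked_paths(value)}")
```

With this parser, Disallow: /service blocks all four example paths because each of them starts with /service, while Disallow: /service/ blocks only the /service/ folder and Disallow: /service-center/ blocks only the /service-center/ folder.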
As you can see from the above, a slight mistake in the instructions can result in search engines visiting or ignoring the wrong things. It is very important that you write the instructions in the robots.txt file with care. If you wanted Google to ignore only the /service-center/ folder and its contents, the lines in your robots.txt file should be:
User-agent: googlebot
Disallow: /service-center/
If, instead of instructing only Googlebot, you wanted ALL search engines to ignore this folder, the lines in the robots.txt would read as follows:
User-agent: *
Disallow: /service-center/
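The difference between the two records can be sketched with Python's standard urllib.robotparser module. The helper can_crawl and the page test.html are hypothetical names used only for this comparison.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(rules: list[str], bot: str, path: str) -> bool:
    """Return True if the given crawler may fetch the path under these rules."""
    parser = RobotFileParser()
    parser.parse(rules)
    return parser.can_fetch(bot, "https://www.mycompany.com" + path)


googlebot_only = ["User-agent: googlebot", "Disallow: /service-center/"]
all_bots = ["User-agent: *", "Disallow: /service-center/"]
page = "/service-center/test.html"  # hypothetical page inside the folder

for bot in ["Googlebot", "bingbot"]:
    print(bot, can_crawl(googlebot_only, bot, page), can_crawl(all_bots, bot, page))
# prints:
# Googlebot False False
# bingbot True False
```

Under the googlebot-only record, bingbot is still free to crawl the folder; under the wildcard record, neither crawler is.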
Can I use the robots.txt to “allow”?
All search engines assume that the entire website is available for crawling and indexing unless told otherwise. It follows that you do not need to “allow” any content or specify which pages search engines should access; the XML sitemap, which lists the pages you do want found, is the one exception. The main purpose of the robots.txt file is to instruct search engines about which pages or folders they should ignore, such as executable files, work-in-progress pages or pages with sensitive information. Keep in mind that the robots.txt file is itself publicly readable and is only a request to crawlers, so it should not be the only protection for sensitive content.
XML sitemaps and robots.txt file
Another important use of the robots.txt file is to point search engines to where your XML sitemaps are located. The Sitemap instruction is not tied to any User-agent record, so it can appear anywhere in the file; a common convention is to list it AFTER all the Disallow instructions, at the end of the file. Search engines apply the Disallow instructions to the pages listed in the sitemap regardless of the order in which the lines appear. If your sitemap.xml file is located in your root folder, the instruction pointing to the sitemap at the end of your robots.txt file would read as follows:
Sitemap: https://www.mycompany.com/sitemap.xml
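To see how a program might pick up the sitemap location, here is a minimal sketch using Python's standard urllib.robotparser module (the site_maps() method needs Python 3.8 or later). The combined file contents are assembled from the examples in this article for illustration.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /service-center/

Sitemap: https://www.mycompany.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# site_maps() returns a list of sitemap URLs, or None if the file lists none.
sitemaps = parser.site_maps()
print(sitemaps)
# prints: ['https://www.mycompany.com/sitemap.xml']
```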
Do I need a robots.txt?
Strictly speaking, if your website is a static website with just a few pages and you do not want to disallow the crawling of any of its content, not having a robots.txt file would not hurt you. However, you could be missing out on a simple opportunity to send search engines to your sitemap, even if you don’t want to disallow anything.
Things to avoid in the robots.txt
- Do not disallow the crawling of the entire site. Do not use the instruction Disallow: / because it will result in search engines ignoring the entire contents of your website.
- Don’t use comments in the file because they can sometimes result in wrong instructions.
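A quick safety check for the first point can be sketched with Python's standard urllib.robotparser module: if the home page itself may not be fetched, the file is blocking the whole site. The function name blocks_entire_site is hypothetical, and the check reflects how this particular parser reads the file.

```python
from urllib.robotparser import RobotFileParser


def blocks_entire_site(robots_txt: str, user_agent: str = "*") -> bool:
    """Return True if the rules stop the given crawler from fetching the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "https://www.mycompany.com/")


risky = "User-agent: *\nDisallow: /\n"
safe = "User-agent: *\nDisallow: /service-center/\n"

print(blocks_entire_site(risky), blocks_entire_site(safe))
# prints: True False
```

Running a check like this before uploading a new robots.txt file is one way to catch an accidental Disallow: / line.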
While a robots.txt file is not strictly necessary, most professional website design companies recommend using it to restrict certain pages and to point search engines to your XML sitemaps. At Flying Cow Design, we treat this as a web development best practice and build it for all our clients.
CEO, Flying Cow Design