If you run a website, you’ve probably heard about a robots.txt file (or the “robots exclusion standard”). Whether you have or not, it’s time to learn about it, because this simple text file is a crucial part of your site. It might seem insignificant, but you might be surprised at just how important it is.
Let’s take a look at what a robots.txt file is, what it does, and how to correctly set it up for your site.
What Is a robots.txt File?
To understand how a robots.txt file works, you need to know a bit about search engines. The short version is that they send out “crawlers,” which are programs that scour the internet for information. They then store some of that information so they can direct people to it later.
These crawlers, also known as “bots” or “spiders,” find pages from billions of websites. Search engines give them directions on where to go, but individual websites can also communicate with the bots and tell them which pages they should be looking at.
Most of the time, they’re actually doing the opposite, and telling them which pages they shouldn’t be looking at: administrative pages, backend portals, category and tag pages, and anything else that site owners don’t want displayed on search engines. These pages are still visible to users, and they’re accessible to anyone who has permission (which is often everyone).
But by telling those spiders not to index some pages, the robots.txt file does everyone a favor. If you searched for “MakeUseOf” on a search engine, would you want our administrative pages showing up high in the rankings? No. That wouldn’t do anyone any good, so we tell search engines not to display them. It can also be used to keep search engines from checking out pages that might not help them classify your site in search results.
In short, robots.txt tells web crawlers what to do.
Can Crawlers Ignore robots.txt?
Do crawlers ever ignore robots.txt files? Yes. In fact, many crawlers do ignore them. Generally, however, those crawlers aren’t from reputable search engines. They’re from spammers, email harvesters, and other types of automated bots that roam the internet. It’s important to keep this in mind: using the robots exclusion standard to tell bots to keep out isn’t an effective security measure. In fact, some bots might start with the pages you tell them not to go to.
Search engines, however, will do as your robots.txt file says as long as it’s formatted correctly.
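Obeying robots.txt is voluntary because the check happens inside the crawler’s own code. As a rough sketch of what a well-behaved bot might do before requesting a page, here’s one way to run that check with Python’s standard urllib.robotparser module. The bot name ExampleBot, the rules, and the URLs are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules; a real crawler would download these from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""

def polite_filter(robots_txt, bot_name, urls):
    """Return only the URLs that the rules allow bot_name to crawl."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(bot_name, url)]

urls = [
    "https://yoursite.com/",
    "https://yoursite.com/admin/login",
]
crawlable = polite_filter(ROBOTS_TXT, "ExampleBot", urls)
print(crawlable)
```

A bot that skips this step can still request the admin page, which is why the file is no substitute for real access control.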
How to Write a robots.txt File
There are a few different parts that go into a robots exclusion standard file. I’ll break each one down here.
User Agent Declaration
Before you tell a bot which pages it shouldn’t look at, you have to specify which bot you’re talking to. Most of the time, you’ll use a simple declaration that means “all bots.” That looks like this:
User-agent: *
The asterisk stands in for “all bots.” You could, however, specify pages for certain bots. To do that, you’ll need to know the name of the bot you’re laying out guidelines for. That might look like this:
User-agent: Googlebot
[list of pages not to crawl]
User-agent: Googlebot-Image/1.0
[list of pages not to crawl]
User-agent: Bingbot
[list of pages not to crawl]
And so on. If you discover a bot that you don’t want crawling your site at all, you can specify that, too.
To find the names of user agents, check out useragentstring.com.
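If you’re looking at an existing file and want to see which bots it addresses, a few lines of Python can pull out the user agent declarations. This is only a sketch: the helper name declared_user_agents and the sample rules are assumptions made for this example, not part of any standard tool.

```python
def declared_user_agents(robots_txt):
    """Collect the bot names from every User-agent line, in order."""
    agents = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "user-agent":
            agents.append(value.strip())
    return agents

# Hypothetical file with one section per bot and a catch-all section.
SAMPLE = """\
User-agent: Googlebot
Disallow: /private/
User-agent: Bingbot
Disallow: /private/
User-agent: *
Disallow: /admin/
"""

print(declared_user_agents(SAMPLE))
```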
The disallow declaration is the main part of your robots exclusion file. With a simple declaration, you tell a bot or group of bots not to crawl certain pages. The syntax is easy. Here’s how you’d disallow access to everything in the “admin” directory of your site:
Disallow: /admin/
That line would keep bots from crawling yoursite.com/admin/, yoursite.com/admin/login, yoursite.com/admin/files/secret.html, and anything else that falls under the admin directory.
To disallow a single page, just specify it in the disallow line:
Now the “exception” page won’t be crawled, but everything else in the “public” folder will.
To include multiple directories or pages, just list them on subsequent lines:
Disallow: /private/
Disallow: /admin/
Disallow: /cgi-bin/
Disallow: /temp/
Those four lines will apply to whichever user agent you specified at the top of the section.
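If you’d like to check how a set of disallow lines behaves before you publish them, you can feed them to Python’s standard urllib.robotparser module. The sketch below assumes a directory rule for “admin” and a single-page rule for a file called exception.html in “public”. That file name, the other paths, and the ExampleBot name are all assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one whole directory and one single page.
RULES = """\
User-agent: *
Disallow: /admin/
Disallow: /public/exception.html
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

def blocked(path):
    """True if the sample rules stop a generic bot from crawling path."""
    return not parser.can_fetch("ExampleBot", "https://yoursite.com" + path)

for path in ["/admin/login", "/admin/files/secret.html",
             "/public/exception.html", "/public/index.html"]:
    print(path, "blocked" if blocked(path) else "allowed")
```

This parser treats each rule as a path prefix, which is why one line is enough to cover everything under a directory.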
If you want to keep bots from looking at any page on your site, use this:
Disallow: /
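The same idea works for shutting out a single bot while leaving the site open to everyone else. Here’s a small sketch, again using Python’s urllib.robotparser. BadBot and ExampleBot are invented names, and the rules are only an example of how such a file might look.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: BadBot is shut out entirely, other bots only skip /admin/.
RULES = """\
User-agent: BadBot
Disallow: /
User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

home = "https://yoursite.com/"
print("BadBot may crawl the home page:", parser.can_fetch("BadBot", home))
print("ExampleBot may crawl the home page:", parser.can_fetch("ExampleBot", home))
```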
Setting Different Standards for Bots
As we saw above, you can specify certain pages for different bots. Combining the previous two elements, here’s what that looks like:
User-agent: googlebot
Disallow: /admin/
Disallow: /private/
User-agent: bingbot
Disallow: /admin/
Disallow: /private/
Disallow: /secret/
The “admin” and “private” sections will be invisible on Google and Bing, but Google will see the “secret” directory, while Bing won’t.
You can specify general rules for all bots by using the asterisk user agent, and then give specific instructions to bots in subsequent sections, too.
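To confirm that the googlebot and bingbot sections above do what you expect, you can run them through Python’s urllib.robotparser and ask about each directory in turn. This is a quick sanity check rather than a guarantee of how any particular search engine behaves. The report helper is a name invented for this sketch.

```python
from urllib.robotparser import RobotFileParser

# The per-bot rules from the example above.
RULES = """\
User-agent: googlebot
Disallow: /admin/
Disallow: /private/
User-agent: bingbot
Disallow: /admin/
Disallow: /private/
Disallow: /secret/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

def report(bot):
    """Map each sample directory to whether bot may crawl it."""
    dirs = ["/admin/", "/private/", "/secret/"]
    return {d: parser.can_fetch(bot, "https://yoursite.com" + d) for d in dirs}

print("googlebot:", report("googlebot"))
print("bingbot:", report("bingbot"))
```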
Putting It All Together
With the knowledge above, you can write a complete robots.txt file. Just fire up your favorite text editor (we’re fans of Sublime around here) and start letting bots know that they’re not welcome in certain parts of your site.
If you’d like to see an example of a robots.txt file, just head to any site and add “/robots.txt” to the end. Here’s part of the Giant Bicycles robots.txt file:
As you can see, there are quite a few pages that they don’t want showing up on search engines. They’ve also included a few things we haven’t talked about yet. Let’s take a look at what else you can do in your robot exclusion file.
Locating Your Sitemap
If your robots.txt file tells bots where not to go, your sitemap does the opposite, and helps them find what they’re looking for. And while search engines probably already know where your sitemap is, it doesn’t hurt to let them know again.
The declaration for a sitemap location is simple:
Sitemap: [URL of sitemap]
In our own robots.txt file, it looks like this:
That’s all there is to it.
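If you want to double-check that a sitemap line can be picked up, Python’s urllib.robotparser can list the sitemaps a file declares (the site_maps() method needs Python 3.8 or later). The sitemap URL below is a placeholder, not a real address.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file with a placeholder sitemap URL.
RULES = """\
User-agent: *
Disallow: /admin/
Sitemap: https://yoursite.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# site_maps() returns a list of declared sitemap URLs, or None if there are none.
sitemaps = parser.site_maps()
print(sitemaps)
```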
Setting a Crawl Delay
The crawl delay directive tells certain search engines how often they can request pages from your site. It’s measured in seconds, though some search engines interpret it slightly differently. Some see a crawl delay of 5 as telling them to wait five seconds after every crawl to initiate the next one. Others interpret it as an instruction to only crawl one page every five seconds.
Why would you tell a crawler not to crawl as much as possible? To preserve bandwidth. If your server is struggling to keep up with traffic, you may want to institute a crawl delay. In general, most people don’t have to worry about this. Large high-traffic sites, however, may want to experiment a bit.
Here’s how you set a crawl delay of eight seconds:
Crawl-delay: 8
That’s it. Not all search engines will obey your directive, but it doesn’t hurt to ask. As with disallowing pages, you can set different crawl delays for specific search engines.
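As a sketch of a per-bot crawl delay, the example below gives bingbot an eight-second delay and leaves every other bot without one, then reads the values back with Python’s urllib.robotparser. The rules and the ExampleBot name are assumptions for illustration. Whether a real search engine honors the value is up to that search engine.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: only bingbot is asked to slow down.
RULES = """\
User-agent: bingbot
Crawl-delay: 8
Disallow: /admin/
User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# crawl_delay() returns the delay in seconds, or None if no delay is set.
print("bingbot:", parser.crawl_delay("bingbot"))
print("ExampleBot:", parser.crawl_delay("ExampleBot"))
```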
Uploading Your robots.txt File
Once you have all of the instructions in your file set up, make sure it’s a plain text file with the name robots.txt. Then upload it to your site so it can be found at yoursite.com/robots.txt.
If you use a content management system like WordPress, there’s probably a specific way you’ll need to go about this. Because it differs in each content management system, you’ll need to consult the documentation for your system.
Some systems may have online interfaces for uploading your file, as well. For these, just copy and paste the file you created in the previous steps.
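If you generate the file with a script instead of a text editor, the two things to get right are the same: plain text named robots.txt, served from the root of the site. Here’s a minimal sketch of both checks. The helper names write_robots and robots_url are invented for this example, and the temporary directory is only a stand-in for your real web root.

```python
import tempfile
from pathlib import Path
from urllib.parse import urljoin

def write_robots(directory, lines):
    """Save the rules as a plain text file named robots.txt."""
    path = Path(directory) / "robots.txt"
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path

def robots_url(any_page_url):
    """Return the root-level address where crawlers look for the file."""
    return urljoin(any_page_url, "/robots.txt")

# The temporary directory stands in for your real web root.
with tempfile.TemporaryDirectory() as webroot:
    saved = write_robots(webroot, ["User-agent: *", "Disallow: /admin/"])
    saved_name = saved.name
    saved_text = saved.read_text(encoding="utf-8")

print(saved_name)
print(robots_url("https://yoursite.com/blog/some-post"))
```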
Remember to Update Your File
The last piece of advice I’ll give is to occasionally look over your robot exclusion file. Your site changes, and you may need to make some adjustments. If you notice a strange change in your search engine traffic, it’s a good idea to check out the file, too. It’s also possible that the standard notation could change in the future. Like everything else on your site, it’s worth checking up on it every once in a while.
Which pages do you exclude crawlers from on your site? Have you noticed any difference in search engine traffic? Share your advice and comments below!