How to Correctly Set Up Robots.txt for Your Site

If you run a website, you’ve probably heard about a robots.txt file (or the “robots exclusion standard”). Whether you have or not, it’s time to learn about it, because this simple text file is a crucial part of your site. It might seem insignificant, but you might be surprised at just how important it is.

Let’s take a look at what a robots.txt file is, what it does, and how to correctly set it up for your site.

What Is a robots.txt File?

To understand how a robots.txt file works, you need to know a bit about search engines. The short version is that they send out “crawlers,” which are programs that scour the internet for information. They then store some of that information so they can direct people to it later.

These crawlers, also known as “bots” or “spiders,” find pages from billions of websites. Search engines give them directions on where to go, but individual websites can also communicate with the bots and tell them which pages they should be looking at.

Most of the time, though, websites do the opposite and tell bots which pages they shouldn’t be looking at: administrative pages, backend portals, category and tag pages, and other things that site owners don’t want displayed on search engines. These pages are still visible to users, and they’re accessible to anyone who has permission (which is often everyone).

But by telling those spiders not to index some pages, the robots.txt file does everyone a favor. If you searched for “MakeUseOf” on a search engine, would you want our administrative pages showing up high in the rankings? No. That wouldn’t do anyone any good, so we tell search engines not to display them. It can also be used to keep search engines from checking out pages that might not help them classify your site in search results.

In short, robots.txt tells web crawlers what to do.

Can Crawlers Ignore robots.txt?

Do crawlers ever ignore robots.txt files? Yes. In fact, many crawlers do ignore them. Generally, however, those crawlers aren’t from reputable search engines. They’re from spammers, email harvesters, and other types of automated bots that roam the internet. It’s important to keep this in mind: using the robots exclusion standard to tell bots to keep out isn’t an effective security measure. In fact, some bots might start with the pages you tell them not to go to.

Search engines, however, will do as your robots.txt file says as long as it’s formatted correctly.

How to Write a robots.txt File

There are a few different parts that go into a robot exclusion standard file. I’ll break them each down individually here.

User Agent Declaration

Before you tell a bot which pages it shouldn’t look at, you have to specify which bot you’re talking to. Most of the time, you’ll use a simple declaration that means “all bots.” That looks like this:

User-agent: *

The asterisk stands in for “all bots.” You could, however, specify pages for certain bots. To do that, you’ll need to know the name of the bot you’re laying out guidelines for. That might look like this:

User-agent: Googlebot
[list of pages not to crawl]
User-agent: Googlebot-Image
[list of pages not to crawl]
User-agent: Bingbot
[list of pages not to crawl]

And so on. If you discover a bot that you don’t want crawling your site at all, you can specify that, too.

To find the names of user agents, check out useragentstring.com.

Disallowing Pages

This is the main part of your robot exclusion file. With a simple declaration, you tell a bot or group of bots not to crawl certain pages. The syntax is easy. Here’s how you’d disallow access to everything in the “admin” directory of your site:

Disallow: /admin/

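That line would keep bots from crawling yoursite.com/admin/, yoursite.com/admin/login, yoursite.com/admin/files/secret.html, and anything else that falls under the admin directory. Rules match by path prefix, so the bare URL yoursite.com/admin (no trailing slash) isn’t covered by this line.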
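To check how a rule like this behaves, you can feed it to a parser. Here’s a minimal sketch using Python’s standard urllib.robotparser module. The bot name ExampleBot and the /blog/ path are made up for illustration, and real crawlers may treat edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The rule from the text, applied to all bots.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" is a hypothetical crawler name; it falls under the "*" group.
paths = [
    "/admin/login",
    "/admin/files/secret.html",
    "/admin",   # no trailing slash, so the "/admin/" prefix does not match
    "/blog/",   # hypothetical page outside the admin directory
]
results = {path: parser.can_fetch("ExampleBot", path) for path in paths}

for path, allowed in results.items():
    print(f"{path}: {'allowed' if allowed else 'blocked'}")
```

The first two paths come back blocked and the last two allowed, which is the prefix matching at work.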

To disallow a single page, just specify it in the disallow line:

Disallow: /public/exception.html

Now the “exception” page won’t be crawled, but everything else in the “public” folder will.

To include multiple directories or pages, just list them on subsequent lines:

Disallow: /private/
Disallow: /admin/
Disallow: /cgi-bin/
Disallow: /temp/

Those four lines will apply to whichever user agent you specified at the top of the section.

If you want to keep bots from looking at any page on your site, use this:

Disallow: /
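The variations above can be compared side by side. This sketch assumes Python’s standard urllib.robotparser module and a hypothetical bot called ExampleBot; the helper name blocked_paths and the sample paths are illustrative only.

```python
from urllib.robotparser import RobotFileParser

def blocked_paths(disallow_rules, paths, agent="ExampleBot"):
    """Return the subset of paths blocked by the given Disallow rules."""
    # Build a one-group robots.txt that applies to every bot.
    lines = ["User-agent: *"] + [f"Disallow: {rule}" for rule in disallow_rules]
    parser = RobotFileParser()
    parser.parse(lines)
    return [p for p in paths if not parser.can_fetch(agent, p)]

# Hypothetical URLs on the site, used only to exercise the rules.
SAMPLE_PATHS = [
    "/",
    "/public/exception.html",
    "/public/other.html",
    "/temp/cache.txt",
]

single_page = blocked_paths(["/public/exception.html"], SAMPLE_PATHS)
several_dirs = blocked_paths(["/private/", "/admin/", "/cgi-bin/", "/temp/"], SAMPLE_PATHS)
whole_site = blocked_paths(["/"], SAMPLE_PATHS)

print("single page: ", single_page)
print("several dirs:", several_dirs)
print("whole site:  ", whole_site)
```

Only the exception page is blocked in the first case, only the /temp/ file in the second, and every path in the third.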

Setting Different Standards for Bots

As we saw above, you can specify certain pages for different bots. Combining the previous two elements, here’s what that looks like:

User-agent: googlebot
Disallow: /admin/
Disallow: /private/

User-agent: bingbot
Disallow: /admin/
Disallow: /private/
Disallow: /secret/

The “admin” and “private” sections will be invisible on Google and Bing, but Google will see the “secret” directory, while Bing won’t.

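You can specify general rules for all bots by using the asterisk user agent, and then give specific instructions to bots in subsequent sections, too. Keep in mind that a bot with its own section generally follows only that section and ignores the asterisk rules, so repeat any general rules you still want it to obey.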
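Here’s a rough way to see per-bot sections in action, again using Python’s standard urllib.robotparser module. The asterisk section with /temp/ and the name OtherBot are assumptions added for the example. In this parser, a bot that has its own section is checked against that section only, and the asterisk section is used just for bots without one.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /temp/

User-agent: googlebot
Disallow: /admin/
Disallow: /private/

User-agent: bingbot
Disallow: /admin/
Disallow: /private/
Disallow: /secret/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(agent, path):
    """True if this parser would let the named bot crawl the path."""
    return parser.can_fetch(agent, path)

# Google may crawl /secret/, Bing may not.
print("Googlebot /secret/:", allowed("Googlebot", "/secret/plans.html"))
print("Bingbot   /secret/:", allowed("Bingbot", "/secret/plans.html"))

# Googlebot has its own section, so the "*" rule for /temp/ is not applied to it.
print("Googlebot /temp/:  ", allowed("Googlebot", "/temp/cache.txt"))

# A bot without its own section (hypothetical "OtherBot") falls back to "*".
print("OtherBot  /temp/:  ", allowed("OtherBot", "/temp/cache.txt"))
```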

Putting It All Together

With the knowledge above, you can write a complete robots.txt file. Just fire up your favorite text editor (we’re fans of Sublime around here) and start letting bots know that they’re not welcome in certain parts of your site.
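If you’d rather generate the file than type it by hand, something like the following Python sketch would do. The function name render_robots_txt and the sample rules are hypothetical; it only covers the user agent and disallow lines described so far.

```python
def render_robots_txt(groups):
    """Build robots.txt text from a mapping of user agent -> disallowed paths."""
    lines = []
    for agent, paths in groups.items():
        lines.append(f"User-agent: {agent}")
        lines.extend(f"Disallow: {path}" for path in paths)
        lines.append("")  # blank line separates one bot's section from the next
    # Drop the trailing blank line and end the file with a single newline.
    return "\n".join(lines).rstrip("\n") + "\n"

# Example rules; the directories are placeholders for your own.
robots_txt = render_robots_txt({
    "*": ["/admin/", "/private/"],
    "bingbot": ["/admin/", "/private/", "/secret/"],
})
print(robots_txt, end="")
```

You could then save the returned text as a plain text file named robots.txt.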

If you’d like to see an example of a robots.txt file, just head to any site and add “/robots.txt” to the end. Here’s part of the Giant Bicycles robots.txt file:

[Image: part of the Giant Bicycles robots.txt file]

As you can see, there are quite a few pages that they don’t want showing up on search engines. They’ve also included a few things we haven’t talked about yet. Let’s take a look at what else you can do in your robot exclusion file.

Locating Your Sitemap

If your robots.txt file tells bots where not to go, your sitemap does the opposite, and helps them find what they’re looking for. And while search engines probably already know where your sitemap is, it doesn’t hurt to let them know again.

The declaration for a sitemap location is simple:

Sitemap: [URL of sitemap]

That’s it.

In our own robots.txt file, it looks like this:

Sitemap: https://www.makeuseof.com/sitemap_index.xml

That’s all there is to it.
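To confirm that a sitemap line is being picked up, you can read it back with a parser. This is a small sketch using Python’s standard urllib.robotparser module (the site_maps() method needs Python 3.8 or later). The example.com address is a placeholder, not a real sitemap.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical file: one rule plus a sitemap declaration with a full URL.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

Sitemap: https://www.example.com/sitemap_index.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns None when no Sitemap line is present, so default to [].
sitemaps = parser.site_maps() or []
for url in sitemaps:
    print("Sitemap found:", url)
```

The sitemap line isn’t tied to any user agent section, so it can sit anywhere in the file.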

Setting a Crawl Delay

The crawl delay directive tells certain search engines how often they can index a page on your site. It’s measured in seconds, though some search engines interpret it slightly differently. Some see a crawl delay of 5 as telling them to wait five seconds after every crawl to initiate the next one. Others interpret it as an instruction to only crawl one page every five seconds.

Why would you tell a crawler not to crawl as much as possible? To preserve bandwidth. If your server is struggling to keep up with traffic, you may want to institute a crawl delay. In general, most people don’t have to worry about this. Large high-traffic sites, however, may want to experiment a bit.

Here’s how you set a crawl delay of eight seconds:

Crawl-delay: 8

That’s it. Not all search engines will obey your directive; Google, for example, ignores Crawl-delay entirely. But it doesn’t hurt to ask. As with disallowing pages, you can set different crawl delays for specific search engines.
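As a quick illustration of per-bot delays, here’s a sketch that reads them back with Python’s standard urllib.robotparser module. The five-second value for bingbot and the name ExampleBot are assumptions for the example; how a real crawler applies the number is up to that crawler.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 8

User-agent: bingbot
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A bot with its own section gets that section's delay.
bing_delay = parser.crawl_delay("Bingbot")

# Any other bot (hypothetical "ExampleBot") falls back to the "*" section.
default_delay = parser.crawl_delay("ExampleBot")

print("Bingbot delay:", bing_delay, "seconds")
print("Default delay:", default_delay, "seconds")
```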

Uploading Your robots.txt File

Once you have all of the instructions in your file set up, you can upload it to your site. Make sure it’s a plain text file named robots.txt, and place it in your site’s root directory so it can be found at yoursite.com/robots.txt. Crawlers only look for the file there, not in subfolders.
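If you’re not sure where a crawler will look, the address can be worked out from any page on the site. A minimal Python sketch, with a hypothetical helper name and example URLs:

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url):
    """Return the robots.txt address for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

location = robots_txt_url("https://yoursite.com/blog/post.html?id=3")
print(location)
```

Opening that address in a browser after uploading is an easy way to confirm the file is in the right place.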

If you use a content management system like WordPress, there’s probably a specific way you’ll need to go about this. Because it differs in each content management system, you’ll need to consult the documentation for your system.

Some systems may have online interfaces for uploading your file, as well. For these, just copy and paste the file you created in the previous steps.

Remember to Update Your File

The last piece of advice I’ll give is to occasionally look over your robot exclusion file. Your site changes, and you may need to make some adjustments. If you notice a strange change in your search engine traffic, it’s a good idea to check out the file, too. It’s also possible that the standard notation could change in the future. Like everything else on your site, it’s worth checking up on it every once in a while.

Which pages do you exclude crawlers from on your site? Have you noticed any difference in search engine traffic? Share your advice and comments below!
