Are you a believer in the idea that once something is published on the Internet, it’s published forever? Well, today we’re going to dispel that myth.

The truth is that in many cases it’s quite possible to eradicate information from the Internet. Sure, there’s a record of web pages that have been deleted if you search the Wayback Machine, right? Yup, absolutely. On the Wayback Machine there are records of web pages going back many years — pages that you won’t find with a Google search because the web page no longer exists. Someone deleted it, or the website got shut down.

So, there’s no getting around it, right? Information will forever be engraved into the stone of the Internet, there for generations to see? Well, not exactly.

The truth is that while it might be difficult or impossible to wipe out major news stories that have proliferated from one news website or blog to another like a virus, it is actually quite easy to completely eradicate a web page or several web pages from all records of existence — to remove that page from both search engines and the Wayback Machine. There is a catch of course, but we’ll get to that.

3 Ways to Remove Blog Pages From the Net

The first method is the one that the majority of website owners use, because they don’t know any better — simply deleting web pages. This might happen because you’ve realized that you have duplicate content on your site, or because you have a page that you don’t want to show up in search results.

Simply Delete the Page

The problem with entirely deleting pages from your website is that since you’ve already established the page on the net, there are likely to be links to it from your own site as well as external links from other sites. When you delete it, Google’s crawler will soon start flagging that page as missing.


So, in deleting your page you’ve not only created an issue with “Not found” crawl errors for yourself, but you’ve also created a problem for anyone who ever linked to the page. Usually, users who get to your site from one of those external links will see your 404 page, which isn’t a major problem if you use something like Google’s custom 404 code to give users helpful suggestions or alternatives.
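To give you an idea, here’s a minimal sketch of what a helpful custom 404 page might look like. The headings and links are placeholders rather than anything prescribed by Google, so adapt them to your own site.

<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>Page Not Found</title>
</head>
<body>
  <h1>Sorry, that page is gone.</h1>
  <p>The article you were looking for may have been moved or removed.</p>
  <p><a href="/">Return to the homepage</a> or <a href="/archives/">browse the archives</a>.</p>
</body>
</html>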

But you’d think there could be more graceful ways of deleting pages from search results without kicking off all of those 404s for existing incoming links, right? Well, there are.

Remove a Page From Google Search Results

First of all, you should understand that if the web page you want to remove from Google search results isn’t a page from your own site, then you’re out of luck unless there are legal reasons or the site has posted your personal information online without your permission. If that’s the case, use Google’s removal troubleshooter to submit a request to have the page removed from search results. If you have a valid case, you may find some success having the page removed — of course, you might have even greater success just contacting the website owner, as I described how to do back in 2009.

Now, if the page you want to remove from search results is on your own site, you’re in luck.  All you need to do is create a robots.txt file and make sure that you’ve disallowed either the specific page you don’t want in the search results, or the entire directory with the contents that you don’t want indexed. Here’s what blocking a single page looks like.

User-agent: *
Disallow: /my-deleted-article-that-i-want-removed.html

You can block bots from crawling entire directories of your site as follows.

User-agent: *
Disallow: /content-about-personal-stuff/

Google has an excellent support page that can help you create a robots.txt file if you’ve never made one before. This works extremely well, as I explained recently in an article about structuring syndication deals so that they don’t hurt your search rankings (by asking syndication partners to disallow indexing of the pages where your content is syndicated). Once my own syndication partner agreed to do this, the pages that duplicated content from my blog completely disappeared from search listings.
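If you’re curious what that looks like on the syndication partner’s end, it’s just an ordinary robots.txt block. The directory name below is hypothetical, since every site organizes its syndicated content differently.

User-agent: *
Disallow: /syndicated-content/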


The partner’s main website now comes up only in third place for the page where they list our title, while my blog is listed in both the first and second spots; something that would have been nearly impossible had the higher-authority website left the duplicated page indexed.

What many people don’t realize is that the same thing is possible to accomplish with the Internet Archive (the Wayback Machine). Here are the lines you need to add to your robots.txt file to make it happen.

User-agent: ia_archiver
Disallow: /sample-category/

In this example, I’m telling the Internet Archive to remove anything in the sample-category subdirectory on my site from the Wayback Machine. The Internet Archive explains how to do this on their Exclusion help page. This is also where they explain that “The Internet Archive is not interested in offering access to web sites or other Internet documents whose authors do not want their materials in the collection.”
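You can also combine the two approaches in a single robots.txt file to keep a directory out of both Google’s index and the Wayback Machine. One detail worth knowing: a crawler obeys the most specific User-agent group that matches it, so the Disallow line is deliberately repeated under ia_archiver. (The directory name here is just a sample.)

User-agent: *
Disallow: /sample-category/

User-agent: ia_archiver
Disallow: /sample-category/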

This flies in the face of the commonly-held belief that anything posted to the Internet gets swept up into the archive for all eternity. Nope – webmasters who own the content can specifically have it removed from the archive by using the robots.txt approach.

Remove an Individual Page With Meta Tags

If you only have a few individual pages that you want to remove from Google search results, you don’t have to use the robots.txt approach at all. Instead, you can simply add the correct “robots” meta tag to the individual pages and tell robots not to index the page or follow any of its links.

<meta name="robots" content="noindex, nofollow" />

You could use the “robots” meta tag above to stop all robots from indexing the page, or you could specifically tell Google’s robot not to index it, so the page is removed only from Google search results while other search robots can still access the page content.

<meta name="googlebot" content="noindex" />

It’s completely up to you how you’d like to manage what robots do with the page and whether or not the page gets listed. For just a few individual pages, this may be the better approach. To remove an entire directory of content, go with the robots.txt method.

The Idea of “Removing” Content

This sort of turns the whole notion of “deleting content from the Internet” on its head. Technically, if you remove all of your own links to a page on your site, and you remove it from Google Search and the Internet Archive using the robots.txt technique, the page is for all intents and purposes “deleted” from the Internet. The cool thing, though, is that any existing links to the page will still work, and visitors who follow them won’t trigger 404 errors.

It’s a more “gentle” approach to removing content from the Internet without entirely messing up your site’s existing link popularity throughout the Internet. In the end, how you go about managing what content gets collected by search engines and the Internet Archive is up to you, but always remember that despite what people say about the lifespan of things that get posted online, it really is completely within your control.

  1. mohit
    April 23, 2015 at 7:07 pm

    I want to remove a blog page from Google and I don't have the login details, so can you let me know if I can still do that.

    Regards

  2. Guest
    January 24, 2014 at 1:46 am

    Ryan, does this work with Google Blogger as well?

    • Ryan Dube
      January 24, 2014 at 1:48 am

      If you can edit the header section, then yes. I haven't used Blogger in ages, so not sure what the current state of the editor there is these days....

  3. George Iov
    January 8, 2014 at 4:27 pm

    Thanks for focusing on this problem. For me, the bottom line is that we should think twice before publishing content online. Inbound links are the problem!

  4. Ryan Dube
    January 5, 2014 at 5:24 pm

    Hey Henk - that's unfortunate. Didn't realize you can't edit the header section of a page when hosting on WordPress.com...that's kind of pathetic actually. I would highly recommend migrating your blog to a self-hosted platform so you're not so limited. It's really not all that hard....

    You're right though, if you can't edit the header section of the web page, then you're out of luck when it comes to removing your own page from Google search results.

  5. Henk van Setten
    January 4, 2014 at 12:44 pm

    Problem is, solutions 2 and 3 won't work at all for people like me who have a blog hosted at WordPress.com.
    You see, WordPress.com doesn't allow this kind of editing of robots.txt, nor does it allow setting up meta tags for specific pages within the blog.
    It's a pity!

    • Ryan Dube
      January 4, 2014 at 5:54 pm

      Solution #3 (removing an individual page with meta tags) will work for you, since you only need to add that meta tag to the HTML of the page that you want to remove from Google.

    • Henk van Setten
      January 4, 2014 at 9:54 pm

      I'm sorry Ryan, but I fear you really are wrong here. I have a blog (stayontop.org) that is hosted on wordpress.com and uses the premium (paid) wordpress maintenance and customization facilities.

      The edit box for individual pages (part of the dashboard pages) only allows you to enter/edit code that falls within the BODY tags, not the full HTML such as the headers and style tags that come before the BODY tag. That first part of the HTML is always generated automatically by WordPress, uniformly for all pages (though partly based on user-set preferences), and there is no way to edit it after the fact for one particular page. So it's just not possible to insert something like an extra header line (just as it isn't possible to add JavaScript, by the way, but that's another subject).

      If you know of a way to circumvent this WordPress limitation, I certainly would be happy to hear how it's done!

  6. Tom S
    January 4, 2014 at 6:04 am

    This is very helpful and useful information.
