How To Build A Basic Web Crawler To Pull Information From A Website (Part 1)

Web crawlers, sometimes called scrapers, automatically scan the Internet, attempting to glean the context and meaning of the content they find. The web wouldn’t function without them. Crawlers are the backbone of search engines which, combined with clever algorithms, work out the relevance of your page to a given keyword set.

The Google web crawler will enter your domain and scan every page of your website, extracting page titles, descriptions, keywords, and links, then report back to Google HQ and add the information to their huge database.

Today, I’d like to teach you how to make your own basic crawler – not one that scans the whole Internet, though, but one that is able to extract all the links from a given webpage.


Generally, you should make sure you have permission before scraping random websites, as most people consider it to be a very grey legal area. Still, as I say, the web wouldn’t function without these kinds of crawlers, so it’s important you understand how they work and how easy they are to make.

To make a simple crawler, we’ll be using one of the most common programming languages on the web: PHP. Don’t worry if you’ve never programmed in PHP. I’ll be taking you through each step and explaining what each part does. I am going to assume an absolute basic knowledge of HTML though, enough that you understand how a link or image is added to an HTML document.

Before we start, you will need a server that can run PHP, whether that is a web host or a setup on your own computer.

Getting Started

We’ll be using a helper class called Simple HTML DOM. Download this zip file, unzip it, and upload the simple_html_dom.php file contained within to your website first (in the same directory you’ll be running your programs from). It contains functions we will be using to traverse the elements of a webpage more easily. That zip file also contains today’s example code.

First, let’s write a simple program that will check if PHP is working or not. We’ll also import the helper file we’ll be using later. Make a new file in your web directory, and call it example1.php – the actual name isn’t important, but the .php ending is. Copy and paste this code into it:

    <?php
    include_once('simple_html_dom.php');
    phpinfo();
    ?>

Access the file through your web browser. If everything has gone right, you should see a big page of debug and server information printed out, all from that one little line of code! It’s not really what we’re after, but at least we know everything is working.

The first and last lines simply tell the server we are going to be using PHP code. This is important because we can actually include standard HTML on the page too, and it will render just fine. The second line pulls in the Simple HTML DOM helper we will be using. The phpinfo(); line is the one that printed out all that debug info, but you can go ahead and delete that now. Notice that in PHP, any commands we have must be finished with a semicolon (;). The most common mistake of any PHP beginner is to forget that little bit of punctuation.
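To see how HTML and PHP can share one file, here is a minimal sketch. The $greeting variable and its text are made up purely for illustration; they are not part of the downloaded example code.

```php
<html>
<body>
<h1>This heading is plain HTML</h1>
<?php
// Everything between the opening and closing PHP tags is run by the server.
// Note that every statement ends with a semicolon.
$greeting = "Hello from PHP"; // hypothetical variable, for illustration only
echo "<p>" . $greeting . "</p>";
?>
<p>This paragraph is plain HTML again.</p>
</body>
</html>
```

The browser only ever receives the finished HTML. The PHP part is replaced by whatever it echoes.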

One typical task that Google performs is to pull all the links from a page and see which sites they are endorsing. Try the following code next, in a new file if you like.

    <?php
    include_once('simple_html_dom.php');

    $target_url = "http://www.tokyobit.com/";
    $html = new simple_html_dom();
    $html->load_file($target_url);

    // Print the href attribute of every <a> element on the page
    foreach ($html->find('a') as $link) {
        echo $link->href . "<br />";
    }
    ?>

You should get a page full of URLs! Wonderful. Most of them will be internal links, of course. In a real world situation, Google would ignore internal links and simply look at what other websites you’re linking to, but that’s outside the scope of this tutorial.
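If you are curious how internal and external links could be told apart, here is a rough sketch using PHP’s built-in parse_url() function. The is_external_link() function and the sample URLs are my own inventions for illustration, and this is only one simple way to do it, certainly not how Google does it.

```php
<?php
// Hypothetical helper (not part of Simple HTML DOM): returns true when
// $href points at a different host than $site_host.
function is_external_link($href, $site_host) {
    $host = parse_url($href, PHP_URL_HOST);
    // Relative links such as "/contact" have no host, so treat them as internal.
    if (!is_string($host) || $host === '') {
        return false;
    }
    // Host names are case-insensitive, so compare them that way.
    return strcasecmp($host, $site_host) !== 0;
}

// Sample hrefs made up for illustration; in the crawler these would come
// from $link->href inside the foreach loop.
$hrefs = array(
    'http://www.tokyobit.com/about',
    '/contact',
    'http://www.example.com/page',
);

foreach ($hrefs as $href) {
    if (is_external_link($href, 'www.tokyobit.com')) {
        echo $href . "<br />";
    }
}
?>
```

Only the www.example.com link is printed. Bear in mind that this simple comparison treats tokyobit.com and www.tokyobit.com as two different sites, so a real crawler would need something smarter.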

If you’re running on your own server, go ahead and change the $target_url variable to your own webpage or any other website you’d like to examine.

That code was quite a jump from the last example, so let’s go through in pseudo-code to make sure you understand what’s going on.

Include once the simple HTML DOM helper file.

Set the target URL as http://www.tokyobit.com.

Create a new simple HTML DOM object to store the target page

Load our target URL into that object

For each link <a> that we find on the target page

– Print out the HREF attribute
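The same steps can also be followed without the helper file. The sketch below swaps in PHP’s built-in DOMDocument class instead of Simple HTML DOM, and works on a small made-up page rather than a live website so you can try it offline. The extract_links() function name and the sample HTML are assumptions of mine for illustration.

```php
<?php
// Hypothetical function name, for illustration only.
function extract_links($html_source) {
    // Create a new document object to store the page
    $doc = new DOMDocument();
    // Real-world pages are rarely perfectly valid, so collect parser
    // warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    // Load the page source into that object
    $doc->loadHTML($html_source);
    libxml_clear_errors();

    $links = array();
    // For each <a> element found in the document...
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        // ...collect the href attribute, skipping anchors without one.
        if ($anchor->hasAttribute('href')) {
            $links[] = $anchor->getAttribute('href');
        }
    }
    return $links;
}

// A small made-up page stands in for the downloaded target page.
$sample = '<html><body><a href="/about">About</a> <a href="http://www.example.com/">Example</a></body></html>';

foreach (extract_links($sample) as $href) {
    echo $href . "<br />";
}
?>
```

Returning the links as an array, rather than echoing them straight away, makes it easier to do something else with them later, such as filtering or counting.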

That’s it for today, but if you’d like a bit of a challenge, try to modify the second example so that instead of searching for links (<a> elements), it grabs images instead (<img>). Remember, the src attribute of an image specifies the URL for that image, not HREF.

Would you like to learn more? Let me know in the comments if you’re interested in reading a part 2 (complete with homework solution!), or even if you’d like a back-to-basics PHP tutorial, and I’ll rustle one up next time for you. I warn you though: once you get started with programming in PHP, you’ll start making plans to create the next Facebook, and all those latent desires for world domination will soon consume you. Programming is fun.

Comments (40)
  • Jon

    This is a really good article. I am an investor of Mozenda and I think it is great that you help people learn what is possible with all of the data out there on the Internet.

    Thanks.

  • Simon

    The php errors mentioned by others above are probably because the double and single quotes that you copied from this example are not recognized by your server’s php editor / compiler. After you copy the code into your own php file, locate each double quote, delete it, and reinsert using your computer’s keyboard. Do the same for the single quotes. I guess it depends on your keyboard language setting, which gives you a different variant of single and double quotes.

    The double quotes in this example are ” …….. my server’s php wants to see ". A very subtle change. HTH.

  • Aki

    Exact same question as bombay: what if you want the content, the inside of the content?

  • bombay

    If someone wants to search content of the website ?

  • AppaCyber Net

    tHIS is what I’m looking for … I wanna have a try with it then. thanks for the tutorial

Affiliate Disclaimer

This review may contain affiliate links, which pay us a small compensation if you do decide to make a purchase based on our recommendation. Our judgement is in no way biased, and our recommendations are always based on the merits of the items.

For more details, please read our disclosure.