How To Build A Basic Web Crawler To Pull Information From A Website (Part 1)

Web crawlers, sometimes called scrapers, automatically scan the Internet, attempting to glean the context and meaning of the content they find. The web wouldn’t function without them. Crawlers are the backbone of search engines which, combined with clever algorithms, work out the relevance of your page to a given keyword set.

The Google web crawler will enter your domain and scan every page of your website, extracting page titles, descriptions, keywords, and links – then report back to Google HQ and add the information to their huge database.

Today, I’d like to teach you how to make your own basic crawler – not one that scans the whole Internet, though, but one that is able to extract all the links from a given webpage.


Generally, you should make sure you have permission before scraping random websites, as most people consider it to be a very grey legal area. Still, as I say, the web wouldn’t function without these kinds of crawlers, so it’s important you understand how they work and how easy they are to make.

To make a simple crawler, we’ll be using one of the most common server-side programming languages on the web – PHP. Don’t worry if you’ve never programmed in PHP – I’ll be taking you through each step and explaining what each part does. I am going to assume an absolute basic knowledge of HTML though, enough that you understand how a link or image is added to an HTML document.

Before we start, you will need a server to run PHP. You have a number of options here:

Getting Started

We’ll be using a helper class called Simple HTML DOM. Download this zip file, unzip it, and upload the simple_html_dom.php file contained within to your website first (in the same directory you’ll be running your programs from). It contains functions we will be using to traverse the elements of a webpage more easily. That zip file also contains today’s example code.

First, let’s write a simple program that will check if PHP is working or not. We’ll also import the helper file we’ll be using later. Make a new file in your web directory, and call it example1.php – the actual name isn’t important, but the .php ending is. Copy and paste this code into it:

 
<?php
include_once('simple_html_dom.php');
phpinfo();
?>

Access the file through your internet browser. If you don’t have a server set up, you can still run the program from my server if you want. If everything has gone right, you should see a big page of debug and server information printed out – all from that one little line of code! It’s not really what we’re after, but at least we know everything is working.

The first and last lines simply tell the server we are going to be using PHP code. This is important because we can actually include standard HTML on the page too, and it will render just fine. The second line pulls in the Simple HTML DOM helper we will be using. The phpinfo(); line is the one that printed out all that debug info, but you can go ahead and delete that now. Notice that in PHP, every statement must end with a semicolon (;). The most common mistake of any PHP beginner is to forget that little bit of punctuation.

One typical task that Google performs is to pull all the links from a page and see which sites they are endorsing. Try the following code next, in a new file if you like.

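<?php
include_once('simple_html_dom.php');

// The page whose links we want to list
$target_url = "http://www.tokyobit.com/";

// Create a DOM object and load the target page into it
$html = new simple_html_dom();
$html->load_file($target_url);

// Print the href attribute of every <a> element, one per line
foreach ($html->find('a') as $link) {
    echo $link->href . "<br />";
}
?>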

Again, you can run that from my server too if you don't have your own set up. You should get a page full of URLs! Wonderful. Most of them will be internal links, of course. In a real-world situation, a search engine like Google would be far more interested in which other websites you're linking to than in your internal links, but that's outside the scope of this tutorial.

If you're running on your own server, go ahead and change the $target_url variable to your own webpage or any other website you'd like to examine.
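
If you're curious how internal and external links might be told apart, here is a minimal sketch. It uses PHP's built-in parse_url() function. The helper name is_external_link() and the sample links are made up for illustration and are not part of Simple HTML DOM. It assumes a link is internal when it has no host of its own or when its host matches the page's host exactly.

```php
<?php
// Hypothetical helper (not part of Simple HTML DOM): returns true when
// $href points at a different host than $base_url.
function is_external_link($href, $base_url) {
    $link_host = parse_url($href, PHP_URL_HOST);
    $base_host = parse_url($base_url, PHP_URL_HOST);

    // Relative links such as "/about" or "page.html" have no host,
    // so we treat them as internal
    if (empty($link_host)) {
        return false;
    }

    // Host names are case-insensitive, so compare them that way
    return strcasecmp($link_host, $base_host) !== 0;
}

// Made-up sample links, checked against the tutorial's target site
$base = "http://www.tokyobit.com/";
$samples = array("/about", "http://www.tokyobit.com/contact", "http://www.example.com/");

foreach ($samples as $href) {
    $kind = is_external_link($href, $base) ? "external" : "internal";
    echo $href . " => " . $kind . "<br />";
}
?>
```

Note that this simple comparison treats a subdomain (or the same site without the "www.") as a different website, so a real crawler would need something smarter.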

That code was quite a jump from the last example, so let's go through in pseudo-code to make sure you understand what's going on.

Include once the simple HTML DOM helper file.

Set the target URL as http://www.tokyobit.com.

Create a new simple HTML DOM object to store the target page

Load our target URL into that object

For each link <a> that we find on the target page

- Print out the HREF attribute
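
If you can't upload the helper file, the same steps can be sketched with PHP's built-in DOMDocument class instead of Simple HTML DOM. This is a swapped-in alternative, not the method used in the example above. The function name extract_links() and the sample HTML are made up for illustration, and the page is supplied as a string so the sketch runs without fetching anything.

```php
<?php
// Hypothetical helper: returns the href of every <a> element in an HTML string,
// using PHP's built-in DOMDocument instead of Simple HTML DOM.
function extract_links($html_source) {
    $dom = new DOMDocument();

    // Real-world pages are rarely valid HTML, so collect parser
    // warnings internally instead of printing them
    libxml_use_internal_errors(true);
    $dom->loadHTML($html_source);
    libxml_clear_errors();

    $links = array();
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        // Skip anchors that have no href attribute
        if ($anchor->hasAttribute('href')) {
            $links[] = $anchor->getAttribute('href');
        }
    }
    return $links;
}

// Made-up sample page so the sketch runs without a network connection
$sample = '<html><body><a href="/about">About</a> '
        . '<a href="http://www.example.com/">Example</a> '
        . '<a name="top">No href</a></body></html>';

// For each link found, print out its href attribute
foreach (extract_links($sample) as $href) {
    echo $href . "<br />";
}
?>
```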

That's it for today, but if you'd like a bit of a challenge, try to modify the second example so that instead of searching for links (<a> elements), it grabs images instead (<img>). Remember, the src attribute of an image specifies the URL for that image, not HREF.

Would you like to learn more? Let me know in the comments if you're interested in reading a part 2 (complete with homework solution!), or even if you'd like a back-to-basics PHP tutorial - and I'll rustle one up next time for you. I warn you though - once you get started with programming in PHP, you'll start making plans to create the next Facebook, and all those latent desires for world domination will soon consume you. Programming is fun.

James Bruce

James is a keen grow-your-own gardener and records the Technophilia Podcast weekly with Dave and Justin.

  • Jack Cola

    Welcome to MUO James, it was a great article. For the past 3 years, I wanted to learn PHP, but never got around to it. Hopefully, in 2011 I will eventually learn it.

    But do continue on part 2. It looks interesting when you continue on your examples. Ie example3.php, example4.php etc

    Make Use Of needs to make use of more coding examples :D

    • http://ipadboardgames.org James Bruce

      Thanks for the encouraging words, Jack, appreciated. You’re right MUO does indeed need more coding examples. I plan to, if editors will allow it ;p

      Do you have any other programming experience? PHP is easy because it’s untyped so no worries about casting, no memory management etc – it’s quite a beginner-friendly language really. And the fact that you can just throw HTML in anywhere makes it rather easy to produce some useful output.

      • http://www.jackcola.org/ Jack Cola

        Hi James,
        I have touched a bit on Python and Java, and I wrote a Java manual to help me study while I was learning it. I submitted it to Simon when he was the PDF editor, but he and Aibek knocked it back, and then when I was PDF editor, it didn’t get published either. But I can see why programming articles don’t get through, because it’s not part of their range of articles.

        One of my friends is working on an HTML manual for MUO, so it’s a start…

  • Dominic Chester

    PLEASE PLEASE PLEASE do a part 2.

  • James Bruce

    Done!

  • shan

    In the blog,
    “you will need a server to run PHP. You have a number of options here:”. I thought I had the first option – wordpress blog.

    Does it support php ? it looks like i cannot ftp.

    If I install wordpress in my local machine as per 3rd option, how does it differ from option1? Thanks

  • James Bruce

    Hi Shan.

    Regarding WordPress – If you have a free blog with WordPress.com, then I’m afraid you can’t run your own PHP files on their server as it would be a security risk for them, and no, there isn’t FTP access for that either. I should have made it clearer, but “self hosted” means running the WordPress system on your own website, which you likely pay for every month. If you have your own domain name that doesn’t have “wordpress.com” in it, then you probably have it running on your own server.

    If it’s your own server, since you mentioned you cannot FTP into it, then you must have installed WordPress via Fantastico or Godaddy Hosting Connection easy install system, or by getting someone else to do it? You should still be able to FTP into your site… using any of the FTP programs mentioned in the linked tutorials, just point them to “ftp.yourdomain.com” (replacing yourdomain, obviously), and use the password and username that your hosting company sent you originally. Also, if you can get access to your CPanel, there should be a Java-based FTP manager in there that you could try to explore the FTP with. If you have no idea what I just said, don’t worry – you probably don’t own your own website, so you should do option 3 and run a local test server.

    Regarding the differences between no. 1 and no. 3: option 3 is possible without a remote server at all. You can simply set up a test server on your own machine that will allow you to run PHP. You don’t have to install WordPress on it, but since WordPress is PHP-based, that is also something you can install to test out if you want. No. 1 is the easiest option for people who own their own website, and I only mentioned WordPress because if you’re running your own install of WordPress, then your server can definitely do PHP.

    If you tell me your WordPress blog address or domain, I’ll be able to explain a little better…

  • Theprateeksharma

    I love MUO
    The only thing i missed in MUO was code
    now
    This new section has brought a smile on my face
    I want to demand even more
    THUMBS UP to MUO

  • Aibek

    excellent article James!

  • reuven

    Very nice indeed!
    thank you for the guide and code

  • Drone

    C’mon this is Script-Kiddie stuff… go perl, a little lwp stuff and a sprinkling of regex. Who needs php (yuk) much less Google!? Or maybe you’re hiding behind Google – eh? If-so, you are spineless.

  • James Bruce

    lol, yes Perl would be a much better choice. But in this age of WordPress, I feel like PHP is better for readers to learn.

    Not sure what you mean about hiding behind google?

  • JLG

    Really nice, and useful.

    Thank you!.

    JLG

  • Sid

    Good and useful ..
    i will try it out 2day .

    Thanks

  • Siddhartha

    Awesome man .. just tried it ..

  • Blipton

    There’s a book on spider bots, that can read/parse websites for you, but it uses and recommends curl.

    What are your thoughts on curl, and more importantly, do you have a suggestion on how to set up a curl script so that it runs on your hosted site? I’ve tried numerous “curl supported” hosting sites, but have never been able to get a curl script to actually execute!

  • JLG

    Hi!

    First, sorry for my english because i´ve never be a good student.

    I´ve tried to change the example1.php to make what i want. I´ll paste it, then i ask you:

    include_once(‘simple_html_dom.php’);
    /*$target_url = “http://www.twitter.com“;*/
    $name=$_GET['dominio'];
    $dom=$_GET['extension'];
    $target_url = “www.”.$dominio.”.”.$extension;
    $html = new simple_html_dom();
    $html->load_file($target_url);
    foreach($html->find(‘a’) as $link){
    echo “

  • Manjula_puttasetty

    I want to read part2

  • Strosdegoz

    I get this error when running example2.php

    Parse error: syntax error, unexpected ‘:’ in /home/content/s/t/r/strosdegoz/html/apps/webcrawler.php on line 3

    Help please

    • http://ipadboardgames.org James Bruce

      try a ; instead of a :

  • Brevillebov800xl

    Hi…………

    i found your site through search engine.i like your web
    page content and very attractive theme.really i like this
    it.

    Thanks.:)

  • frank754

    Can’t get it to work, I’m using WAMP and PHP. Get an internal 500 error. Do certain settings need to be enabled on PHP?
    I’m really trying hard to find a way to crawl the images on my site in the tags, I tried Sphider and tried to modify that to no avail yet (it just returns urls)
    My site (viewoftheblue.com/photography) has nearly 3000 images and I need to somehow crawl it and get a list of images and also know what subdirectory the images are in, or else just a list of all the image files showing the path. Thanks

    • http://ipadboardgames.org James Bruce

      I think CURL might be needed and seems to be disabled by default on WAMP settings.

      To enable curl left click WAMP taskbar icon and choose PHP > PHP Extension > php_curl

      Let us know if that helps

  • frank754

    OK – I thought I posted a comment here earlier, but maybe not. I said I couldn’t get it to work. Anyway, I finally figured out a cool way to crawl a page for links and then get the images from those links in list form. A bit crude, but it works. I had to use a base url also, to find the the directories for the images.
    <?php
    include('simple_html_dom.php');

    // Create DOM from URL
    $page = "http://viewoftheblue.com/photo…";
    $html = new simple_html_dom();
    $html->load_file($page);

    // Collect all link elements on the page
    $links = array();
    foreach ($html->find('a') as $element) {
        $links[] = $element;
    }

    echo "Links found on $page:<br /><br />";
    foreach ($links as $out) {
        echo "$out->href<br />";
    }
    echo "<br />";

    // Parse the resulting individual pages for images
    foreach ($links as $subpage) {
        $base = "http://viewoftheblue.com/photography/";
        // Create DOM from URL (the links are relative, so prepend the base)
        $page = $base . $subpage->href;
        $html = file_get_html($page);

        // Collect the src of every image on the subpage
        $images = array();
        foreach ($html->find('img') as $element) {
            $images[] = $element->src;
        }

        echo "Images found on $page:<br /><br />";
        foreach ($images as $out) {
            echo "$out<br />";
        }
        echo "<br />";
    }
    ?>
