This is part 2 in a series I started last time about how to build a web crawler in PHP. Previously I introduced the Simple HTML DOM helper file, and showed you how incredibly simple it was to grab all the links from a webpage – a common task for search engines like Google.

If you read part 1 (How To Build A Basic Web Crawler To Pull Information From A Website) and followed along, you’ll know I set some homework to adjust the script to grab images instead of links.


I dropped some pretty big hints, but if you didn’t get it or if you couldn’t get your code to run right, then here is the solution. I added an additional line to output the actual images themselves as well, rather than just the source address of the image.

<?php
// Load the Simple HTML DOM helper from part 1
include_once('simple_html_dom.php');

$target_url = "http://www.tokyobit.com";

// Fetch and parse the target page
$html = new simple_html_dom();
$html->load_file($target_url);

// Find every <img> tag, then print its source address and the image itself
foreach($html->find('img') as $img)
{
    echo $img->src."<br />";
    echo $img."<br />";
}
?>

This should output something like this:

[Screenshot: the script’s output – a list of image source URLs, each followed by the image itself]

Of course, the results are far from elegant, but it does work. Notice that the script can only grab images that appear in the content of the page as <img> tags – a lot of the page design elements are hard-coded into the CSS, so our script can’t grab those. Again, you can run this through my server at this URL if you wish, but to set your own target site you’ll have to edit the code and run it on your own server, as I explained in part 1. At this point, bear in mind that downloading images from a website puts significantly more stress on the server than simply grabbing text links, so only try the script on your own blog or mine, and try not to refresh it lots of times.
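
If you do plan to re-run the script while you experiment, one polite option is to cache the fetched page on disk so the target server is only hit once. Here’s a minimal sketch of that idea – the cache file name and the one-hour expiry are example values I’ve picked, and str_get_html() is the Simple HTML DOM shortcut for parsing a string rather than a file:

<?php
include_once('simple_html_dom.php');

$target_url = "http://www.tokyobit.com";
$cache_file = 'page_cache.html'; // example cache file name

// Only fetch the remote page if our local copy is missing or over an hour old
if (!file_exists($cache_file) || time() - filemtime($cache_file) > 3600) {
    file_put_contents($cache_file, file_get_contents($target_url));
}

// Parse the cached copy instead of hitting the server again
$html = str_get_html(file_get_contents($cache_file));

foreach($html->find('img') as $img)
{
    echo $img->src."<br />";
}
?>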

Let’s move on and be a little more adventurous. We’re going to build upon our original file, and instead of just grabbing all the links randomly, we’re going to make it do something more useful by getting the post content instead. We can do this quite easily because standard WordPress wraps the post content within a <div class="post"> tag, so all we need to do is grab any div with that class and output it – effectively stripping everything except the main content out of the original site. Here is our initial code:

<?php
include_once('simple_html_dom.php');

$target_url = "http://www.tokyobit.com";

$html = new simple_html_dom();
$html->load_file($target_url);

// Grab every div with the class "post" and output it
foreach($html->find('div[class=post]') as $post)
{
    echo $post."<br />";
}
?>

You can see the output by running the script from here (forgive the slowness, my site is hosted at GoDaddy and they don’t scale very well at all), but it doesn’t contain any of the original design – it is literally just the content.

Let me show you another cool feature now – the ability to delete elements of the page that we don’t like. For instance, I find the meta data quite annoying – like the date and author name – so I’ve added some more code that finds those bits (identified by various div classes such as post-date, post-info, and meta). I’ve also added a simple CSS style-sheet to format the output a little. Daniel covered a number of great places to learn CSS online (Top 5 Sites To Learn CSS Online) if you’re not familiar with it.

As I mentioned in part 1, even though the file contains PHP code, we can still add standard HTML or CSS to the page and the browser will understand it just fine – the PHP code is run on the server, then everything is sent to the browser, to you, as standard HTML. Anyway, here’s the whole final code:


<head>
<style type="text/css">
div.post{background-color: gray;border-radius: 10px;-moz-border-radius: 10px;padding:20px;}
img{float:left;border:0px;padding-right: 10px;padding-bottom: 10px;}
body{width:60%;font-family: verdana,tahoma,sans-serif;margin-left:20%;}
a{text-decoration:none;color:lime;}
</style>
</head>

<?php
include_once('simple_html_dom.php');

$target_url = "http://www.tokyobit.com";

$html = new simple_html_dom();
$html->load_file($target_url);

foreach($html->find('div[class=post]') as $post)
{
    // Blank out the meta data divs we don't want, then output the rest
    $post->find('div[class=post-date]',0)->outertext = '';
    $post->find('div[class=post-info]',0)->outertext = '';
    $post->find('div[class=meta]',0)->outertext = '';
    echo $post."<br />";
}
?>

You can check out the results here. Pretty impressive, huh? We’ve taken the content of the original page, got rid of a few bits we didn’t want, and completely reformatted it in the style we like! And more than that, the process is now automated, so if new content were published on the original site, it would automatically show up in our script’s output too.

[Screenshot: the scraped post content, reformatted with our CSS style-sheet]

That’s only a fraction of the power available to you though – you can read the full manual online here if you’d like to explore the PHP Simple HTML DOM helper a little more and see how greatly it aids and simplifies the web crawling process. It’s a great way to take your knowledge of basic HTML up to the next, dynamic level.
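
To give you a quick taste, here are a few of the other selector styles the manual describes. This is just a sketch – the #container ID and the tags I’m searching for here are made-up examples, not anything specific to my site:

<?php
include_once('simple_html_dom.php');

$html = new simple_html_dom();
$html->load_file("http://www.tokyobit.com");

// Find a single element by ID – the second argument picks the first match
$container = $html->find('div#container', 0);

// Descendant selectors work too: every list item inside any <ul>
foreach($html->find('ul li') as $item)
{
    echo $item->plaintext."<br />"; // plaintext strips out all the inner tags
}

// Each element object exposes its attributes and inner HTML directly
foreach($html->find('a') as $link)
{
    echo $link->href." : ".$link->innertext."<br />";
}
?>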

What could you use this for though? Well, let’s say you own lots of websites and wanted to gather all the contents onto a single site. You could copy and paste the contents every time you update each site, or you could just do it all automatically with this script. Personally, even though I may never use it, I found the script to be a useful exercise in understanding the underlying structure of modern internet documents. It also exposes how simple it is to re-use content when everything is published on a similar system using the same semantics.
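
As a rough illustration of that idea, here’s how you might loop over a whole list of target sites and pull the posts from each one in turn. The array of URLs is just an example, and this assumes every site in the list wraps its posts in the same <div class="post"> markup:

<?php
include_once('simple_html_dom.php');

// Example list of sites to aggregate – swap in your own URLs
$target_urls = array(
    "http://www.tokyobit.com",
    "http://www.example.com/blog"
);

foreach($target_urls as $url)
{
    $html = new simple_html_dom();
    $html->load_file($url);

    // Pull out the post content from each site, just as before
    foreach($html->find('div[class=post]') as $post)
    {
        echo $post."<br />";
    }

    // Free the parser's memory before moving on to the next site
    $html->clear();
}
?>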

What do you think? Again, do let me know in the comments if you’d like to learn some more basic web programming, as I feel like I’ve started you off on level 5 and skipped the first 4! Did you follow along and try yourself, or did you find it a little too confusing? Would you like to learn more about some of the other technologies behind the modern internet browsing experience?

If you’d prefer learning to program on the desktop side of things, Bakari covered some great beginner resources for learning Mac OS X desktop programming with Cocoa (6 Beginner Resources For Learning Mac Programming) at the start of the year, and our featured directory app CodeFetch is useful for any programming language. Remember, skills you develop programming in any language can be used across the board.

  1. Alex Jenkins
    December 3, 2016 at 6:57 pm

    Is there a way to use a list of target_urls instead of specifying a single url to crawl?

  2. Prasain
    December 10, 2014 at 9:58 am

    I need to make simple web crawler using PHP. In this program, user defines a URL and it searches all links, and records response time only. Every helpful suggestions are welcomed. Thankx in advance

  3. Chris Horsnell
    January 11, 2011 at 11:08 am

    interesting article, however I would be more interested in less documentation of simplehtmldom, and more information on general web crawler concepts. i am writing a crawler bot in PHP at the mo and am looking for a nice efficient way of crawling and looping pages

    currently its manually initiated but I have been considering tweaking it to run as a CRON, and how that would affect the way the script works and how many pages it parses per instance of the code

    • James Bruce
      January 11, 2011 at 11:11 am

      That's a valid point Chris, but I must admit I was trying to focus the article at very much a beginner level and Simple HTML DOM seemed like the best way to explain along with real world practical example. I will look into more general crawler concepts for a future article though, perhaps more conceptual than code-based?

      • Chris Horsnell
        January 11, 2011 at 11:41 am

        Sounds good, I have looked all over and although I admit that PHP probably isnt the best language for the job, for my circumstances its what I require, and finding suitable articles on the guts of a web crawler rather than the scraping side is proving difficult.

        If you wish to discuss any ideas let me know :)

  4. Strosdegoz
    January 5, 2011 at 9:40 am

    What about a way to get all the images including those hard-coded?
    For example a way to get every jpg, png, etc files link...

    • James Bruce
      January 6, 2011 at 8:51 am

      that would take a few more steps, in that you would have to trawl the whole site for any css files, then extract the image links from those. Not impossible though, but if all you want to do is suck everything from a certain website, a simple application like site sucker is certainly more adept to the task. Crawling and scraping is more about taking certain content and reformatting it for your nefarious usage.

      http://www.sitesucker.us/home.html

  5. SearchBlox
    January 2, 2011 at 4:27 pm

    You can also try using SearchBlox (http://www.searchblox.com/) for crawling websites.

  6. Siddhartha
    December 24, 2010 at 5:02 pm

    No that was not the problem .. anyways i got it working ..
    Here's the code :
    [code]

    <style type="text/css">
    div.post{background-color: gray;border-radius: 10px;-moz-border-radius: 10px;padding:20px;}
    img{float:left;border:0px;padding-right: 10px;padding-bottom: 10px;}
    body{width:60%;font-family: verdana,tahoma,sans-serif;margin-left:20%;}
    a{text-decoration:none;color:lime;}
    </style>

    <?php
    include_once('simple_html_dom.php');
    $target_url = "http://www.tokyobit.com";
    $html = new simple_html_dom();
    $html->load_file($target_url);
    foreach($html->find('div[class=post]') as $post)
    {
    $post->find('div[class=post-date]',0)->outertext = '';
    $post->find('div[class=post-info]',0)->outertext = '';
    $post->find('div[class=meta]',0)->outertext = '';
    echo $post;
    }
    ?>
    [/code]

  7. Blipton
    December 21, 2010 at 4:55 am

    There's a book on spider bots that can read/parse websites for you, but it uses and recommends curl.

    What are your thoughts on curl, and more importantly, do you have a suggestion on how to set up a curl script so that it runs on your hosted site? I've tried numerous "curl supported" hosting sites, but have never been able to get a curl script to actually execute!

    • James Bruce
      December 21, 2010 at 10:18 am

      Never tried CURL myself, but GoDaddy claims to support CURL fully.

      I've uploaded a simple demo script here so you can see it does indeed work on my host, GoDaddy. It just checks if a domain name is available or not.

      http://www.tokyobit.com/tutorial/curldemo.php

      you can get the sourcecode for that little demo here:

      http://www.tokyobit.com/tutorial/curldemo.php.zip

      cURL is certainly the more advanced way to do all this, but having just gotten started on scraping I opted for the easier option of Simple HTML DOM instead with default read methods.

      You can use Simple HTML DOM as well as cURL, by first reading the entire page with cURL and then simply feeding the text into Simple HTML DOM with str_get_html($var)

      • Blipton
        January 6, 2011 at 5:54 am

        Thanks, i wasn't able to download the source but the godaddy example works fine!

        Was there any special GoDaddy configuration that you needed to do to execute it?

        I'd love to try this.. I just created an account with FreeHostia, which supposedly has curl support. If the 'curldemo.php.zip' source is not available anymore, could I upload just the php file and see if it works?

        Getting the configuration for curl and a given host provider to play nicely seems to be the tricky part... it's as if the host provider does not allow curl to execute and/or write files to the local (host) directory where the php script lives!

        • James Bruce
          January 6, 2011 at 8:57 am

          apologies, looks like wordpress was messing with the link. try this one:

          http://tokyobit.com/tutorial/curldemo.zip

          you'll need to download the source code of the file as there is no way to get the php file otherwise without executing it first. just download the new zip link above, unzip and upload the php contained within.

          certainly, finding a decent host is a difficult task. in the past, ive liked the fact that godaddy just lets you host millions of domains with unlimited space, but now i'm developing more i'm really starting to see the limitations. soon as I get a spare moment, ill be moving my sites over to MediaTemple, whom I work with daily for my day job - a simple virtual server where you can set the server up as you like, and it's incredibly fast. Budget hosts like godaddy are unfortunately exactly that - budget. If you can justify the extra cost, a virtual server is the only way to go.

  8. Theprateeksharma
    December 19, 2010 at 6:44 pm

    Thanks for this great series !!!!!!!!!
    waiting for more!!!!!!!

  9. Coolsid8
    December 19, 2010 at 1:34 pm

    Doesn't work ..
    Parse error: parse error in C:\wamp\www\crawler.php on line 11

    • James Bruce
      December 21, 2010 at 10:01 am

      Hi Coolsid8, I'm afraid I haven't tried the script on a local Windows machine yet – there may be some specifics to running on Windows that I'm not aware of.

      Hmm, looking at the formatting the site has done to the code, it may be related to this part:

      include_once(‘simple_html_dom.php’);

      Change those single apostrophes to the normal ' type, and that might help. Sorry, MUO is not set up to handle code segments correctly it seems.

  10. guest
    December 18, 2010 at 1:01 am

    Just stumbled by, but very informative. Thank you!

  11. John Arleth
    December 17, 2010 at 8:42 pm

    I would like to learn more. I do think a basic tutorial on HTML would be helpful, too.
