
Web Crawlers, sometimes called scrapers, automatically scan the Internet, attempting to glean context and meaning from the content they find. The web wouldn’t function without them. Crawlers are the backbone of search engines which, combined with clever algorithms, work out the relevance of your page to a given keyword set.

The Google web crawler will enter your domain and scan every page of your website, extracting page titles, descriptions, keywords, and links  – then report back to Google HQ and add the information to their huge database.

Today, I’d like to teach you how to make your own basic crawler – not one that scans the whole Internet, though, but one that is able to extract all the links from a given webpage.


Generally, you should make sure you have permission before scraping random websites, as most people consider it to be a very grey legal area. Still, as I say, the web wouldn’t function without these kind of crawlers, so it’s important you understand how they work and how easy they are to make.

To make a simple crawler, we’ll be using the most common programming language of the internet – PHP. Don’t worry if you’ve never programmed in PHP – I’ll be taking you through each step and explaining what each part does. I am going to assume an absolute basic knowledge of HTML though, enough that you understand how a link or image is added to an HTML document.
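As a quick refresher, this is all the HTML you really need to recognise – a link stores its destination in the href attribute, and an image stores its location in src (the URLs below are just placeholders, not from any real site):

```html
<a href="http://www.example.com/">This is a link</a>
<img src="images/photo.jpg" alt="A photo">
```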

Before we start, you will need a server to run PHP. If you already have your own website running on a host you control, that will do the job; otherwise, you can set up a local test server (such as WAMP) on your own machine.


Getting Started

We’ll be using a helper class called Simple HTML DOM. Download this zip file, unzip it, and upload the simple_html_dom.php file contained within to your website first (in the same directory you’ll be running your programs from). It contains functions we will be using to traverse the elements of a webpage more easily. That zip file also contains today’s example code.

First, let’s write a simple program that will check if PHP is working or not. We’ll also import the helper file we’ll be using later. Make a new file in your web directory, and call it example1.php – the actual name isn’t important, but the .php ending is. Copy and paste this code into it:


<?php
include_once('simple_html_dom.php');
phpinfo();
?>

Access the file through your web browser. If everything has gone right, you should see a big page of random debug and server information printed out like below – all from one little line of code! It’s not really what we’re after, but at least we know everything is working.

[Screenshot: phpinfo() debug output]

The first and last lines simply tell the server we are going to be using PHP code. This is important because we can actually include standard HTML on the page too, and it will render just fine. The second line pulls in the Simple HTML DOM helper we will be using. The phpinfo(); line is the one that printed out all that debug info, but you can go ahead and delete it now. Notice that in PHP, any statements we write must be finished with a semicolon (;). The most common mistake of any PHP beginner is to forget that little bit of punctuation.
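To see what that mixing looks like in practice, here’s a minimal sketch of a .php file that interleaves plain HTML with PHP – the heading text is made up for illustration, not part of the example code:

```php
<h1>My First Crawler</h1>
<?php
// Everything outside the PHP tags is sent to the browser as ordinary HTML;
// everything inside them is executed on the server first.
echo "PHP is working";
?>
```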

One typical task that Google performs is to pull all the links from a page and see which sites they are endorsing. Try the following code next, in a new file if you like.

<?php
include_once('simple_html_dom.php');

$target_url = "http://www.tokyobit.com/";
$html = new simple_html_dom();
$html->load_file($target_url);

foreach($html->find('a') as $link){
    echo $link->href."<br />";
}
?>

You should get a page full of URLs! Wonderful. Most of them will be internal links, of course. In a real world situation, Google would ignore internal links and simply look at what other websites you’re linking to, but that’s outside the scope of this tutorial.
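If you want to mimic that behaviour, here’s one rough way to keep only the external links, using PHP’s built-in parse_url() to compare hosts – the filtering logic is my own sketch, not part of the original example code:

```php
<?php
include_once('simple_html_dom.php');

$target_url = "http://www.tokyobit.com/";
$target_host = parse_url($target_url, PHP_URL_HOST);

$html = new simple_html_dom();
$html->load_file($target_url);

foreach($html->find('a') as $link){
    // parse_url() returns null for the host of a relative link like "/about"
    $link_host = parse_url($link->href, PHP_URL_HOST);
    if($link_host !== null && $link_host !== $target_host){
        echo $link->href."<br />";
    }
}
?>
```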

If you’re running on your own server, go ahead and change the $target_url variable to your own webpage or any other website you’d like to examine.

That code was quite a jump from the last example, so let’s go through in pseudo-code to make sure you understand what’s going on.

Include once the simple HTML DOM helper file.

Set the target URL as http://www.tokyobit.com.

Create a new simple HTML DOM object to store the target page

Load our target URL into that object

For each link <a> that we find on the target page

– Print out the HREF attribute

That’s it for today, but if you’d like a bit of a challenge – try modifying the second example so that instead of searching for links (<a> elements), it grabs images instead (<img>). Remember, the src attribute of an image specifies the URL for that image, not href.
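If you’d rather not wait for part 2 to check your answer, the change amounts to only a few characters – here’s an untested sketch using the same setup as the second example:

```php
<?php
include_once('simple_html_dom.php');

$target_url = "http://www.tokyobit.com/";
$html = new simple_html_dom();
$html->load_file($target_url);

// Images keep their URL in src, so we search for <img> and print that instead
foreach($html->find('img') as $image){
    echo $image->src."<br />";
}
?>
```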

Would you like to learn more? Let me know in the comments if you’re interested in reading a part 2 (complete with the homework solution!), or even if you’d like a back-to-basics PHP tutorial – and I’ll rustle one up next time for you. I warn you though – once you get started with programming in PHP, you’ll start making plans to create the next Facebook, and all those latent desires for world domination will soon consume you. Programming is fun.

  1. Arkar
    September 15, 2016 at 9:44 am

    How do i list down ALL THE LINKS on a page? Including any of the LINKS inside the CSS, JS files, everywhere.

    Let's say, inside a CSS file:
    background: url("../images/banner.jpg");

    Also, inside a JS file:
    $("div.wrapper").css('background-image','url(../images/background.png)');

    How do i detect & listdown ALL KINDS of such links, linked items on a page?

    THANKS ALL!
    BIG RESPECTS!

  2. Coder64
    August 16, 2016 at 1:02 pm

    That code didnt work for me :(

    • James Bruce
      August 16, 2016 at 1:08 pm

      You should probably be more specific if you want help.

    • Coder64
      August 16, 2016 at 1:15 pm

      I have copied all example code, but php giving me errors and nothing happens. It says that there is undefined variable target_url and giving warning that file_get_contents() can not be empty

    • Coder64
      August 16, 2016 at 1:22 pm

      I fixed it allready!

  3. Siddharth Bharat Vitthaldas
    July 1, 2016 at 6:46 am

    Hey, This is working fine for me. I just need one more thing. Can we make this crawler crawl my whole website for links when i only provide the homepage link?

    That would be of great help.

    • Valentin Pearce
      July 7, 2016 at 3:32 am

      Hi,

      I'm trying to build a lightweight crawler that would do exactly that in order to build an sitemap that I could update whenever I run the crawler.

      I know it's hardly been a week since you asked but have you found anything ?

    • Siddharth Bharat Vitthaldas
      July 7, 2016 at 3:59 am

      No luck yet! I am using the same code on all the results returned in the first crawl. But it is not light weight at all.

    • Valentin Pearce
      July 7, 2016 at 6:07 am

      That's what I'm trying to do at the moment and I need to set up my RegEx to only crawl in my website and not the whole internet of course but the results seem incoherent with some links showing up as this :

      /example/page1

      and others showing up as this :

      http://sub.mydomain.com/otherexample/page42

      It's a pain to have to check this :/

  4. blake
    June 13, 2016 at 7:06 am

    Parse error: syntax error, unexpected ':' in /home/wknoblock/felixexi.com/example1.php on line 3

    • James Bruce
      June 13, 2016 at 7:18 am

      That should be a semi colon, not a colon.

    • blake
      June 13, 2016 at 7:02 pm

      load_file($target_url);
      foreach($html->find('a') as $link)
      {
      echo $link->href."";
      }
      ?>

      this is what I have- still shows that error

    • David
      July 27, 2016 at 4:53 am

      I had the same issue, and if like me you copied and pasted you are being undone by the quotes around the link. Delete the double quotes before and after the web address and replace with your own double or single quotes and it should fix it. If using a text editor like Brackets you should see that code change colour after you put in your own quote marks.

  5. David
    May 1, 2016 at 9:11 pm

    Hi, I am trying to run web crawler on my machine. I am getting below error
    The localhost page isn’t working.
    localhost is currently unable to handle this request.
    I have wordpress installed locally.

  6. Rudy
    April 20, 2016 at 3:15 am

    hi mr. james, it is me again. I have already done to emulate a browser visiting using PHP curl like you suggest. However, I still can't interact with the page and grab the value I want. Some people on the net told me that I should look for the ajax link contain the value I want. So using inspect element, I found that link in JS tab which contain all JSON value I want. My question is, how do I request this links using this web emulating and PHP script? I don't want to copy the link manually like some people suggest, I want the script whom doing it.

    Here it is the website I want so crawl. You can see the product page is the Ajax,
    https://www.tokopedia.com/p/dapur/peralatan-masak/kompor

    you can also see in the JS tab using the inspect element the JSON value, but I will copy the link of the site above for you:
    https://ace.tokopedia.com/search/v1/product?full_domain=www.tokopedia.com&scheme=https&device=desktop&source=directory&page=1&fshop=1&rows=24&sc=912&start=0&callback=angular.callbacks._0

    So any solutions?

  7. Rudy
    April 17, 2016 at 7:39 am

    hi, can simple_html_dom or PHPCrawl crawl a dynamic ajax or javascript content? if it is can, can you show me how to do that? I tried to combine this two methods and works for several websites, but when I tried to this two dynamic websites and I can't load the value I want.
    The value exist when I inspect the element, but when I view page source, the value is not in there.

    • James Bruce
      April 17, 2016 at 8:32 am

      Nope, that won't work. That's generated in your browser, so you would need to emulate a browser visiting and interact with the page, rather than simply fetching the content.

    • Rudy
      April 17, 2016 at 10:24 am

      Thank you for your suggestion, I will try to find how to emulate a browser visiting first.
      But if I hope you can give me another suggestion again if there is a problem again.

  8. Brijesh
    December 28, 2015 at 7:36 am

    yes it's so helpful. i want to know how get image,titile and description from these url's

  9. n0d3
    December 24, 2015 at 2:39 am

    wondering if its possible to change the code to search for a perticular model of laptop to be available for sale anywhere, i.e. x201s lenovo thinkpad as an example

  10. Awosiyan Peter
    November 15, 2015 at 8:39 pm

    you'll use $link->innertext @wine_81

  11. Ali Lotfi
    October 14, 2015 at 10:31 am

    Very Good.
    It was Very Helpful For Me
    Tnx.

  12. Eliz Clair
    October 5, 2015 at 9:04 pm

    Thank you so much. It's so helpful :))))) Thank you for your hard work!

  13. Tachibana Hotaru
    September 15, 2015 at 3:56 am

    if you want to get content of website, just do GET from the website and it will return the HTML, then you need to parse it, in java you can use jsoup to parse the syntax, or you can use simple regex to get the data.

  14. Tachibana Hotaru
    September 15, 2015 at 3:48 am

    In fact you can easily do POST/GET (post data to web/browse page) using ScrapperMin found in google Play Store

    Play Store Llink : https://play.google.com/store/apps/details?id=com.innosia.scrappermin

    Sample :
    WC_GetPage('http://www.innosia.com/transform', 'GET', '', '');

    It will auto crawl innosia.com and get the data, you can parse the data by saving the output to variable and start doing parsing.

    SET('DATA', WC_GetPage('http://www.innosia.com/transform', 'GET', '', ''));
    SO_SingleTagMatch(GET('DATA'), 'name="__RequestVerificationToken"', 'value="', '"');

    And it will output the security token of the website's form, then post back the data with the security token to website and the web will automatically think you are a human submitting data.

    Full Syntaax Guide : http://www.scrappermin.com/Content/latest/scrappermin/scrappermin-guide.pdf

  15. wine_81
    September 10, 2015 at 10:21 am

    Hi Mr. James Bruce,

    First of all, thank you very much for sharing this piece of knowledge, it is very useful.
    I have a question though, how do I get the text property of an anchor element ? I tried echo $link->text, but it shows nothing...

    (i.e an anchor element This is a link

    How do I get the text "This is a link" using this web crawler ?

    Thank you very much for your time and help.
    Edwin

  16. Daniel Carneiro
    July 19, 2015 at 2:22 am

    Exactly what I am looking for. Congratulations!

  17. Jon
    April 30, 2015 at 11:45 am

    This is a really good article. I am an investor of Mozenda and I think it is great that you help people learn what is possible with all of the data out there on the Internet.

    Thanks.

  18. Simon
    April 21, 2015 at 4:13 pm

    The php errors mentioned by others above are probably because the double and single quotes that you copied from this example are not recognized by your server's php editor / compiler. After you copy the code into your own php file, locate each double quote, delete it, and reinsert using your computer's keyboard. Do the same for the single quotes. I guess it depends on your keyboard language setting, which gives you a different variant of single and double quotes.

    The double quotes in this example are ” ........ my server's php wants to see " A very subtle change. HTH.

  19. Aki
    February 4, 2015 at 2:41 pm

    Exact same question as bombay, what if you want the content, the inside of the content
    ?

  20. bombay
    January 28, 2015 at 2:26 pm

    If someone wants to search content of the website ?

  21. AppaCyber Net
    January 24, 2015 at 12:00 am

    tHIS is what I'm looking for ... I wanna have a try with it then. thanks for the tutorial

  22. Ali Hajjow
    January 22, 2015 at 12:13 pm

    So Coool Thank you

  23. John
    December 10, 2014 at 4:01 pm

    Same with next example. You have used double not single quotes:-

    load_file($target_url);
    foreach($html->find(‘a’) as $link){
    echo $link->href.'';
    }
    ?>

    Maybe these were acceptable with an earlier version of PHP but now are not?

    John

  24. John
    December 10, 2014 at 3:46 pm

    I tried checking the supplied PHP code with an online syntax checker at phpcodechecker.com/ and it too finds the same errors I do?

    Can you please advise why this snippets have syntax errors in them and post the corrected code?

    Thanks
    John

  25. John
    December 10, 2014 at 3:38 pm

    Apart from the very first example none of the rest worked for me? I got:-

    Parse error: syntax error, unexpected ':' in C:\apache2.4\htdocs\example4.php on line 7

    So it did not like the colon in the http: link. I saw the answer saying use a semi colon. It did not explain why? I tried this and then it object to the ">" sign:-

    Parse error: syntax error, unexpected '>' in C:\apache2.4\htdocs\example4.php on line 7

    Any idea why? is there a syntax guide as PHP seems very fussy about syntax and you do not explain why or what could affect this. Your examples do not work as specified on my set up. I have Apache 2.4 working fine and PHP Version 5.6.1.

    Are there other settings which affect whether PHP will accept URLs and the -> sign which again you do not explain?

    Thanks

    John

  26. frank754
    January 16, 2011 at 2:39 am

    OK - I thought I posted a comment here earlier, but maybe not. I said I couldn't get it to work. Anyway, I finally figured out a cool way to crawl a page for links and then get the images from those links in list form. A bit crude, but it works. I had to use a base url also, to find the the directories for the images.
    load_file($page);

    // Find all links
    $links = array();
    foreach($html->find('a') as $element) {
    $links[] = $element;
    }
    reset($links);
    echo "Links found on $page:";
    foreach ($links as $out) {
    echo "$out->href";
    }
    echo"";

    // Parse resultant individual pages for images

    foreach ($links as $subpage) {
    $base = "http://viewoftheblue.com/photography/";
    // Create DOM from URL
    $subpage = $subpage->href;
    $page = $base . $subpage;
    $html = file_get_html($page);

    // Find all images
    $images = array();
    foreach($html->find('img') as $element) {
    $images[] = $element->src;
    }
    reset($images);
    echo "Images found on $page:";
    foreach ($images as $out) {
    $url = "$out";
    echo "$url";
    }
    echo"";
    }
    ?>

  27. frank754
    January 15, 2011 at 11:15 pm

    Can't get it to work, I'm using WAMP and PHP. Get an internal 500 error. Do certain settings need to be enabled on PHP?
    I'm really trying hard to find a way to crawl the images on my site in the <img> tags, I tried Sphider and tried to modify that to no avail yet (it just returns urls)
    My site (viewoftheblue.com/photography) has nearly 3000 images and I need to somehow crawl it and get a list of images and also know what subdirectory the images are in, or else just a list of all the image files showing the path. Thanks

  28. James Bruce
    January 15, 2011 at 10:25 pm

    I think CURL might be needed and seems to be disabled by default on WAMP settings.

    To enable curl left click WAMP taskbar icon and choose PHP > PHP Extension > php_curl

    Let us know if that helps

  30. Brevillebov800xl
    January 15, 2011 at 8:11 am

    Hi............

    i found your site through search engine.i like your web
    page content and very attractive theme.really i like this
    it.

    Thanks.:)

  31. Strosdegoz
    January 5, 2011 at 8:57 am

    I get this error when running example2.php

    Parse error: syntax error, unexpected ':' in /home/content/s/t/r/strosdegoz/html/apps/webcrawler.php on line 3

    Help please

    • James Bruce
      January 15, 2011 at 10:26 pm

      try a ; instead of a :

  33. Manjula_puttasetty
    January 4, 2011 at 9:47 am

    I want to read part2

  34. Siddhartha
    December 19, 2010 at 12:33 pm

    Awesome man .. just tried it ..

  35. Sid
    December 19, 2010 at 1:19 pm

    Good and useful ..
    i will try it out 2day .

    Thanks

  36. JLG
    December 15, 2010 at 4:38 am

    Really nice, and useful.

    Thank you!.

    JLG

  37. Drone
    December 13, 2010 at 1:29 pm

    C'mon this is Script-Kiddie stuff... go perl, a little lwp stuff and a sprinkling of regex. Who needs php (yuk) much less Google!? Or maybe you're hiding behind Google - eh? If-so, you are spineless.

    • James Bruce
      December 13, 2010 at 1:37 pm

      lol, yes Perl would be a much better choice. But this in this age of WordPress, I feel like PHP is better for readers to learn.

      Not sure what you mean about hiding behind google?

  39. reuven
    December 12, 2010 at 9:55 am

    Very nice indeed!
    thank you for the guide and code

  41. Aibek
    December 12, 2010 at 7:00 am

    excellent article James!

  42. Theprateeksharma
    December 11, 2010 at 9:34 pm

    I love MUO
    The only thing i missed in MUO was code
    now
    This new section has brought a smile on my face
    I want to demand even more
    THUMBS UP to MUO

  44. James Bruce
    December 11, 2010 at 5:40 pm

    Hi Shan.

    Regarding WordPress - If you have a free blog with WordPress.com, then I'm afraid you can't run your own PHP files on their server, as it would be a security risk for them, and no, there isn't FTP access for that either. I should have made it clearer, but "self hosted" means running the WordPress system on your own website, which you likely pay for every month. If you have your own domain name that doesn't have "wordpress.com" in it, then you probably have it running on your own server.

    If it's your own server, since you mentioned you cannot FTP into it, then you must have installed WordPress via Fantastico or GoDaddy's Hosting Connection easy install system, or by getting someone else to do it? You should still be able to FTP into your site... using any of the FTP clients mentioned in the linked tutorials, just point them to "ftp.yourdomain.com" (replacing yourdomain, obviously), and use the password and username that your hosting company sent you originally. Also, if you can get access to your cPanel, there should be a Java-based FTP manager in there that you could use to explore the FTP with. If you have no idea what I just said, don't worry - you probably don't own your own website, so use option 3 to run a local test server.

    Regarding the differences between no. 1 and 3 -> 3 is possible without a remote server at all; you can simply set up a test server on your own machine that will allow you to run PHP. You don't have to install WordPress on it, but since WordPress is PHP based, that is also something you can install to test out if you want. No. 1 is the easiest option for people who own their own website, and I only mentioned WordPress because if you're running your own install of WordPress, then your server can definitely do PHP.

    If you tell me your WordPress blog address or domain, I'll be able to explain a little better...

  45. shan
    December 11, 2010 at 4:20 pm

    In the blog,
    "you will need a server to run PHP. You have a number of options here:". I thought I had the first option - wordpress blog.

    Does it support php ? it looks like i cannot ftp.

    If I install wordpress in my local machine as per 3rd option, how does it differ from option1? Thanks

  48. Dominic Chester
    December 11, 2010 at 8:24 am

    PLEASE PLEASE PLEASE do a part 2.

    • James Bruce
      December 11, 2010 at 8:53 am

      Done!

    • d
      September 17, 2016 at 5:57 am

      n

  50. Jack Cola
    December 11, 2010 at 3:32 am

    Welcome to MUO James, it was a great article. For the past 3 years, I wanted to learn PHP, but never got around to it. Hopefully, in 2011 I will eventually learn it.

    But do continue on part 2. It looks interesting when you continue on your examples. Ie example3.php, example4.php etc

    Make Use Of needs to make use of more coding examples :D

    • James Bruce
      December 11, 2010 at 8:56 am

      Thanks for the encouraging words, Jack, appreciated. You're right MUO does indeed need more coding examples. I plan to, if editors will allow it ;p

      Do you have any other programming experience? PHP is easy because it's untyped so no worries about casting, no memory management etc - it's quite a beginner friendly language really. And the fact that you can just throw HTML in anywhere makes it rather easy to produce some useful output.

    • Jack Cola
      December 11, 2010 at 9:35 am

      Hi James,
      I have touched a bit on Python and Java, and I wrote a Java manual to help me study while I was learning it, and I submitted it to Simon when he was the PDF editor, but he and Aibek knocked it back, and then when I was PDF editor, it didn't get published either. But I can see why it programming articles don't get through because its not part of their range of articles.

      One of my friends is working on a HTML manual for MUO, so its a start...

    • Dominic Ejaria
      January 3, 2015 at 2:16 pm

      HI
      I am looking for a developer that can create a web crawler that i can incorporate into a search engine i have built. Do you know of any developer that can assist with this .
