How To Build A Basic Web Crawler To Pull Information From A Website (Part 1)

Web crawlers, sometimes called scrapers, automatically scan the Internet, attempting to glean the context and meaning of the content they find. The web wouldn’t function without them. Crawlers are the backbone of search engines, which, combined with clever algorithms, work out the relevance of your page to a given keyword set.

The Google web crawler will enter your domain and scan every page of your website, extracting page titles, descriptions, keywords, and links  – then report back to Google HQ and add the information to their huge database.

Today, I’d like to teach you how to make your own basic crawler – not one that scans the whole Internet, though, but one that is able to extract all the links from a given webpage.


Generally, you should make sure you have permission before scraping random websites, as most people consider it to be a very grey legal area. Still, as I say, the web wouldn’t function without this kind of crawler, so it’s important you understand how they work and how easy they are to make.

To make a simple crawler, we’ll be using the most common programming language of the internet – PHP. Don’t worry if you’ve never programmed in PHP – I’ll be taking you through each step and explaining what each part does. I am going to assume a very basic knowledge of HTML though, enough that you understand how a link or image is added to an HTML document.

Before we start, you will need a server to run PHP. You have a number of options here: the easiest are to run the code on your own web host (a self-hosted WordPress site, for example, is already running PHP), or to set up a local test server on your own machine with something like WAMP.

Getting Started

We’ll be using a helper class called Simple HTML DOM. Download this zip file, unzip it, and upload the simple_html_dom.php file contained within to your website first (in the same directory you’ll be running your programs from). It contains functions we will be using to traverse the elements of a webpage more easily. That zip file also contains today’s example code.

First, let’s write a simple program that will check if PHP is working or not. We’ll also import the helper file we’ll be using later. Make a new file in your web directory, and call it example1.php – the actual name isn’t important, but the .php ending is. Copy and paste this code into it:


<?php
include_once('simple_html_dom.php');
phpinfo();
?>

Access the file through your internet browser. If everything has gone right, you should see a big page of random debug and server information printed out, like the screenshot below – all from that one little line of code! It’s not really what we’re after, but at least we know everything is working.

(Screenshot: the phpinfo() debug output)

The first and last lines simply tell the server we are going to be using PHP code. This is important because we can actually include standard HTML on the page too, and it will render just fine. The second line pulls in the Simple HTML DOM helper we will be using. The phpinfo(); line is the one that printed out all that debug info, but you can go ahead and delete that now. Notice that in PHP, any commands we have must be finished with a semicolon (;). The most common mistake of any PHP beginner is to forget that little bit of punctuation.
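
To see what I mean about mixing in standard HTML, here is a tiny example page (the file content is just filler for illustration):

<html>
<body>
<h1>My test page</h1>
<?php
// Anything inside the PHP tags runs on the server before the page is sent
echo "<p>This paragraph was generated by PHP.</p>";
?>
<p>This paragraph is plain HTML and is sent to the browser untouched.</p>
</body>
</html>

Save something like that as a .php file and the server will run the PHP part and pass the rest straight through.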

One typical task that Google performs is to pull all the links from a page and see which sites they are endorsing. Try the following code next, in a new file if you like.

<?php
include_once('simple_html_dom.php');

// The page we want to crawl
$target_url = "http://www.tokyobit.com/";

// Create a DOM object and load the target page into it
$html = new simple_html_dom();
$html->load_file($target_url);

// Loop over every <a> element on the page and print its href attribute
foreach($html->find('a') as $link){
    echo $link->href."<br />";
}
?>

You should get a page full of URLs! Wonderful. Most of them will be internal links, of course. In a real world situation, Google would ignore internal links and simply look at what other websites you're linking to, but that's outside the scope of this tutorial.
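
In case you’re curious, though, here is a rough sketch of how you might keep only the outbound links – it reuses the same Simple HTML DOM approach as above and simply compares the host of each link against the host of the target page (same example URL as before; this is only a sketch, not a robust filter):

<?php
include_once('simple_html_dom.php');

$target_url = "http://www.tokyobit.com/";
$home_host = parse_url($target_url, PHP_URL_HOST); // e.g. "www.tokyobit.com"

$html = new simple_html_dom();
$html->load_file($target_url);

foreach($html->find('a') as $link){
    $link_host = parse_url($link->href, PHP_URL_HOST);
    // Relative links have no host at all; links with the same host are internal - skip both
    if($link_host && $link_host != $home_host){
        echo $link->href."<br />";
    }
}
?>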

If you're running on your own server, go ahead and change the $target_url variable to your own webpage or any other website you'd like to examine.

That code was quite a jump from the last example, so let's go through it in pseudo-code to make sure you understand what's going on.

- Include the Simple HTML DOM helper file.
- Set the target URL as http://www.tokyobit.com.
- Create a new Simple HTML DOM object to store the target page.
- Load our target URL into that object.
- For each link (<a>) that we find on the target page:
  - Print out the HREF attribute.

That's it for today, but if you'd like a bit of a challenge - try to modify the second example so that instead of searching for links (<a> elements), it grabs images instead (<img>). Remember, the src attribute of an image specifies the URL for that image, not HREF.
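
To check your answer afterwards (do try it yourself first!), here is a rough sketch of one way to do it, using the same example URL as before:

<?php
include_once('simple_html_dom.php');

$target_url = "http://www.tokyobit.com/";
$html = new simple_html_dom();
$html->load_file($target_url);

// Same idea as example 2, but look for <img> elements and print their src attribute
foreach($html->find('img') as $image){
    echo $image->src."<br />";
}
?>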

Would you like to learn more? Let me know in the comments if you're interested in reading a part 2 (complete with homework solution!), or even if you'd like a back-to-basics PHP tutorial - and I'll rustle one up next time for you. I warn you though - once you get started with programming in PHP, you'll start making plans to create the next Facebook, and all those latent desires for world domination will soon consume you. Programming is fun.

Comments

Jack Cola

Welcome to MUO, James – it was a great article. For the past 3 years I’ve wanted to learn PHP, but never got around to it. Hopefully, in 2011 I will eventually learn it.

But do continue on to part 2. It will be interesting to see you continue your examples, i.e. example3.php, example4.php, etc.

Make Use Of needs to make use of more coding examples :D

James Bruce

Thanks for the encouraging words, Jack – appreciated. You’re right, MUO does indeed need more coding examples. I plan to, if the editors will allow it ;p

Do you have any other programming experience? PHP is easy because it’s untyped, so no worries about casting, no memory management, etc. – it’s quite a beginner-friendly language, really. And the fact that you can just throw HTML in anywhere makes it rather easy to produce some useful output.
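
For instance, a quick made-up snippet to show what I mean about the loose typing:

<?php
// No type declarations or casts needed - PHP juggles the types for us
$count = "5";             // a string...
$total = $count + 2;      // ...quietly treated as a number here
echo "<p>Total is $total</p>";  // and you can echo HTML straight out
?>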

Jack Cola

Hi James,
I have touched a bit on Python and Java, and I wrote a Java manual to help me study while I was learning it. I submitted it to Simon when he was the PDF editor, but he and Aibek knocked it back, and then when I was PDF editor it didn’t get published either. But I can see why programming articles don’t get through – it’s not part of their range of articles.

One of my friends is working on an HTML manual for MUO, so it’s a start…

Dominic Chester

PLEASE PLEASE PLEASE do a part 2.

James Bruce

Done!

shan

In the blog,
“you will need a server to run PHP. You have a number of options here:”. I thought I had the first option – a WordPress blog.

Does it support PHP? It looks like I cannot FTP.

If I install WordPress on my local machine as per the 3rd option, how does it differ from option 1? Thanks

James Bruce

Hi Shan.

Regarding WordPress – if you have a free blog with WordPress.com, then I’m afraid you can’t run your own PHP files on their server, as it would be a security risk for them, and no, there isn’t FTP access for that either. I should have made it clearer, but “self hosted” means running the WordPress system on your own website, which you likely pay for every month. If you have your own domain name that doesn’t have “wordpress.com” in it, then you probably have it running on your own server.

If it’s your own server, since you mentioned you cannot FTP into it, then you must have installed WordPress via Fantastico or GoDaddy’s Hosting Connection easy-install system, or by getting someone else to do it? You should still be able to FTP into your site… using any of the FTP clients mentioned in the linked tutorials, just point them to “ftp.yourdomain.com” (replacing yourdomain, obviously), using the password and username that your hosting company sent you originally. Also, if you can get access to your cPanel, there should be a Java-based FTP manager in there that you could use to explore the FTP with. If you have no idea what I just said, don’t worry – you probably don’t own your own website, so go with option 3 and run a local test server.

Regarding the differences between option 1 and option 3: option 3 is possible to do without a remote server at all – you can simply set up a test server on your own machine that will allow you to run PHP. You don’t have to install WordPress on it, but since WordPress is PHP-based, that is also something you can install to test out if you want. Option 1 is the easiest option for people who own their own website, and I only mentioned WordPress because if you’re running your own install of WordPress, then your server can definitely do PHP.

If you tell me your WordPress blog address or domain, I’ll be able to explain a little better…

Theprateeksharma

I love MUO
The only thing i missed in MUO was code
now
This new section has brought a smile on my face
I want to demand even more
THUMBS UP to MUO

Aibek

excellent article James!

reuven

Very nice indeed!
thank you for the guide and code

Drone

C’mon this is Script-Kiddie stuff… go perl, a little lwp stuff and a sprinkling of regex. Who needs php (yuk) much less Google!? Or maybe you’re hiding behind Google – eh? If-so, you are spineless.

James Bruce

lol, yes Perl would be a much better choice. But in this age of WordPress, I feel like PHP is better for readers to learn.

Not sure what you mean about hiding behind google?

JLG

Really nice, and useful.

Thank you!

JLG

Sid

Good and useful ..
i will try it out 2day .

Thanks

Siddhartha

Awesome man .. just tried it ..

Manjula_puttasetty

I want to read part 2.

Strosdegoz

I get this error when running example2.php

Parse error: syntax error, unexpected ‘:’ in /home/content/s/t/r/strosdegoz/html/apps/webcrawler.php on line 3

Help please

James Bruce

try a ; instead of a :

Brevillebov800xl

Hi…………

i found your site through search engine.i like your web
page content and very attractive theme.really i like this
it.

Thanks.:)

James Bruce

I think cURL might be needed, and it seems to be disabled by default in WAMP’s settings.

To enable cURL, left-click the WAMP taskbar icon and choose PHP > PHP Extension > php_curl.

Let us know if that helps
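
If you want to check whether cURL is actually available, a quick throwaway test script (just a sketch) will tell you:

<?php
// Is the cURL extension loaded in this PHP install?
if(function_exists('curl_init')){
    echo "cURL is enabled.";
}else{
    echo "cURL is NOT enabled - enable php_curl and restart WAMP.";
}
?>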

frank754

Can’t get it to work; I’m using WAMP and PHP, and I get an internal 500 error. Do certain settings need to be enabled in PHP?
I’m really trying hard to find a way to crawl the images on my site in the <img> tags. I tried Sphider and tried to modify that, to no avail yet (it just returns URLs).
My site (viewoftheblue.com/photography) has nearly 3000 images, and I need to somehow crawl it and get a list of images, along with the subdirectory each image is in, or else just a list of all the image files showing the path. Thanks

frank754

OK – I thought I posted a comment here earlier, but maybe not. I said I couldn’t get it to work. Anyway, I finally figured out a cool way to crawl a page for links and then get the images from those links in list form. A bit crude, but it works. I had to use a base URL also, to find the directories for the images.
<?php
include_once('simple_html_dom.php');

// Load the starting page into a Simple HTML DOM object
$page = "http://viewoftheblue.com/photography/";
$html = new simple_html_dom();
$html->load_file($page);

// Find all links
$links = array();
foreach($html->find('a') as $element) {
    $links[] = $element;
}
echo "Links found on $page:<br />";
foreach ($links as $out) {
    echo $out->href . "<br />";
}
echo "<br />";

// Parse resultant individual pages for images
foreach ($links as $subpage) {
    $base = "http://viewoftheblue.com/photography/";
    // Create DOM from URL
    $subpage = $subpage->href;
    $page = $base . $subpage;
    $html = file_get_html($page);

    // Find all images
    $images = array();
    foreach($html->find('img') as $element) {
        $images[] = $element->src;
    }
    echo "Images found on $page:<br />";
    foreach ($images as $out) {
        echo $out . "<br />";
    }
    echo "<br />";
}
?>