How To Build A Basic Web Crawler To Pull Information From A Website

Have you ever wanted to programmatically capture specific information from a website for further processing? Say something like sports scores, stock market trends, or the latest fad: bitcoin and other cryptocurrency prices? If the information you need is available on a website, you can write a crawler (also known as a scraper or a spider) to navigate the website and extract just what you need. Let us find out how to do that in Python.

Please note that many websites discourage or forbid using a crawler to access the information they provide, so check a website's terms and conditions before deploying a crawler on it.

Installing Scrapy

We use a Python module called Scrapy to handle the actual crawling. It is fast and simple, and it can navigate multiple web pages just like you can with a browser.

Note, however, that Scrapy has no facilities to process JavaScript when navigating a website. So websites and apps that use JavaScript to render their user interfaces cannot be crawled properly with this approach.

Let us now install Scrapy. We install it inside a virtualenv, a Python virtual environment, which lets us install Scrapy into a directory of its own without affecting other system-installed modules.

Create a directory and initialize a virtual environment in that directory.

mkdir crawler
cd crawler
virtualenv venv
. venv/bin/activate
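
Note that the activation command above is for Linux and macOS. If you are on Windows, the activation script lives in a different folder; assuming a standard virtualenv layout, the equivalent command is:

venv\Scripts\activate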

You can now install Scrapy into this environment.

pip install scrapy

Check that Scrapy is installed properly by running it without any arguments.

scrapy
# prints
Scrapy 1.4.0 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
...

Building a Website Crawler (Also Called a Spider)

Let us now write a crawler to extract some information. We start by scraping the Wikipedia page on batteries at https://en.wikipedia.org/wiki/Battery_(electricity).

The first step in writing a crawler is to define a Python class that extends scrapy.Spider. Let us call this class spider1.

As a minimum, a spider class requires the following:

  • a name for identifying the spider, “Wikipedia” in this case.
  • a start_urls variable containing a list of URLs to begin crawling from. We use the Wikipedia URL shown above for our first crawl.
  • a parse() method which, even though a no-op for now, is used to process the webpage to extract what we want.

import scrapy

class spider1(scrapy.Spider):
    name = 'Wikipedia'
    start_urls = ['https://en.wikipedia.org/wiki/Battery_(electricity)']

    def parse(self, response):
        pass

We can now run this spider to check that everything is working. It is run as follows:

scrapy runspider spider1.py
# prints
2017-11-23 09:09:21 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
2017-11-23 09:09:21 [scrapy.utils.log] INFO: Overridden settings: {'SPIDER_LOADER_WARN_ONLY': True}
2017-11-23 09:09:21 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
...

Turning Off Logging

As you can see, running Scrapy with our minimal class generates a lot of output which does not make much sense to us. Let us set the logging level to WARNING and retry. Add the following lines to the beginning of the file:

import logging
logging.getLogger('scrapy').setLevel(logging.WARNING)

On re-running the spider, we should now see only warning and error messages in the log.

Using Chrome Inspector

Extracting information from a web page consists of determining the position of the HTML element from which we want information. A nice and easy way of finding the position of an element in the Chrome web browser is to use the Inspector.

  • Navigate to the correct page in Chrome.
  • Place the mouse on the element for which you want the information.
  • Right-click to pull up the context menu.
  • Select Inspect from the menu.

Opening the Developer Console

That should pop up the developer console with the Elements tab selected. Down below the tab, you should see the status bar with the position of the element shown as follows:

html body div#content.mw-body h1#firstHeading.firstHeading

As we explain below, you will use some or all parts of this position in your selectors.

Extracting the Title

Let us now add some code to the parse() method to extract the title of the page.

...
    def parse(self, response):
        print(response.css('h1#firstHeading::text').extract())
...

The response argument passed to the method supports a method called css(), which selects elements from the page using the given location. For our case, the element is h1#firstHeading. We need the text content of the element, so we add ::text to the selection. Finally, the extract() method returns the selected elements as a list of strings.
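
As an aside, if you want to experiment with selectors before putting them into a spider, Scrapy ships with an interactive shell. Here is a sketch of such a session (the exact prompts and startup messages depend on your Scrapy version):

scrapy shell 'https://en.wikipedia.org/wiki/Battery_(electricity)'
# inside the shell, response is already populated with the fetched page
>>> response.css('h1#firstHeading::text').extract()
[u'Battery (electricity)']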

On running Scrapy once again on this class, we get the following output:

[u'Battery (electricity)']

This shows the title has been extracted into a list of unicode strings.
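
If you would rather have a plain string than a one-element list, Scrapy selectors also provide extract_first(), which returns the first match (or None when nothing matches). A small variation on the method above:

    def parse(self, response):
        print(response.css('h1#firstHeading::text').extract_first())

We will meet extract_first() again in the final example below.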

How About the Description?

To demonstrate some more aspects of extracting data from web pages, let us get the first paragraph of the description from the above Wikipedia page.

On inspection using the Chrome Developer Console, we find that the location of the element is as follows (the right angle bracket (>) indicates a parent-child relationship between the elements):

div#mw-content-text>div>p

This location matches all the p elements, which together make up the entire description. Since we want just the first p element, we use the following extractor:

response.css('div#mw-content-text>div>p')[0]

To extract just the text content, we chain the ::text selector:

response.css('div#mw-content-text>div>p')[0].css('::text')

The final expression uses extract(), which returns a list of unicode strings. We use Python's str.join() method to combine them into a single string.
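
To see why the join is needed, note that extract() may return several text fragments for a single paragraph (plain text, link text, and so on); joining them reassembles the paragraph. A quick standalone illustration:

print(''.join([u'An ', u'electric battery', u' is a device']))
# prints: An electric battery is a device

Putting this together in parse():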

    def parse(self, response):
        print(''.join(response.css('div#mw-content-text>div>p')[0].css('::text').extract()))

The output from running Scrapy with this class is what we are looking for:

An electric battery is a device consisting of one or more electrochemical cells with external connections provided to power electrical devices such as flashlights, smartphones, and electric cars.[1] When a battery is supplying electric power, its positive terminal is
...

Collecting Data Using yield

The above code prints the extracted data to the console. When you need to collect data as JSON, you can use the yield statement. The way yield works is as follows: calling a function which contains a yield statement returns what is known as a generator to the caller. A generator is an object that the caller can repeatedly pull values from until it is exhausted.
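
As a quick standalone illustration of the concept (independent of Scrapy), here is a tiny generator and a loop that consumes it:

def squares(n):
    for i in range(n):
        yield i * i  # execution pauses here until the caller asks for the next value

for value in squares(4):
    print(value)
# prints 0, 1, 4 and 9, one value per line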

Here is code similar to the above, but which uses the yield statement to return the list of p elements within the HTML.

...
    def parse(self, response):
        for e in response.css('div#mw-content-text>div>p'):
            yield { 'para' : ''.join(e.css('::text').extract()).strip() }
...

You can now run the spider by specifying an output JSON file as follows:

scrapy runspider spider3.py -o joe.json

The output generated is as follows:

[
{"para": "An electric battery is a device consisting of one or more electrochemical cells with external connections provided to power electrical devices such as flashlights, smartphones, and electric cars.[1] When a battery is supplying electric power, its positive terminal is the cathode and its negative terminal is the anode.[2] The terminal marked negative is the source of electrons that when connected to an external circuit will flow and deliver energy to an external device. When a battery is connected to an external circuit, electrolytes are able to move as ions within, allowing the chemical reactions to be completed at the separate terminals and so deliver energy to the external circuit. It is the movement of those ions within the battery which allows current to flow out of the battery to perform work.[3] Historically the term \"battery\" specifically referred to a device composed of multiple cells, however the usage has evolved additionally to include devices composed of a single cell.[4]"},
{"para": "Primary (single-use or \"disposable\") batteries are used once and discarded; the electrode materials are irreversibly changed during discharge. Common examples are the alkaline battery used for flashlights and a multitude of portable electronic devices. Secondary (rechargeable) batteries can be discharged and recharged multiple
...
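
Since the output file is plain JSON, you can load it back into Python for further processing. A minimal sketch, assuming the joe.json file generated above:

import json

# load the list of {'para': ...} dictionaries written by the spider
with open('joe.json') as f:
    paras = json.load(f)

print(paras[0]['para'])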

Processing Multiple Bits of Information

Let us now look into extracting multiple related bits of information. For this example, we will extract the top IMDb Box Office hits for the current weekend. This information is available at http://www.imdb.com/chart/boxoffice, in a table with a row of information for each hit.

We extract the various fields in each row using the following parse() method. Again, the CSS locations of the elements were determined using the Chrome Developer Console, as explained above:

...
    def parse(self, response):
        for e in response.css('div#boxoffice>table>tbody>tr'):
            yield {
                'title': ''.join(e.css('td.titleColumn>a::text').extract()).strip(),
                'weekend': ''.join(e.css('td.ratingColumn')[0].css('::text').extract()).strip(),
                'gross': ''.join(e.css('td.ratingColumn')[1].css('span.secondaryInfo::text').extract()).strip(),
                'weeks': ''.join(e.css('td.weeksColumn::text').extract()).strip(),
                'image': e.css('td.posterColumn img::attr(src)').extract_first(),
            }
...

Note that the image selector above specifies that img is a descendant of td.posterColumn, and we are extracting the attribute called src using the expression ::attr(src).
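
The same ::attr() technique works for any attribute. As a hypothetical extension, you could capture each title's page link as well by adding a field like the following to the yielded dictionary (response.urljoin() resolves a relative href against the page's URL):

                'link': response.urljoin(e.css('td.titleColumn>a::attr(href)').extract_first()),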

Running the spider now returns the following JSON:

[
{"gross": "$93.8M", "weeks": "1", "weekend": "$93.8M", "image": "https://images-na.ssl-images-amazon.com/images/M/MV5BYWVhZjZkYTItOGIwYS00NmRkLWJlYjctMWM0ZjFmMDU4ZjEzXkEyXkFqcGdeQXVyMTMxODk2OTU@._V1_UY67_CR0,0,45,67_AL_.jpg", "title": "Justice League"},
{"gross": "$27.5M", "weeks": "1", "weekend": "$27.5M", "image": "https://images-na.ssl-images-amazon.com/images/M/MV5BYjFhOWY0OTgtNDkzMC00YWJkLTk1NGEtYWUxNjhmMmQ5ZjYyXkEyXkFqcGdeQXVyMjMxOTE0ODA@._V1_UX45_CR0,0,45,67_AL_.jpg", "title": "Wonder"},
{"gross": "$247.3M", "weeks": "3", "weekend": "$21.7M", "image": "https://images-na.ssl-images-amazon.com/images/M/MV5BMjMyNDkzMzI1OF5BMl5BanBnXkFtZTgwODcxODg5MjI@._V1_UY67_CR0,0,45,67_AL_.jpg", "title": "Thor: Ragnarok"},
...
]

Using Your Crawler

Let us now conclude this article with a few salient points:

  • Using Python with Scrapy makes it easy to write website crawlers to extract any information you need.
  • The Chrome Developer Console (or Firefox's Firebug tool) helps in locating the elements to extract.
  • Python’s yield statement helps in extracting repeated data elements.

Do you have any specific projects in mind for website scraping? And what issues have you faced trying to get it going? Please let us know in the comments below.

Image Credit: dxinerz/Depositphotos | Lulzmango/Wikimedia Commons
