How to Build a Basic Web Crawler to Pull Information From a Website

Programs that read information from websites, or web crawlers, have all kinds of useful applications. You can scrape stock information, sports scores, or text from a Twitter account, or pull prices from shopping websites.

Writing these web crawling programs is easier than you might think. Python has a great library for writing scripts that extract information from websites. Let’s look at how to create a web crawler using Scrapy.

Installing Scrapy

Scrapy is a Python library that was created to scrape the web and build web crawlers. It is fast, simple, and can navigate through multiple web pages without much effort.

Scrapy is available through PIP, Python’s package installer. If you need a refresher, see our guide on how to install PIP on Windows, Mac, and Linux.

Using a Python virtual environment is preferred because it lets you install Scrapy into an isolated directory that leaves your system packages alone. Scrapy’s documentation recommends doing this to get the best results.

Create a directory and initialize a virtual environment.

mkdir crawler
cd crawler
virtualenv venv
. venv/bin/activate
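
If virtualenv isn’t available on your system, Python 3’s built-in venv module is an equivalent way to create the environment (an alternative to the commands above, not a required extra step):

python3 -m venv venv
. venv/bin/activate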

You can now install Scrapy into that environment using a PIP command.

pip install scrapy

A quick check to make sure Scrapy is installed properly:

scrapy
# prints
Scrapy 1.4.0 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
...

How to Build a Web Crawler

Now that the environment is ready, you can start building the web crawler. Let’s scrape some information from a Wikipedia page on batteries: https://en.wikipedia.org/wiki/Battery_(electricity).

The first step in writing a crawler is defining a Python class that extends scrapy.Spider. This gives you access to all the functions and features in Scrapy. Let’s call this class spider1.

A spider class needs a few pieces of information:

  • a name for identifying the spider
  • a start_urls variable containing a list of URLs to crawl from (the Wikipedia URL will be the example in this tutorial)
  • a parse() method which is used to process the webpage to extract information

import scrapy

class spider1(scrapy.Spider):
    name = 'Wikipedia'
    start_urls = ['https://en.wikipedia.org/wiki/Battery_(electricity)']

    def parse(self, response):
        pass

A quick test to make sure everything is running properly:

scrapy runspider spider1.py
# prints
2017-11-23 09:09:21 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
2017-11-23 09:09:21 [scrapy.utils.log] INFO: Overridden settings: {'SPIDER_LOADER_WARN_ONLY': True}
2017-11-23 09:09:21 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
...

Turning Off Logging

Running Scrapy with this class prints log information that won’t help you right now. Let’s simplify things by removing this excess log information. Set Scrapy’s logger to the WARNING level by adding these lines to the beginning of the file.

import logging
logging.getLogger('scrapy').setLevel(logging.WARNING)

Now when you run the script again, the log information will not print.
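
For reference, the whole spider1.py file should now look like this:

import logging
import scrapy

# Suppress Scrapy's INFO-level messages; only warnings and errors will print
logging.getLogger('scrapy').setLevel(logging.WARNING)

class spider1(scrapy.Spider):
    name = 'Wikipedia'
    start_urls = ['https://en.wikipedia.org/wiki/Battery_(electricity)']

    def parse(self, response):
        pass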

Using the Chrome Inspector

Everything on a web page is stored in HTML elements. The elements are arranged in the Document Object Model (DOM). Understanding the DOM is critical to getting the most out of your web crawler. A web crawler searches through all of the HTML elements on a page to find information, so knowing how they’re arranged is important.

Google Chrome has tools that help you find HTML elements faster. You can locate the HTML for any element you see on the web page using the inspector.

  • Navigate to a page in Chrome
  • Place the mouse on the element you would like to view
  • Right-click and select Inspect from the menu

These steps will open the developer console with the Elements tab selected. At the bottom of the console, you will see a tree of elements. This tree is how you will get information for your script.
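
For example, inspecting the title of the Wikipedia article from earlier highlights markup roughly like this (simplified here for illustration):

<h1 id="firstHeading" class="firstHeading">Battery (electricity)</h1>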

Extracting the Title

Let’s get the script to do some work for us: a simple crawl to get the title text of the web page.

Start the script by adding some code to the parse() method that extracts the title.

...
    def parse(self, response):
        print(response.css('h1#firstHeading::text').extract())
...

The response argument supports a method called css() that selects elements from the page using the CSS selector you provide.

In this example, the selector is h1#firstHeading. Adding ::text to the selector is what gives you the text content of the element. Finally, the extract() method returns the selected elements as a list of strings.

Running this script in Scrapy prints the title in text form.

['Battery (electricity)']
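
If you want to experiment with selectors before editing the script, Scrapy ships with an interactive shell that fetches a page and lets you query the response object directly (quote the URL so your shell doesn’t interpret the parentheses):

scrapy shell 'https://en.wikipedia.org/wiki/Battery_(electricity)'
# then, at the Python prompt it opens:
>>> response.css('h1#firstHeading::text').extract()
['Battery (electricity)']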

Finding the Description

Now that we’ve scraped the title text, let’s do more with the script. The crawler is going to find the first paragraph after the title and extract this information.

Here’s the element tree in the Chrome Developer Console:

div#mw-content-text>div>p

The right arrow (>) indicates a parent-child relationship between the elements.

This selector will return all of the p elements it matches, which includes the entire description. To get the first p element, you can write this code:

response.css('div#mw-content-text>div>p')[0]

Just like with the title, you add the ::text extractor to get the text content of the element.

response.css('div#mw-content-text>div>p')[0].css('::text')

The final expression uses extract() to return the list. You can use the Python string method join() to merge the list once all the crawling is complete.

    def parse(self, response):
        print(''.join(response.css('div#mw-content-text>div>p')[0].css('::text').extract()))

The result is the first paragraph of the text!

An electric battery is a device consisting of one or more electrochemical cells with external connections provided to power electrical devices such as flashlights, smartphones, and electric cars.[1] When a battery is supplying electric power, its positive terminal is
...

Collecting JSON Data

Scrapy can extract information in text form, which is useful. Scrapy also lets you view the data as JavaScript Object Notation (JSON). JSON is a neat way to organize information and is widely used in web development. JSON works nicely with Python as well.

When you need to collect data as JSON, you can use the yield statement built into Scrapy.

Here’s a new version of the script using a yield statement. Instead of getting the first p element in text format, this will grab all of the p elements and organize them in JSON format.

...
    def parse(self, response):
        for e in response.css('div#mw-content-text>div>p'):
            yield {'para': ''.join(e.css('::text').extract()).strip()}
...

You can now run the spider by specifying an output JSON file:

scrapy runspider spider3.py -o joe.json

The output file now contains an entry for each of the p elements:

[
{"para": "An electric battery is a device consisting of one or more electrochemical cells with external connections provided to power electrical devices such as flashlights, smartphones, and electric cars.[1] When a battery is supplying electric power, its positive terminal is the cathode and its negative terminal is the anode.[2] The terminal marked negative is the source of electrons that when connected to an external circuit will flow and deliver energy to an external device. When a battery is connected to an external circuit, electrolytes are able to move as ions within, allowing the chemical reactions to be completed at the separate terminals and so deliver energy to the external circuit. It is the movement of those ions within the battery which allows current to flow out of the battery to perform work.[3] Historically the term \"battery\" specifically referred to a device composed of multiple cells, however the usage has evolved additionally to include devices composed of a single cell.[4]"},
{"para": "Primary (single-use or \"disposable\") batteries are used once and discarded; the electrode materials are irreversibly changed during discharge. Common examples are the alkaline battery used for flashlights and a multitude of portable electronic devices. Secondary (rechargeable) batteries can be discharged and recharged multiple
...
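
Because the output is plain JSON, loading it back into Python for further processing takes only a few lines with the standard library (assuming the joe.json file produced above):

import json

# Read the list of {"para": ...} dictionaries the spider wrote
with open('joe.json') as f:
    paragraphs = json.load(f)

print(paragraphs[0]['para'])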

Scraping Multiple Elements

So far the web crawler has scraped the title and one kind of element from the page. Scrapy can also extract information from different types of elements in one script.

Let’s extract the top IMDb Box Office hits for a weekend. This information is pulled from http://www.imdb.com/chart/boxoffice, a table with a row for each movie and a column for each metric.

The parse() method can extract more than one field from the row. Using the Chrome Developer Tools you can find the elements nested inside the table.

...
    def parse(self, response):
        for e in response.css('div#boxoffice>table>tbody>tr'):
            yield {
                'title': ''.join(e.css('td.titleColumn>a::text').extract()).strip(),
                'weekend': ''.join(e.css('td.ratingColumn')[0].css('::text').extract()).strip(),
                'gross': ''.join(e.css('td.ratingColumn')[1].css('span.secondaryInfo::text').extract()).strip(),
                'weeks': ''.join(e.css('td.weeksColumn::text').extract()).strip(),
                'image': e.css('td.posterColumn img::attr(src)').extract_first(),
            }
...

The image selector specifies that img is a descendant of td.posterColumn. To extract the right attribute, use the expression ::attr(src).
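
Note the use of extract_first() here instead of extract(): it returns the first match as a single string, or None when nothing matches, rather than a list. A sketch of the difference, using the same row element e from the loop above:

# extract() always returns a list of matches (possibly empty)
e.css('td.posterColumn img::attr(src)').extract()

# extract_first() returns the first match, or None if there is none,
# so the spider won't crash on a row without a poster image
e.css('td.posterColumn img::attr(src)').extract_first()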

Running the spider returns JSON:

[
{"gross": "$93.8M", "weeks": "1", "weekend": "$93.8M", "image": "https://images-na.ssl-images-amazon.com/images/M/MV5BYWVhZjZkYTItOGIwYS00NmRkLWJlYjctMWM0ZjFmMDU4ZjEzXkEyXkFqcGdeQXVyMTMxODk2OTU@._V1_UY67_CR0,0,45,67_AL_.jpg", "title": "Justice League"},
{"gross": "$27.5M", "weeks": "1", "weekend": "$27.5M", "image": "https://images-na.ssl-images-amazon.com/images/M/MV5BYjFhOWY0OTgtNDkzMC00YWJkLTk1NGEtYWUxNjhmMmQ5ZjYyXkEyXkFqcGdeQXVyMjMxOTE0ODA@._V1_UX45_CR0,0,45,67_AL_.jpg", "title": "Wonder"},
{"gross": "$247.3M", "weeks": "3", "weekend": "$21.7M", "image": "https://images-na.ssl-images-amazon.com/images/M/MV5BMjMyNDkzMzI1OF5BMl5BanBnXkFtZTgwODcxODg5MjI@._V1_UY67_CR0,0,45,67_AL_.jpg", "title": "Thor: Ragnarok"},
...
]

More Web Scrapers and Bots

Scrapy is a powerful library that can do just about any kind of web crawling you ask of it. When it comes to finding information in HTML elements, combined with the support of Python, it’s hard to beat. Whether you’re building a web crawler or learning the basics of web scraping, the only limit is how much you’re willing to learn.

If you’re looking for more ways to build crawlers or bots, you can try building Twitter, Instagram, and Reddit bots using Python. Python can build some amazing things in web development, so it’s worth going beyond web crawlers when exploring this language.
