How to Scrape Images From the Web in Python

A Python image scraper isn't just a tool for sharpening your programming skills. You can also use it to source images for a machine learning project, or generate site thumbnails. While there may be other ways to do similar things, nothing can beat the control you have using tools you build yourself.

Learn how to scrape images from any website using Python and the BeautifulSoup library.

Is Image Scraping Legal?

Like more generalized web scraping, image scraping is a method for downloading website content. Scraping publicly available content is generally not illegal in itself, but there are some rules and best practices you should follow. First, you should avoid scraping a website if it explicitly states that it does not want you to. You can find this out by checking the /robots.txt file on the target site.
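As a rough illustration of that robots.txt check, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt content, the "my-image-scraper" user agent name, and the example.com URLs are all made up for the example, and is_allowed is a hypothetical helper, not part of any library.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of a live download.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The first URL is outside the disallowed folder, the second is inside it.
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-image-scraper", "https://example.com/gallery/cat.png"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-image-scraper", "https://example.com/private/dog.png"))
```

For a live site, the same parser can download the file itself if you call its set_url() method with the robots.txt address and then read().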

Most websites allow web crawling because they want search engines to index their content. You can scrape such websites since their images are publicly available.

However, just because you can download an image, that doesn't mean you can use it as if it were your own. Most websites license their images to prevent you from republishing them or reusing them in other ways. Always assume that you cannot reuse images unless there is a specific exemption.

Python Package Set Up

You'll need to install a few packages before you begin. If you don't have Python installed on your computer, visit the official python.org website to download and install the latest version.

Next, open your terminal to your project folder and activate a Python virtual environment to isolate your dependencies.

Finally, install the requests and BeautifulSoup packages using pip:

        pip install beautifulsoup4 requests

Image Scraping With Python

For this image scraping tutorial, you'll use the requests library to fetch a web page containing the target images. You'll then pass the response from that website into BeautifulSoup to grab all image link addresses from img tags. Finally, you'll download each image and write it to a folder.

How to Fetch Image URLs With Python's BeautifulSoup

Now go ahead and create a Python file in your project root folder. Ensure that you append the .py extension to the filename.

Each code snippet in this tutorial continues from the previous one.

Open the Python file with any good code editor and use the following code to request a web page:

        import requests

        URL = "imagesiteURL" # Replace this with the website's URL
        getURL = requests.get(URL, headers={"User-Agent":"Mozilla/5.0"})
        print(getURL.status_code)

If the above program outputs a 200 response code, the request was successful. Otherwise, you might want to ensure that your network connection is stable. Also, ensure that you've supplied a valid URL.

Now use BeautifulSoup to read the content of the web page with the aid of Python's built-in html.parser:

        from bs4 import BeautifulSoup

        soup = BeautifulSoup(getURL.text, 'html.parser')

        images = soup.find_all('img')
        print(images)

This code creates a list of objects, each representing an image from the web page. However, what you need from this data is the text of each image's src attribute.

To extract the source from each img tag:

        imageSources = []

        for image in images:
            imageSources.append(image.get('src'))

        print(imageSources)

Rerun your code, and the image addresses should now appear in a new list (imageSources). You've successfully extracted each image source from the target web page.
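Some img tags have no src attribute at all, in which case image.get('src') returns None, and others embed the picture inline as a data: URI rather than linking to a file. The sketch below shows one possible way to filter both out. The HTML sample is made up, and extract_image_sources is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page.
SAMPLE_HTML = """
<img src="/static/cat.png">
<img alt="placeholder without a source">
<img src="data:image/png;base64,AAAA">
<img src="https://example.com/dog.jpg">
"""


def extract_image_sources(html):
    """Collect src values, skipping missing ones and inline data: URIs."""
    soup = BeautifulSoup(html, "html.parser")
    sources = []
    for image in soup.find_all("img"):
        src = image.get("src")
        if src and not src.startswith("data:"):
            sources.append(src)
    return sources


print(extract_image_sources(SAMPLE_HTML))
```

Only the first and last tags in the sample survive the filter, so the list holds two addresses.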

How to Save the Images With Python

First, create a download destination folder in your project root directory and name it images.
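If you'd rather let the script create the folder, a small sketch like this one should do it, using the standard pathlib module. The ensure_folder name is just an illustration.

```python
from pathlib import Path


def ensure_folder(name="images"):
    """Create the folder if it is missing and return its path."""
    folder = Path(name)
    # exist_ok=True means an already existing folder is not an error
    folder.mkdir(parents=True, exist_ok=True)
    return folder


print(ensure_folder("images"))
```

Running it more than once is harmless, since an existing folder is left untouched.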

For Python to successfully download the images, their paths need to be full absolute URLs. In other words, they need to include the "http://" or "https://" prefix, plus the full domain of the website. If the web page references its images using relative URLs, you'll need to convert them into absolute URLs.

In the easy case, when the URL is absolute, initiating the download is just a case of requesting each image from the earlier extracted sources:

        for image in imageSources:
            # Download the image and write its raw bytes into the images folder
            webs = requests.get(image)
            with open('images/' + image.split('/')[-1], 'wb') as file:
                file.write(webs.content)

The image.split('/')[-1] expression splits the image URL at every forward slash (/) and takes the last element, which is the image file name (including any extension).
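Splitting on the forward slash works for plain addresses, but an image URL that carries a query string (such as ?width=300) would end up with that query in the file name. One possible alternative, sketched here with the standard urllib.parse module, is to take the name from the path part of the URL only. The filename_from_url helper and the example.com addresses are hypothetical.

```python
import posixpath
from urllib.parse import unquote, urlparse


def filename_from_url(url, default="image"):
    """Return the last path segment of url, ignoring query string and fragment."""
    path = urlparse(url).path
    name = posixpath.basename(unquote(path))
    # Fall back to a default when the path ends in a slash
    return name or default


print(filename_from_url("https://example.com/img/cat.png?width=300"))
print(filename_from_url("https://example.com/gallery/"))
```

The first call yields cat.png without the query string, and the second falls back to the default name because the path has no final segment.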

Bear in mind that, in rare cases, image filenames might clash, resulting in download overwrites. Feel free to explore solutions to this problem as an extension to this example.
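As one possible starting point for that extension, the sketch below appends a counter to the file name until it finds a path that doesn't exist yet. The unique_path helper is hypothetical, and numbering is only one of several reasonable strategies (hashing the URL is another).

```python
from pathlib import Path


def unique_path(folder, filename):
    """Return a path in folder that does not exist yet, adding -1, -2, ... if needed."""
    candidate = Path(folder) / filename
    stem, suffix = candidate.stem, candidate.suffix
    counter = 1
    while candidate.exists():
        candidate = Path(folder) / f"{stem}-{counter}{suffix}"
        counter += 1
    return candidate


print(unique_path("images", "cat.png"))
```

You could then open the returned path for writing instead of building the path from the URL directly.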

Resolving relative URLs can get quite complicated, with lots of edge cases to cover. Fortunately, there's a useful function called urljoin, which the requests.compat module re-exports from Python's standard urllib.parse module. This function returns a full URL, given a base URL and a URL which may be relative. It allows you to resolve values you'll find in href and src attributes.
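To get a feel for how urljoin behaves, here is a short sketch that resolves a few typical src values against a page address. It imports urljoin straight from urllib.parse, which is where requests.compat gets it from. The page URL and the sample sources are invented for illustration.

```python
from urllib.parse import urljoin

PAGE_URL = "https://example.com/gallery/page.html"  # hypothetical page address

# Hypothetical src values of the kinds you may find in img tags
SAMPLE_SOURCES = [
    "img/cat.png",                      # relative to the page's folder
    "/static/dog.jpg",                  # relative to the site root
    "../up.png",                        # one folder up
    "//cdn.example.com/a.png",          # protocol-relative
    "https://other.example.org/b.gif",  # already absolute
]

resolved = [urljoin(PAGE_URL, src) for src in SAMPLE_SOURCES]

for src, url in zip(SAMPLE_SOURCES, resolved):
    print(f"{src!r} -> {url}")
```

Every result is a full address that requests.get can fetch, and the already absolute URL passes through unchanged.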

The final code looks like this:

        import requests
        from bs4 import BeautifulSoup

        URL = "imagesiteURL" # Replace this with the website's URL
        getURL = requests.get(URL, headers={"User-Agent":"Mozilla/5.0"})
        soup = BeautifulSoup(getURL.text, 'html.parser')

        images = soup.find_all('img')
        resolvedURLs = []

        for image in images:
            src = image.get('src')
            # Skip img tags that have no src attribute
            if src:
                # Resolve relative paths against the page URL
                resolvedURLs.append(requests.compat.urljoin(URL, src))

        for image in resolvedURLs:
            # Download each image and write its raw bytes into the images folder
            webs = requests.get(image)
            with open('images/' + image.split('/')[-1], 'wb') as file:
                file.write(webs.content)

Never Go Short of Image Data

Many image recognition projects hit a brick wall because they have too few images to train a model. But you can always scrape images from websites to boost your data repository. And thankfully, Python lets you build a powerful image scraper that you can use continuously without the fear of getting priced out.

If you're interested in fetching other types of data from the web, you might want to find out how to use Python for general web scraping.