You've definitely begun to hone your scraping skills! With that, let's look at another data format you're apt to want to pull from the web: images! In this lesson, you'll see how to save images from the web and display them in a Pandas DataFrame for easy perusal!
You will be able to:
- Select specific elements from HTML using Beautiful Soup
- Identify and scrape images from a web page
Start with the same page that you've been working with: books.toscrape.com.
from bs4 import BeautifulSoup
import requests
html_page = requests.get('http://books.toscrape.com/') # Make a get request to retrieve the page
soup = BeautifulSoup(html_page.content, 'html.parser') # Pass the page contents to Beautiful Soup for parsing
warning = soup.find('div', class_="alert alert-warning") # Locate the warning banner as a landmark
book_container = warning.next_sibling.next_sibling # Skip the whitespace text node to reach the container holding the books
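If the sibling-hopping feels brittle, a CSS selector is another way to reach the same content. The sketch below is a hypothetical alternative; it assumes the book listing is still rendered as an ol with class row, as it is at the time of writing:

# Alternative (not required for this lesson): select the listing with CSS selectors
book_list = soup.select_one('ol.row')            # The <ol class="row"> holding the product pods
thumbnails = book_list.select('img.thumbnail')   # All thumbnail images inside it
len(thumbnails)  # Should be 20, one per book on the page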
First, retrieve a list of images by searching for `img` tags with Beautiful Soup:
images = book_container.find_all('img') # Gather all of the image tags within the book container
ex_img = images[0] # Preview an entry
ex_img
<img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/>
# Use tab complete to preview what types of methods are available for the entry
# ex_img.
# While there are plenty of other attributes and methods to explore, simply select the URL for the image for now.
ex_img.attrs['src']
'media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg'
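Note that this is a relative path. The hard-coded string concatenation used below works fine for this site, but if you want a more general way to build the absolute URL, the standard library's urljoin resolves a relative path against a base URL. A quick sketch:

from urllib.parse import urljoin

full_url = urljoin('http://books.toscrape.com/', ex_img.attrs['src']) # Resolve the relative src against the site root
full_url
'http://books.toscrape.com/media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg'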
Great! Now that you have a URL (well, a URL extension to be more precise) you can download the image locally!
import shutil
url_base = "http://books.toscrape.com/"
url_ext = ex_img.attrs['src']
full_url = url_base + url_ext
r = requests.get(full_url, stream=True) # Stream the response so the raw bytes can be copied to disk
if r.status_code == 200:
    with open("images/book1.jpg", 'wb') as f:
        r.raw.decode_content = True # Ensure the raw stream is decompressed
        shutil.copyfileobj(r.raw, f) # Copy the stream straight into the file
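For a single small file like this, a non-streaming variant is equally workable. A minimal sketch, which simply writes the response bytes in one go:

# Alternative download without streaming
r = requests.get(full_url)
if r.status_code == 200:
    with open("images/book1.jpg", 'wb') as f:
        f.write(r.content) # Write the full response body at once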
You can also run a simple bash command in a standalone cell to verify that the image is indeed there:
ls images/
book-section.png book14.jpg book2.jpg book7.jpg
book1.jpg book15.jpg book20.jpg book8.jpg
book10.jpg book16.jpg book3.jpg book9.jpg
book11.jpg book17.jpg book4.jpg
book12.jpg book18.jpg book5.jpg
book13.jpg book19.jpg book6.jpg
You can then load and display the saved image with Matplotlib:

import matplotlib.pyplot as plt
import matplotlib.image as mpimg

img = mpimg.imread('images/book1.jpg') # Load the image file into an array
imgplot = plt.imshow(img) # Render the array as an image
plt.show()
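If you just want a quick peek without Matplotlib axes, IPython can also render the saved file directly in a notebook cell. A small sketch, assuming you're running in Jupyter:

from IPython.display import Image

Image(filename='images/book1.jpg') # Displays the saved image inline in the notebook output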
You can even display images within a pandas DataFrame by using a little HTML yourself!
import pandas as pd
from IPython.display import Image, HTML
row1 = [ex_img.attrs['alt'], '<img src="images/book1.jpg"/>']
df = pd.DataFrame(row1).transpose()
df.columns = ['title', 'cover']
HTML(df.to_html(escape=False)) # escape=False keeps the <img> tag as live HTML so the cover renders
|   | title | cover |
|---|---|---|
| 0 | A Light in the Attic | *(book cover image renders here)* |
Now put it all together: loop over every image tag, download each cover, and collect a row of title and HTML image tag for the DataFrame.

data = []
for n, img in enumerate(images):
    url_base = "http://books.toscrape.com/"
    url_ext = img.attrs['src']
    full_url = url_base + url_ext
    r = requests.get(full_url, stream=True)
    path = "images/book{}.jpg".format(n+1) # Local path to save this cover to
    title = img.attrs['alt'] # The alt text doubles as the book title
    if r.status_code == 200:
        with open(path, 'wb') as f:
            r.raw.decode_content = True
            shutil.copyfileobj(r.raw, f)
    row = [title, '<img src="{}"/>'.format(path)]
    data.append(row)
df = pd.DataFrame(data)
print('Number of rows: ', len(df))
df.columns = ['title', 'cover']
HTML(df.to_html(escape=False))
Number of rows: 20
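As an optional extra step beyond this walkthrough, you could also write the same table out to a standalone HTML file so the covers are viewable in a browser, assuming the images/ folder sits alongside the file:

html_table = df.to_html(escape=False, index=False) # Render the table, keeping the <img> tags as live HTML
with open('book_covers.html', 'w') as f:
    f.write(html_table)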
Voila! You now know how to use your knowledge of HTML and Beautiful Soup to scrape images. You really are turning into a scraping champion! Now, go get scraping!