
Lab | Web Scraping

Introduction

As you learned in the lesson, web scraping (also called "web harvesting", "web data extraction", or even "web data mining") can be defined as "the construction of an agent to download, parse, and organize data from the web in an automated manner". In other words, instead of a human end user clicking around in a web browser and copy-pasting interesting parts into, say, a spreadsheet, web scraping offloads the task to a computer program, which can execute it much faster and more accurately than a human can.
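The "download, parse, and organize" steps in that definition can be sketched in a few lines of Python. This is a minimal illustration, not the approach the lab requires: it uses only the standard library's html.parser, and the names LinkCollector, extract_links, and SAMPLE_HTML, as well as the example.com URLs, are hypothetical. To keep the sketch runnable offline, an inline HTML string stands in for the download step; in a real scrape that string would come from an HTTP request.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Parse step: collect the text and href of every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []            # organized output: one dict per link
        self._current_href = None  # href of the <a> tag we are inside, if any
        self._text_parts = []      # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        # Only keep text that appears inside a link.
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            text = "".join(self._text_parts).strip()
            self.links.append({"text": text, "href": self._current_href})
            self._current_href = None


def extract_links(html):
    """Organize step: return a list of {'text', 'href'} dicts found in the HTML."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Stand-in for the download step (hypothetical page content and URLs).
SAMPLE_HTML = """
<html><body>
  <h1>Reading list</h1>
  <ul>
    <li><a href="https://example.com/first">First <b>article</b></a></li>
    <li><a href="https://example.com/second">Second article</a></li>
  </ul>
</body></html>
"""

links = extract_links(SAMPLE_HTML)
for link in links:
    print(link["text"], "->", link["href"])
```

The resulting list of dictionaries is the "organized" form: it can be written to a CSV file or loaded into a data frame, which is the kind of output a human would otherwise assemble by hand in a spreadsheet.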

Data scientists often find web scraping a powerful tool to have in their arsenal: many data science projects start with obtaining an appropriate data set, so why not use the information the web provides?

In this lab, you will work through a series of exercises to test your web scraping skills. You will work on your own, but remember that the teaching staff is at your service whenever you encounter problems.

Getting Started

Open the main.ipynb file in the your-code directory. It contains a set of questions to solve. Each exercise is independent of the others, so if you get stuck on one, you can skip to the next. Read each instruction carefully and provide your answer beneath it.

Deliverables

  • main.ipynb with your responses to each of the exercises.

Submission

Upon completion, add your deliverables to git. Then commit your changes and push your branch to the remote.

Resources

Web Scraping Tutorial Dataquest

Web Scraping Tutorial Kdnuggets

HTML Scraping

The Anatomy of a Search Engine

Additional Challenges for the Nerds

If you are way ahead of your classmates and willing to accept some tough web scraping challenges, you will find five bonus questions in main.ipynb.