
Lab | Web Scraping Multiple Pages

Business goal:

  • Check the case_study_gnod.md file.

  • Make sure you've understood the big picture of your project:

    • the goal of the company (Gnod),
    • their current product (Gnoosic),
    • their strategy, and
    • how your project fits into this context.

    Re-read the business case and the e-mail from the CTO, take a look at the flowchart and create an initial Trello board with the tasks you think you'll have to accomplish.

Instructions

Prioritize the MVP

In the previous lab, you had to scrape data about "hot songs". It's critical to be on track with that part, as it was part of the request from the CTO.

If you couldn't finish the first lab, use this time to go back there.

Expand the project

If you're done, you can try to expand the project on your own. Here are a few suggestions:

  • Find other lists of hot songs on the internet and scrape them too: having a bigger pool of songs will be awesome!
  • Apply the same logic to other "groups" of songs: the best songs from a decade or from a country / culture / language / genre.
  • Wikipedia maintains a large collection of lists of songs: https://en.wikipedia.org/wiki/Lists_of_songs
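
Scraping several lists follows the same pattern each time: fetch a page, parse it, and merge the results. The sketch below is one possible way to organize that loop, not a prescribed solution. The names `fetch_html`, `scrape_pages`, and `parse_page`, and the `gnod-lab-scraper` User-Agent string, are all hypothetical. It uses only the standard library so that it runs without extra installs; libraries such as requests and BeautifulSoup are common alternatives. The demonstration at the end runs offline on made-up pages with a toy parser.

```python
import time
import urllib.request


def fetch_html(url):
    """Download a page and return its HTML as text."""
    # Some sites reject requests that carry no User-Agent header.
    request = urllib.request.Request(url, headers={"User-Agent": "gnod-lab-scraper"})
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape_pages(urls, parse_page, fetch=fetch_html, delay_seconds=1.0):
    """Fetch each URL, parse it, and merge the songs without duplicates.

    parse_page is a function you write for each site: it takes the page
    text and returns a list of song names.
    """
    songs = []
    seen = set()
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # pause between requests to be polite
        for song in parse_page(fetch(url)):
            key = song.casefold()  # compare names case-insensitively
            if key not in seen:
                seen.add(key)
                songs.append(song)
    return songs


# Offline demonstration: made-up "pages" and a toy parser (one song per line).
fake_pages = {
    "page-1": "Song A\nSong B",
    "page-2": "song b\nSong C",
}
result = scrape_pages(
    fake_pages,                # iterating a dict yields its keys
    parse_page=str.splitlines,
    fetch=fake_pages.get,      # look the "page" up instead of downloading it
    delay_seconds=0,
)
print(result)  # ['Song A', 'Song B', 'Song C']
```

To scrape real pages, keep the default `fetch` and pass a `parse_page` function written for the structure of each site. Passing the fetcher as a parameter also lets you test the merging logic without touching the network.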

Practice web scraping

As you've seen, scraping the internet is a skill that can get you all sorts of information. Here are some little challenges that you can try to gain more experience in the field:

  • Retrieve the Wikipedia page for "Python" and create a list of the links on that page: url = 'https://en.wikipedia.org/wiki/Python'
  • Find the number of titles that have changed in the United States Code since its last release point: url = 'http://uscode.house.gov/download/download.shtml'
  • Create a Python list with the names of the FBI's top ten Most Wanted: url = 'https://www.fbi.gov/wanted/topten'
  • Display information on the 20 latest earthquakes (date, time, latitude, longitude and region name) reported by the EMSC as a pandas DataFrame: url = 'https://www.emsc-csem.org/Earthquake/'
  • List all language names and their number of related articles in the order they appear on wikipedia.org: url = 'https://www.wikipedia.org/'
  • Create a list of the different kinds of datasets available on data.gov.uk: url = 'https://data.gov.uk/'
  • Display the top 10 languages by number of native speakers as a pandas DataFrame: url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'
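
As a starting point for the first challenge, here is a minimal sketch of link extraction. It assumes you already have the page's HTML as a string; the `LinkCollector` class, the `extract_links` function, and the sample HTML are made up for illustration. It uses the standard library's `html.parser` so that it runs offline, though BeautifulSoup would be an equally reasonable choice.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip anchors without a target and in-page fragments such as "#top".
        if href and not href.startswith("#"):
            # Turn relative links like "/wiki/..." into absolute URLs.
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return all link targets found in the given HTML, in page order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Made-up sample standing in for the downloaded page.
sample_html = """
<p><a href="/wiki/Monty_Python">Monty Python</a>
<a href="#top">Back to top</a>
<a href="https://www.python.org/">python.org</a></p>
"""
url = 'https://en.wikipedia.org/wiki/Python'
links = extract_links(sample_html, url)
print(links)
# ['https://en.wikipedia.org/wiki/Monty_Python', 'https://www.python.org/']
```

For the real challenge, replace `sample_html` with the HTML downloaded from `url`. Whether to keep or drop fragment links and duplicates is a design choice; this sketch drops fragments and keeps duplicates.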