This repo contains the materials for my tutorial at PyData Seattle 2017.
The presentation slides can be found here
This tutorial requires an up-to-date version of Python and Jupyter Notebook. I recommend Anaconda, a free and open source Python distribution. You can download Anaconda from here
The notebooks in this tutorial are tested on Python 3, but should be mostly compatible with Python 2.x as well.
- Download drivers if you use Selenium
- Chapter 1: HTTP, GET requests, HTML
  Here let's do a quick HTML
- Chapter 2: Study the site, and get a list of movies
  Before we scrape the top movies, we have to find a list of them. We can do that in one request. This is really easy as long as there are some consistent patterns to the page.
- Chapter 3: Let's get details about one movie
  I want to see how much money the top 200 movies made
- Chapter 4: Let's put all the code together, add loop magic
  Now we have all 200 movies. Hooray!
- Extra Credit: Scraping the unscrapable - Selenium
  I wonder if top grossing movies also have the best reviews (it sounds obvious but is it really?). Let's do some interactive web scraping with Selenium.
- Lab: HackerNews Trends Project
  Let's work on a scraping project to extract HackerNews trends to practice what we learnt.
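
To give a feel for the "consistent patterns" idea from Chapters 2 to 4, here is a minimal sketch that pulls a list of movies out of a page and then loops over the rows to clean up the numbers. It is not the code from the notebooks: it uses only the standard library's `html.parser` so it runs without extra installs, and the markup in `LIST_PAGE`, the class names `title` and `gross`, and the helpers `MovieTableParser` and `parse_gross` are all made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical markup standing in for a "top movies" page.
LIST_PAGE = """
<table id="movies">
  <tr><td class="title"><a href="/movie/1">Movie A</a></td><td class="gross">$1,200,000</td></tr>
  <tr><td class="title"><a href="/movie/2">Movie B</a></td><td class="gross">$950,500</td></tr>
</table>
"""


class MovieTableParser(HTMLParser):
    """Collect title, href and gross from rows that follow the pattern above."""

    def __init__(self):
        super().__init__()
        self.movies = []   # finished rows
        self._row = None   # row currently being built, if any
        self._cell = None  # class of the <td> we are inside, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "tr":
            self._row = {}
        elif tag == "td" and self._row is not None:
            self._cell = attrs.get("class")
        elif tag == "a" and self._cell == "title":
            self._row["href"] = attrs.get("href")

    def handle_data(self, data):
        text = data.strip()
        if self._row is not None and self._cell and text:
            # Store the cell text under the cell's class name.
            self._row[self._cell] = text

    def handle_endtag(self, tag):
        if tag == "td":
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.movies.append(self._row)
            self._row = None


def parse_gross(text):
    """Turn a string like '$1,200,000' into an integer number of dollars."""
    return int(text.replace("$", "").replace(",", ""))


parser = MovieTableParser()
parser.feed(LIST_PAGE)

# The "loop magic" step: normalize every row the parser found.
movies = [
    {"title": m["title"], "href": m["href"], "gross": parse_gross(m["gross"])}
    for m in parser.movies
]
for movie in movies:
    print(movie["title"], movie["gross"])
```

On a real site you would fetch the page over HTTP first and would probably reach for BeautifulSoup, which makes the same extraction much shorter. The point here is only that once the rows share a pattern, one parser and one loop cover the whole list.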
- Introduction to HTML
- Discover Devtools: a free interactive intro to using Chrome's developer tools
- httpbin lets you easily test a lot of HTTP functionality. (And it's written in Python; check it out!)
- Web Scraping with Beautiful Soup from some Stanford class
- 4 Best Practices of Web Scraping: not bad advice
- The BeautifulSoup documentation is very good. There are all sorts of methods and they're all described here!
- Newspaper, a Python library that might be helpful when you want to extract an article from a web site
- Here is an introduction to regular expressions
- Web Scraping 101 with Python by Greg Reda (beautifulsoup)
- Scrapy: an open source and collaborative framework for extracting the data you need from websites in a fast, simple, yet extensible way.
- Scrapy Tutorial
- The Selenium docs for Locating Elements
- You can use this XPath selector tutorial when you need to construct an XPath selector. You can also check out the w3schools XPath syntax guide.
- The 30 CSS Selectors you Must Memorize
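
If you have not written an XPath selector before, here is a small sketch of what one looks like in practice. It uses the standard library's `xml.etree.ElementTree`, which supports only a limited subset of XPath and needs well-formed markup, so treat it as a way to try out the syntax rather than a replacement for Selenium's locators. The `SNIPPET` markup and its class names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet; real pages usually need an HTML parser first.
SNIPPET = """
<table>
  <tr><td class="title"><a href="/movie/1">Movie A</a></td></tr>
  <tr><td class="title"><a href="/movie/2">Movie B</a></td></tr>
  <tr><td class="note"><a href="/about">About</a></td></tr>
</table>
""".strip()

root = ET.fromstring(SNIPPET)

# ".//td[@class='title']/a" reads as: any <td> below the root whose class
# attribute is "title", then its direct <a> children.
links = root.findall(".//td[@class='title']/a")

titles = [a.text for a in links]
hrefs = [a.get("href") for a in links]

print(titles)
print(hrefs)
```

The link inside the `note` cell is skipped because the attribute predicate only matches cells with class `title`.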
If you find a mistake, or would like to share something that wasn't covered, create an issue or send a pull request. Thank you.