Tutorial: Web Scraping with Python

This repo contains the materials for my tutorial at PyData Seattle 2017.
The presentation slides can be found here.

Environment and Installs

This tutorial requires an up-to-date version of Python and Jupyter Notebook. I recommend Anaconda, a free and open-source Python distribution. You can download Anaconda from here.

The notebooks in this tutorial are tested on Python 3, but should be mostly compatible with Python 2.x as well.

If you use Selenium, download the driver for your browser.

Table of Contents

Chapter 0: Setup Instructions
Chapter 1: HTTP, GET requests, HTML
Here, let's take a quick look at HTML.
Chapter 2: Study the site, and get a list of movies
Before we scrape the top movies, we have to find a list of them. We can do that in one request. This is really easy as long as the page follows some consistent patterns.
Chapter 3: Let's get details about one movie
I want to see how much money the top 200 movies made.
Chapter 4: Let's put all the code together, add loop magic
Now we have all 200 movies. Hooray!
Chapter 5: Scraping best practices
Extra Credit: Scraping the unscrapable - Selenium
I wonder if top-grossing movies also have the best reviews (it sounds obvious, but is it really?). Let's do some interactive web scraping with Selenium.
Lab: HackerNews Trends Project
Let's work on a scraping project that extracts HackerNews trends, to practice what we learned.
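
The workflow in Chapters 1 to 4 (get a page, find a consistent pattern in its HTML, collect the list, then loop over it) can be sketched roughly as follows. This is a minimal sketch and not the code from the notebooks: it uses only the standard library's `html.parser` so it runs without extra installs, and the sample HTML, the `movie` class name, and the `MovieLinkParser` and `list_movies` helpers are all made up for illustration. In a real run the HTML would come from a GET request instead of a string.

```python
from html.parser import HTMLParser

# Hypothetical page: every movie is an <a class="movie"> link (assumed pattern).
SAMPLE_HTML = """
<html><body>
  <table>
    <tr><td><a class="movie" href="/movies/1">First Movie</a></td></tr>
    <tr><td><a class="movie" href="/movies/2">Second Movie</a></td></tr>
    <tr><td><a href="/about">About this site</a></td></tr>
  </table>
</body></html>
"""


class MovieLinkParser(HTMLParser):
    """Collect (title, href) pairs from <a class="movie"> tags."""

    def __init__(self):
        super().__init__()
        self.movies = []
        self._href = None  # href of the movie link we are inside, if any
        self._text = []    # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "movie":
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.movies.append(("".join(self._text).strip(), self._href))
            self._href = None


def list_movies(html):
    """Return every (title, href) pair that matches the assumed pattern."""
    parser = MovieLinkParser()
    parser.feed(html)
    return parser.movies


# The "loop magic": visit each movie found on the list page.
movies = list_movies(SAMPLE_HTML)
for title, href in movies:
    print(title, href)
```

Links that do not match the pattern, such as the "About this site" link, are skipped. That is why a consistent pattern on the page makes the list step easy.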

Additional resources

Introduction to HTML
Discover Devtools: a free interactive intro to using Chrome's developer tools
httpbin lets you easily test a lot of HTTP functionality. (And it's written in Python; check it out!)
Web Scraping with Beautiful Soup from some Stanford class
4 Best Practices of Web Scraping: not bad advice
The BeautifulSoup documentation is very good. There are all sorts of methods and they're all described here!
Newspaper, a Python library that might be helpful when you want to extract an article from a web site
Here is an introduction to regular expressions
Web Scraping 101 with Python by Greg Reda (beautifulsoup)
Scrapy
An open source and collaborative framework for extracting the data you need from websites, in a fast, simple, yet extensible way.
Scrapy Tutorial
The Selenium docs for Locating Elements
You can use this XPath selector tutorial when you need to construct an XPath selector. You can also check out the w3schools XPath syntax guide.
The 30 CSS Selectors you Must Memorize
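
To make the selector resources above a little more concrete, here is a small sketch of an XPath query. It uses the limited XPath subset built into the standard library's `xml.etree.ElementTree`, not Selenium or Scrapy, and it only works on well-formed markup. The markup, class names, and URLs are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup (real pages are often messier than this).
SAMPLE = """
<div>
  <ul id="stories">
    <li class="story"><a href="/item/1">First story</a></li>
    <li class="story"><a href="/item/2">Second story</a></li>
    <li class="ad"><a href="/sponsor">Sponsored</a></li>
  </ul>
</div>
"""

root = ET.fromstring(SAMPLE)

# XPath: every <a> that is a direct child of an <li> whose class is "story".
links = root.findall(".//li[@class='story']/a")
stories = [(a.text, a.get("href")) for a in links]
print(stories)
```

The roughly equivalent CSS selector would be `li.story > a`. Note that the `@class='story'` test matches the whole attribute value, so it would miss an element whose class attribute lists several classes.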

Contribute!

If you find a mistake, or would like to share something that wasn't covered, create an issue and send a pull request. Thank you.
