A repo for Columbia Journalism Investigations workshop
Welcome to a short tutorial on webscraping with Python presented for CJI's fellows in December 2021. This tutorial aims to provide beginner familiarity with web scraping, with basic Python syntax and some common packages.
Python is a versatile language that can save you time and help you tell better stories whether by automating tasks, wrangling large datasets or simply putting your analysis all in one reproducable place. It is also a great tool to gather data to use for stories.
A lot of revelatory data doesn't come in a nice, clean format. Many times we have to compile it ourselves. With web scraping, we can do that and more as you'll see.
We'll be using:
- A terminal window
- The Python interpreter
- A Python package installer
- A Jupyter notebook
First things first, we're going to do some setup on our computers before we can start writing Python. We will be using Python 3 (preferably 3.6 or higher).
- Install Python, if not already
- Install
pip3
- Install Jupyter notebooks
Clone this repo and then run the code below to install the required packages.
pip3 install -r requirements.txt
You can also install git
following these instructions and then run the command above. Hint on a Mac, typing git --version
will likely prompt an install wizard if you have the XCode tools installed.
On a Mac, open a Terminal window and type python3
. (You can find the Terminal by using the spotlight and searching for "Terminal.") If you see something like this:
Python 3.6.8 (v3.6.8:3c6b436a57, Dec 24 2018, 02:04:31)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
You're in business.
On a Windows machine, if you installed the Microsoft store version, open a command prompt and type python3
or you can find it in the Start menu.
If Windows, I would suggest using the Microsoft store version if you're looking to learn. You can also do the full install using the Windows installer instructions. Note: Make sure to check the box during the install process to add Python to your path.
Short for "Pip Installs Packages," pip is just what it sounds like. It's a package manager that installs python packages. We're going to use it for installing a series of Python packages that we'll use in our tutorial and our scraper.
On Windows, if you installed from the Microsoft store, it should be included.
On Mac, to install pip, open the Terminal. Then copy and paste the commands below into your terminal window. We'll be installing just to our user profile, instead of the machine itself, so we won't run into problems with not being the administrator on the computers.
curl https://bootstrap.pypa.io/get-pip.py -o ~/Downloads/get-pip.py
Once the file has downloaded and if it is in your Downloads folder, enter this:
python3 ~/Downloads/get-pip.py
After install, type pip3
into your terminal and press Enter. A help menu should pop up. If not, make sure it's added to your path, try echo $PATH
and you should see it added to your PATH.
If you're still having trouble on Windows, this walkthrough should help.
Now that we have pip3
installed, we can use it to install another tool we're going to use: Juypter notebooks.
Jupyter notebooks are incredibly powerful, open source environment for Python scripting. We'll be using them to do some tutorials and data work but I use them every day in my job for all kinds of data work.
requests is a package that allows us to make HTTP requests (i.e. follow urls and return their content) in Python. Now that we've install pip, the installation is easy.
BeautifulSoup is an html parser that will allow us to interact with html and pull the data we want out of it.
lxml processes xml and html in Python and is used by BeautifulSoup to parse.
pyOpenSSL is used by requests and you can read more about it here.
pandas is a data science library for Python.
To install, copy and paste each of these lines into your terminal:
pip3 install notebook
pip3 install requests
pip3 install beautifulsoup4
pip3 install -U pyopenssl
pip3 install lxml
pip3 install pandas
Now, type:
jupyter notebook
If you're using a Mac that defaults to zsh
in the terminal, type:
python3 -m notebook
A notebook session should pop up in your browser of choice.
Now we're ready to get started.