Scrape all the tables from a Wikipedia article into a folder of CSV files (only tested on Supreme Court articles).
- write a test scraper to see if possible
- scrape all the term opinions
- limit the justices' votes
- scrape individual cases
- scrape page information from each case (this will vary)
- convert this repo into a Jupyter Notebook
This will be a Python 3.7 module that depends on the Beautiful Soup and requests packages.
- Clone and `cd` into this repo.
- Install Python 3.7.
- Install requirements from pip with `pip install -r requirements.txt`.
- If on Windows, download the `.whl` for the `lxml` parser and install it locally.
Just import the module and call the `scrape` function. Pass it the full URL of a Wikipedia article and a simple string (no special characters or file extensions) for the output name. All output is written to the `output_name` folder, with files named `output_name.csv`, `output_name_1.csv`, etc.
import test_scrape
test_scrape.scrape(
url="https://en.wikipedia.org/wiki/2000_term_opinions_of_the_Supreme_Court_of_the_United_States",
output_name="2000_term"
)
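To make the table-to-CSV step concrete, here is a minimal sketch of how tables in a page's HTML could be collected and written out with the naming scheme described above. It is not this module's implementation: `TableCollector` and `write_tables` are hypothetical names, the standard library's `html.parser` stands in for Beautiful Soup so the sketch runs without extra installs, and it takes an HTML string rather than fetching a URL. It also assumes explicit closing tags and no nested tables.

```python
import csv
from html.parser import HTMLParser
from pathlib import Path


class TableCollector(HTMLParser):
    """Collect every <table> as a list of rows; each row is a list of cell strings.

    Hypothetical helper: assumes explicit </td>, </th>, </tr> tags and no nested tables.
    """

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per <table> seen so far
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def write_tables(html, output_name):
    """Write each table in `html` to <output_name>/<output_name>[_N].csv."""
    parser = TableCollector()
    parser.feed(html)
    out_dir = Path(output_name)
    out_dir.mkdir(exist_ok=True)
    paths = []
    for i, rows in enumerate(parser.tables):
        # First table gets no suffix; later ones get _1, _2, ...
        suffix = "" if i == 0 else f"_{i}"
        path = out_dir / f"{output_name}{suffix}.csv"
        with path.open("w", newline="", encoding="utf-8") as f:
            csv.writer(f, quoting=csv.QUOTE_ALL).writerows(rows)
        paths.append(path)
    return paths
```

In a version that follows the stated dependencies, the HTML would come from `requests.get(url).text` and the table walking would be done with Beautiful Soup; the file naming logic would stay the same.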
Inspecting the output with Bash gives the following results:
$ ls 2000_term/
2000_term.csv 2000_term1.csv
$ cat 2000_term/2000_term.csv
"#","Case name and citation","Argued","Decided","Rehnquist","Stevens","O'Connor","Scalia","Kennedy","Souter","Thomas","Ginsburg","Breyer"
"1","Artuz v. Bennett, 531 U.S. 4","October 10, 2000","November 7, 2000","","","","","","","","","","","","","","","","","",""
"","","","","","","","","","","","",""
"","","","","","","","","","","","",""
"","","","","","","","","","","","",""
"","","","","","","","","","","","",""
"","","","","","","","","","","","",""
"","","","","","","","","","","","",""
"","","","","","","","","","","","",""
"","","","","","","","","","","","",""
"","","","","","","","","","","","",""
"2","Cleveland v. United States, 531 U.S. 12","October 10, 2000","November 7, 2000","","","","","","","","","","","","","","","","","",""
The output could be cleaner. The scraper could easily be generalized, but it is intentionally utilitarian.
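As one possible cleanup pass for output like the sample above, the sketch below drops rows that contain only empty cells and trims trailing empty cells that run past the header width. `clean_rows` and `clean_csv` are hypothetical helpers, not part of this module, and the sketch assumes the first row of each file is the header.

```python
import csv


def clean_rows(rows):
    """Drop all-empty rows and trim surplus trailing empty cells.

    Assumes the first row is the header and defines the expected width.
    """
    rows = [list(row) for row in rows]
    if not rows:
        return []
    width = len(rows[0])
    cleaned = []
    for row in rows:
        if not any(cell.strip() for cell in row):
            continue  # skip rows made up entirely of empty cells
        # Remove only empty cells beyond the header width, so no data is lost.
        while len(row) > width and row[-1].strip() == "":
            row.pop()
        # Pad short rows so every row has at least the header's width.
        row += [""] * (width - len(row))
        cleaned.append(row)
    return cleaned


def clean_csv(src, dst):
    """Read the CSV at `src`, clean it, and write the result to `dst`."""
    with open(src, newline="", encoding="utf-8") as f:
        rows = clean_rows(csv.reader(f))
    with open(dst, "w", newline="", encoding="utf-8") as f:
        csv.writer(f, quoting=csv.QUOTE_ALL).writerows(rows)
```

For example, `clean_csv("2000_term/2000_term.csv", "2000_term/2000_term_clean.csv")` would write a trimmed copy next to the original; the output file name here is only an illustration.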