sec-web-scraper-13f

CPSC 437 Final Project | Web scraper for pulling SEC 13F filings and inserting them into a MySQL database.


Yale University CPSC 437 Database SEC Python Web Scraper

This repository contains a Python web scraper for parsing NPORT-P filings (fund holdings) from the SEC's EDGAR website. We (Jason Wu, Michael Lewkowicz, and Kevin Zhang, Yale undergraduates) forked this repository from CodeWritingCow.

In this fork, we modified the original code to work with the new EDGAR website (as of December 5th, 2020). These modifications were very significant: if you compare our web scraper with the original repo, the code is almost completely different, though we re-use some of the original helper functions written by CodeWritingCow and stored in helper.py. In addition, we have made the following changes:

  • Exclusively target and scrape NPORT-P filings
  • Collect issuers, total value at the time of filing, number of shares, and other relevant data
  • Directly insert this data into a MySQL database instead of a TSV file (a rough sketch of such an insert follows this list)
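
As a rough illustration of the last point, the snippet below shows how one parsed holding could be written to MySQL with the MySQL Python Connector. The table name, columns (holdings, cik, issuer, value_usd, shares), and connection details are hypothetical placeholders, not the actual schema or credentials used by scraper.py and db.py.

```python
import mysql.connector  # MySQL Python Connector

# Hypothetical connection details; in practice these come from db.py.
conn = mysql.connector.connect(
    host="localhost", user="cpsc437", password="change-me", database="edgar"
)
cursor = conn.cursor()

# Hypothetical table for holdings pulled from NPORT-P filings.
cursor.execute(
    """
    CREATE TABLE IF NOT EXISTS holdings (
        id INT AUTO_INCREMENT PRIMARY KEY,
        cik CHAR(10),
        issuer VARCHAR(255),
        value_usd DECIMAL(18, 2),
        shares DECIMAL(18, 4)
    )
    """
)

# One parsed holding: (fund CIK, issuer, total value at filing, number of shares).
holding = ("0001166559", "Example Issuer Inc", 1234567.89, 1000.0)
cursor.execute(
    "INSERT INTO holdings (cik, issuer, value_usd, shares) VALUES (%s, %s, %s, %s)",
    holding,
)

conn.commit()
cursor.close()
conn.close()
```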

Note also that the documentation is a mix of the original documentation by Gary Pang (CodeWritingCow) and new documentation we have written.

Requirements

Getting Started

  • Make sure you have pipenv set up on your machine.
  • Edit the contents of db.py to match the database you are trying to connect to (a sketch of one possible layout follows this list).
  • Run pipenv install.
  • Run python scraper.py within a pipenv shell (or pipenv run python scraper.py).
  • When prompted, enter the 10-digit CIK number of a mutual fund.
  • Happy investing! ❤️ 💵 💰
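
For the db.py step above, the file just needs to hold your own connection details. Below is a minimal sketch; the variable and function names (DB_CONFIG, get_connection) are illustrative, so match whatever names the actual scraper.py imports.

```python
# db.py - hypothetical layout; adjust names to whatever scraper.py actually imports.
import mysql.connector

DB_CONFIG = {
    "host": "localhost",      # hostname of your MySQL server
    "user": "cpsc437",        # database user
    "password": "change-me",  # database password
    "database": "edgar",      # schema the scraper should write into
}

def get_connection():
    """Open a new connection to the MySQL database used by the scraper."""
    return mysql.connector.connect(**DB_CONFIG)
```

With that in place, pipenv install pulls the dependencies from the Pipfile and pipenv run python scraper.py starts the prompt for the fund's CIK.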

Key Dependencies

  • Requests, Python library for making HTTP requests
  • lxml, Python library for processing XML and HTML
  • Beautiful Soup, Python library for scraping information from web pages
  • re, Python module for using regular expressions
  • MySQL Python Connector, Python module for connecting to a MySQL database
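
To show how these dependencies fit together, here is a condensed, hypothetical sketch of the pipeline: Requests fetches a filing document, Beautiful Soup (backed by lxml) parses the NPORT-P XML, and re cleans up numeric fields before the values would be handed to a MySQL insert like the one sketched earlier. The URL is a placeholder, and the tag names (invstOrSec, name, valUSD, balance) follow the public NPORT-P format as we understand it; this is not a description of how scraper.py itself is organized.

```python
import re

import requests
from bs4 import BeautifulSoup  # uses lxml under the hood for XML parsing

# Placeholder URL: in practice the scraper locates a filing's primary_doc.xml
# on EDGAR starting from the CIK entered by the user.
FILING_URL = "https://www.sec.gov/Archives/edgar/data/.../primary_doc.xml"

# EDGAR asks clients to identify themselves with a descriptive User-Agent.
HEADERS = {"User-Agent": "CPSC 437 student project example@example.com"}

response = requests.get(FILING_URL, headers=HEADERS)
soup = BeautifulSoup(response.content, "lxml-xml")  # parse as XML via lxml

holdings = []
for sec in soup.find_all("invstOrSec"):  # one element per portfolio holding (assumed tag name)
    issuer = sec.find("name").get_text(strip=True)
    value_usd = float(sec.find("valUSD").get_text(strip=True))
    # Strip anything that is not a digit, sign, or decimal point before converting.
    shares_text = re.sub(r"[^0-9.\-]", "", sec.find("balance").get_text(strip=True))
    shares = float(shares_text) if shares_text else 0.0
    holdings.append((issuer, value_usd, shares))

print(f"Parsed {len(holdings)} holdings")
```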

Contributor

References