This repository contains a Python web scraper for parsing NPORT-P filings (fund holdings) from EDGAR, the SEC's filing website. We (Jason Wu, Michael Lewkowicz, and Kevin Zhang, Yale undergrads) forked this repository from CodeWritingCow.
In this fork, we modified the original code to work with the new EDGAR website (as of Dec 5th, 2020). These modifications were very significant: compared side by side with the original repo, our web scraper's code is almost completely different, though we reuse some of the original helper functions written by CodeWritingCow, stored in `helper.py`. In addition, we have made the following modifications:
- Exclusively target and scrape NPORT-P filings
- Collect issuers, total value at the time of filing, number of shares, and other relevant data
- Directly insert this data into a MySQL database instead of a TSV file
Note that the documentation is a mix of the original documentation by Gary Pang (CodeWritingCow) and new documentation we've written.
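To give a feel for what collecting issuers, share counts, and values from an NPORT-P filing involves, here is a minimal sketch. It is not the code in `scraper.py`: it uses the standard-library `xml.etree.ElementTree` instead of lxml or Beautiful Soup to stay dependency-free, and it assumes each holding is an `invstOrSec` element with `name`, `cusip`, `balance`, and `valUSD` children. The sample document, the `parse_holdings` function, and the output keys are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; the element names are assumed to mirror an NPORT-P document.
SAMPLE = """<edgarSubmission xmlns="http://www.sec.gov/edgar/nport">
  <formData>
    <invstOrSecs>
      <invstOrSec>
        <name>Apple Inc</name>
        <cusip>037833100</cusip>
        <balance>100.0</balance>
        <valUSD>15000.50</valUSD>
      </invstOrSec>
      <invstOrSec>
        <name>Microsoft Corp</name>
        <cusip>594918104</cusip>
        <balance>50.0</balance>
        <valUSD>12500.00</valUSD>
      </invstOrSec>
    </invstOrSecs>
  </formData>
</edgarSubmission>"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def to_float(text):
    """Convert a numeric string to float, returning None for missing values."""
    return float(text) if text else None


def parse_holdings(xml_text):
    """Return one dict per holding found in the filing XML."""
    root = ET.fromstring(xml_text)
    holdings = []
    for elem in root.iter():
        if local_name(elem.tag) != "invstOrSec":
            continue
        # Map each direct child's local tag name to its stripped text.
        fields = {local_name(c.tag): (c.text or "").strip() for c in elem}
        holdings.append({
            "issuer": fields.get("name", ""),
            "cusip": fields.get("cusip", ""),
            "shares": to_float(fields.get("balance")),
            "value_usd": to_float(fields.get("valUSD")),
        })
    return holdings


holdings = parse_holdings(SAMPLE)
```

Matching on local tag names avoids hard-coding the XML namespace, which keeps the sketch working even if the namespace URI in a real filing differs from the one assumed here.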
- Make sure you have `pipenv` set up on your machine.
- Edit the contents of `db.py` to match the database you are trying to connect to.
- Run `pipenv install`.
- Run `python scraper.py` within a `pipenv shell` (or `pipenv run python scraper.py`).
- When prompted, enter the 10-digit CIK number of a mutual fund.
- Happy investing! ❤️ 💵 💰
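CIK numbers are often quoted without their leading zeros, so it can help to normalize one to the 10-digit form before entering it at the prompt. The helper below is a small sketch for doing that; `normalize_cik` is a hypothetical name and is not part of this repository, and whether `scraper.py` itself accepts shorter values is not something this README states.

```python
def normalize_cik(raw: str) -> str:
    """Return a CIK as a zero-padded 10-digit string, or raise ValueError."""
    cik = raw.strip()
    # Accept only plain ASCII digits, and at most 10 of them.
    if not (cik.isascii() and cik.isdigit()) or len(cik) > 10:
        raise ValueError(f"not a valid CIK: {raw!r}")
    return cik.zfill(10)


padded = normalize_cik(" 36405 ")
```

The resulting string can then be typed or pasted at the scraper's prompt.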
- Requests, Python library for making HTTP requests
- lxml, Python library for processing XML and HTML
- Beautiful Soup, Python library for scraping information from Web pages
- re, Python module for using regular expressions
- MySQL Python Connector, Python module for connecting to a MySQL database
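As an illustration of inserting scraped holdings into MySQL rather than writing a TSV file, the sketch below shows one way to do a parameterized bulk insert through a DB-API connection. It is an assumption-laden example, not the repository's actual code: the `holdings` table, its columns, and the `insert_holdings` function are hypothetical, and the connection settings would come from your own `db.py`.

```python
from typing import Iterable, Sequence

# Hypothetical table and column names; adjust them to your own schema.
INSERT_SQL = (
    "INSERT INTO holdings (cik, issuer, shares, value_usd) "
    "VALUES (%s, %s, %s, %s)"
)


def insert_holdings(conn, rows: Iterable[Sequence]) -> int:
    """Insert holdings rows through a DB-API connection; return the row count.

    The %s placeholders let the driver escape values, which avoids building
    SQL by string concatenation.
    """
    rows = list(rows)
    cursor = conn.cursor()
    try:
        cursor.executemany(INSERT_SQL, rows)
        conn.commit()
    finally:
        cursor.close()
    return len(rows)
```

With MySQL Connector installed, `conn` could be created with `mysql.connector.connect(host=..., user=..., password=..., database=...)` and passed to `insert_holdings` along with tuples such as `("0000036405", "Apple Inc", 100.0, 15000.5)`.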