sec-web-scraper-13f

CPSC 437 Final Project | Web scraper for pulling SEC 13F filings and inserting them into a MySQL database.


Yale University CPSC 437 Database SEC Python Web Scraper

This repository contains a Python web scraper for parsing NPORT-P filings (fund holdings) from the SEC's EDGAR website. We (Jason Wu, Michael Lewkowicz, and Kevin Zhang, Yale undergraduates) forked this repository from CodeWritingCow.

In this fork, we modified the original code to work with the new EDGAR website (as of December 5th, 2020). These modifications were very significant: if you compare our web scraper with the original repo, the code is almost completely different, though we re-use some of the original helper functions written by CodeWritingCow and stored in helper.py. In addition, we have made the following changes:

  • Exclusively target and scrape NPORT-P filings
  • Collect issuers, total value at the time of filing, number of shares, and other relevant data
  • Directly insert this data into a MySQL database instead of a TSV file (a rough sketch of such an insert follows this list)
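
As a rough illustration of the last point, the snippet below shows how one parsed holding could be written to MySQL with the MySQL Python Connector. The table name, columns (holdings, cik, issuer, value_usd, shares), and connection details are hypothetical placeholders, not the actual schema or credentials used by scraper.py and db.py.

```python
import mysql.connector  # MySQL Python Connector

# Hypothetical connection details; in practice these come from db.py.
conn = mysql.connector.connect(
    host="localhost", user="cpsc437", password="change-me", database="edgar"
)
cursor = conn.cursor()

# Hypothetical table for holdings pulled from NPORT-P filings.
cursor.execute(
    """
    CREATE TABLE IF NOT EXISTS holdings (
        id INT AUTO_INCREMENT PRIMARY KEY,
        cik CHAR(10),
        issuer VARCHAR(255),
        value_usd DECIMAL(18, 2),
        shares DECIMAL(18, 4)
    )
    """
)

# One parsed holding: (fund CIK, issuer, total value at filing, number of shares).
holding = ("0001166559", "Example Issuer Inc", 1234567.89, 1000.0)
cursor.execute(
    "INSERT INTO holdings (cik, issuer, value_usd, shares) VALUES (%s, %s, %s, %s)",
    holding,
)

conn.commit()
cursor.close()
conn.close()
```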

Note also that the documentation is a mix of the original documentation by Gary Pang (CodeWritingCow) and new documentation we have written.

Requirements

Getting Started

  • Make sure you have pipenv set up on your machine.
  • Edit the contents of db.py to match the database you are trying to connect to (a sketch of one possible layout follows this list).
  • Run pipenv install.
  • Run python scraper.py within a pipenv shell (or pipenv run python scraper.py).
  • When prompted, enter the 10-digit CIK number of a mutual fund.
  • Happy investing! ❤️ 💵 💰
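
For the db.py step above, the file just needs to hold your own connection details. Below is a minimal sketch; the variable and function names (DB_CONFIG, get_connection) are illustrative, so match whatever names the actual scraper.py imports.

```python
# db.py - hypothetical layout; adjust names to whatever scraper.py actually imports.
import mysql.connector

DB_CONFIG = {
    "host": "localhost",      # hostname of your MySQL server
    "user": "cpsc437",        # database user
    "password": "change-me",  # database password
    "database": "edgar",      # schema the scraper should write into
}

def get_connection():
    """Open a new connection to the MySQL database used by the scraper."""
    return mysql.connector.connect(**DB_CONFIG)
```

With that in place, pipenv install pulls the dependencies from the Pipfile and pipenv run python scraper.py starts the prompt for the fund's CIK.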

Key Dependencies

  • Requests, Python library for making HTTP requests
  • lxml, Python library for processing XML and HTML
  • Beautiful Soup, Python library for scraping information from web pages
  • re, Python module for using regular expressions
  • MySQL Python Connector, Python module for connecting to a MySQL database
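
To show how these dependencies fit together, here is a condensed, hypothetical sketch of the pipeline: Requests fetches a filing document, Beautiful Soup (backed by lxml) parses the NPORT-P XML, and re cleans up numeric fields before the values would be handed to a MySQL insert like the one sketched earlier. The URL is a placeholder, and the tag names (invstOrSec, name, valUSD, balance) follow the public NPORT-P format as we understand it; this is not a description of how scraper.py itself is organized.

```python
import re

import requests
from bs4 import BeautifulSoup  # uses lxml under the hood for XML parsing

# Placeholder URL: in practice the scraper locates a filing's primary_doc.xml
# on EDGAR starting from the CIK entered by the user.
FILING_URL = "https://www.sec.gov/Archives/edgar/data/.../primary_doc.xml"

# EDGAR asks clients to identify themselves with a descriptive User-Agent.
HEADERS = {"User-Agent": "CPSC 437 student project example@example.com"}

response = requests.get(FILING_URL, headers=HEADERS)
soup = BeautifulSoup(response.content, "lxml-xml")  # parse as XML via lxml

holdings = []
for sec in soup.find_all("invstOrSec"):  # one element per portfolio holding (assumed tag name)
    issuer = sec.find("name").get_text(strip=True)
    value_usd = float(sec.find("valUSD").get_text(strip=True))
    # Strip anything that is not a digit, sign, or decimal point before converting.
    shares_text = re.sub(r"[^0-9.\-]", "", sec.find("balance").get_text(strip=True))
    shares = float(shares_text) if shares_text else 0.0
    holdings.append((issuer, value_usd, shares))

print(f"Parsed {len(holdings)} holdings")
```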

Contributor

References