Included in this repository is a set of tools and specifications for use at MSF's 401K Investor Resource Hackathon.
Please note that these scripts were run on 10 March 2017, and the resulting data is included in CSV, JSON, and MySQL formats in the respective directories within the repository.
For requests, tech questions, general comments, happy feedback, etc., feel free to use the wonderful GitHub tools provided here, or reach out via Twitter to @mr_z_ro!
The data that's presented granularly below has also been collated into a MySQL data dump, which is included in the repository's mysql directory.
The data is also hosted live at a location that can be accessed with credentials that will be announced at the hackathon.
The following sources were referenced in aggregating this data:
- MorningStar: Aggregated list of top 20 funds holding PFE and GSK
- HoldingsChannel: Aggregated lists of all institutions (mislabeled on their site as “funds”) holding PFE and GSK, pulled from SEC’s EDGAR database
- ETFdb: List of ETFs holding PFE and GSK
- MutualFunds.com: List of all funds with corresponding abbreviations
All data provided in this repository at time of writing are for PFE and GSK stocks, which are in files prefixed with their respective tickers.
For use cases requiring fresh data beyond the hackathon, the scripts can be run on demand, after installing the following prerequisites.
The scripts require the BeautifulSoup and Selenium libraries, which can be installed using pip as follows:
pip install bs4
pip install selenium
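To confirm that both packages installed correctly before running any scraper, a quick check like the following may help. It is only a sketch: the helper name `missing_modules` is hypothetical and not part of this repository. It relies on the import names `bs4` and `selenium` that the pip packages above provide.

```python
import importlib.util

# Import names provided by the packages installed above.
REQUIRED_MODULES = ("bs4", "selenium")


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are available.")
```

An empty result means both libraries are importable from the Python interpreter that will run the scripts.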
Next, in order to actually walk through the data, the scripts need a browser that they can drive. PhantomJS, a headless browser, is a great option that can be installed as follows:
Download phantomjs (for silent scraping):
http://phantomjs.org/download.html
[extract]
mv ~/Downloads/phantomjs-2.1.1-macosx/bin/phantomjs /usr/local/bin
Firefox (geckodriver) can also be helpful for debugging, and can be installed as follows:
Download geckodriver (for debugging):
https://github.com/mozilla/geckodriver/releases
[extract]
mv ~/Downloads/geckodriver /usr/local/bin
Note: please ensure that PATH includes the /usr/local/bin directory. An example of how to do this on a Unix-like system (e.g. Mac, Ubuntu, or Windows with Cygwin) can be found here
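One way to verify the PATH setup from Python is sketched below. The function names `locate_drivers` and `path_contains` are hypothetical helpers, not part of this repository. The executable names are taken from the install steps above.

```python
import os
import shutil

# Executable names taken from the install steps above.
DRIVERS = ("phantomjs", "geckodriver")


def locate_drivers(names=DRIVERS):
    """Map each driver name to its full path on PATH, or None if not found."""
    return {name: shutil.which(name) for name in names}


def path_contains(directory, path_value=None):
    """Return True if `directory` is one of the entries in PATH."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    return directory in path_value.split(os.pathsep)


if __name__ == "__main__":
    print("/usr/local/bin on PATH:", path_contains("/usr/local/bin"))
    for name, location in locate_drivers().items():
        print(name, "->", location or "not found")
```

A driver reported as "not found" usually means that either the binary was not moved into /usr/local/bin or that directory is missing from PATH.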
###Using the Tools
####scrape_ms.py
This script pulls data about the top mutual fund holders of a given stock (parameterized by TICKER) and dumps to a file called TICKER_mfund_holder.csv. For instance, for Google (GOOG), this script can be run by calling:
python scrape_ms.py -t GOOG
Sample files for PFE and GSK have been provided as part of this repository.
####scrape_edb.py
This script pulls data about the top exchange-traded funds (ETFs) that hold a given stock (parameterized by TICKER) and dumps to a file called TICKER_etf_holder.csv. For instance, for Yahoo (YHOO), this script can be run by calling:
python scrape_edb.py -t YHOO
Sample files for PFE and GSK have been provided as part of this repository.
####scrape_hc.py
This script pulls data about the top institutions that hold a given stock (parameterized by TICKER) and dumps to a file called TICKER_inst_holder.csv. For instance, for Yahoo (YHOO), this script can be run by calling:
python scrape_hc.py -t YHOO
Sample files for PFE and GSK have been provided as part of this repository.
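The three ticker-based scripts above share the same `-t TICKER` interface and output naming pattern, so they could be run together for one ticker. The following is a minimal sketch under that assumption. The helpers `plan_runs` and `run_all` are hypothetical and not part of this repository, and the sketch assumes it is run from the directory containing the scripts.

```python
import subprocess
import sys

# Script names and output file suffixes as described in the sections above.
SCRIPTS = {
    "scrape_ms.py": "mfund_holder",
    "scrape_edb.py": "etf_holder",
    "scrape_hc.py": "inst_holder",
}


def plan_runs(ticker):
    """Return (command, expected_output_file) pairs for one ticker."""
    return [
        ([sys.executable, script, "-t", ticker], f"{ticker}_{suffix}.csv")
        for script, suffix in SCRIPTS.items()
    ]


def run_all(ticker):
    """Run each scraper in turn; raises CalledProcessError if one fails."""
    for command, output_file in plan_runs(ticker):
        print("Running", " ".join(command), "->", output_file)
        subprocess.run(command, check=True)
```

Calling `run_all("PFE")` would invoke each script in sequence and stop at the first failure, while `plan_runs("PFE")` only lists the commands and expected files without running anything.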
####scrape_mf.py
This script pulls data about the ticker symbols of the top mutual funds, and dumps to a file called mfund_tickers.csv. It can be run by calling:
python scrape_mf.py
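Once a CSV such as mfund_tickers.csv has been produced, its contents can be inspected with a few lines of Python. The column layout is not documented here, so this sketch makes no assumptions about column names. It does assume that the first row is a header, and the helper name `preview_csv` is hypothetical.

```python
import csv
from itertools import islice


def preview_csv(path, limit=5):
    """Return (header, first `limit` data rows) of a CSV file.

    Assumes the first row of the file is a header row.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])
        rows = list(islice(reader, limit))
    return header, rows
```

For example, `preview_csv("mfund_tickers.csv")` would return the header and the first five rows, which is a quick way to confirm that a scrape produced sensible output.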
####cleanup.sh
This script cleans up logs and csvs produced by running the scrape files. It can be executed by running:
./cleanup.sh