IMPORTANT UPDATE on Apr 2, 2017: I've decided to stop maintaining this repo. The HTTP API (http://data.pystock.com) and the GitHub repo will be still accessible, only that latest data will stay at Mar 31, 2017 forever. However, I do offer paid service for maintaining this dataset. Please contact me at eliang.cs@gmail.com.
Historical data of US stocks, updated daily.
The data are collected using pystock-crawler and pystock-github. Every day before the US stock exchanges open at 9:30 EST/EDT, the crawler collects the stock prices and financial reports, and pushes the data to this repository. Thus, the latest data is of yesterday. The data are day-based, meaning that you won't find hourly or minute-level data here.
pystock-crawler
crawls data from the following sources:
- NASDAQ.com for company ticker symbols
- Yahoo Finance for stock prices
- SEC EDGAR for financial reports
All data are stored in CSV and TXT files, archived with gzip. The files are
categorized and named by their created dates. For example, a file collected on
2015-03-23 is named 20150323.tar.gz
placed under 2015
directory.
Initial data are the first batch of collected data, whose date range spans from
2009-01-01 to 2015-03-20. They are split into three files: 0001_initial.tar.gz
to 0003_initial.tar.gz
, under 2015
directory.
Every gzip archive file may or may not contain the following files:
symbols.txt
prices.csv
reports.csv
symbols.txt
is a list of companies, line by line. For example:
AAPL Apple Inc.
FB Facebook
GOOGL Google Inc.
prices.csv
contains daily prices in CSV format. Normally, in a
daily-generated file, two trading days of prices are included. This is because
when a company splits its shares, you will need to compare the close price
(close
) and the adjusted close price (adj_close
) of the previous trading
day to detect the split. See Yahoo's explaination
for more details.
Here's a sample of prices.csv
:
symbol,date,open,high,low,close,volume,adj_close
AAPL,2015-03-23,127.12,127.85,126.52,127.21,36761000,127.21
AAPL,2015-03-20,128.25,128.40,125.16,125.90,67941100,125.90
reports.csv
contains several financial metrics extracted from 10-Q or 10-K
reports. 10-Q is a quarterly report. 10-K is an annual report. The CSV file
looks like:
symbol,end_date,amend,period_focus,fiscal_year,doc_type,revenues,op_income,net_income,eps_basic,eps_diluted,dividend,assets,cur_assets,cur_liab,cash,equity,cash_flow_op,cash_flow_inv,cash_flow_fin
GOOG,2009-06-30,False,Q2,2009,10-Q,5522897000.0,1873894000.0,1484545000.0,4.7,4.66,0.0,35158760000.0,23834853000.0,2000962000.0,11911351000.0,31594856000.0,3858684000.0,-635974000.0,46354000.0
GOOG,2009-09-30,False,Q3,2009,10-Q,5944851000.0,2073718000.0,1638975000.0,5.18,5.13,0.0,37702845000.0,26353544000.0,2321774000.0,12087115000.0,33721753000.0,6584667000.0,-3245963000.0,74851000.0
GOOG,2009-12-31,False,FY,2009,10-K,23650563000.0,8312186000.0,6520448000.0,20.62,20.41,0.0,40496778000.0,29166958000.0,2747467000.0,10197588000.0,36004224000.0,9316198000.0,-8019205000.0,233412000.0
The financial metrics included are:
end_date
: the end date of the period of the filing reportamend
: is the filing an amendment?period_focus
: Q1, Q2, Q3 for quarterly reports, or FY for annual reportsfiscal_year
: fiscal year of the companydoc_type
: 10-Q or 10-K
Income statement
revenues
: revenues or salesop_income
: operating incomenet_income
: net income or net earningseps_basic
: basic earnings per shareeps_diluted
: diluted earnings per sharedividend
Balance sheet
assets
: total assetscur_assets
: current assetscur_liab
: current liabilitiescash
: cash and cash equivalentsequity
: total equity
Cash flow
cash_flow_op
: cash from operating activitiescash_flow_inv
: cash from investing activitiescash_flow_fin
: cash from financing activities
Besides the Github repo (https://github.com/eliangcs/pystock-data), you can also use the HTTP API provided by http://data.pystock.com.
This "website" is built with Github Pages and is a mirror of the git repo. If you're building an application on pystock-data, http://data.pystock.com is a better choice over the git repo, since not only is HTTP API easier to use, but you will also benefit from the fast CDN provided by Github Pages.
To navigate the filesystem on the HTTP API, you just visit index file:
http://data.pystock.com/index.txt
where you can find the index file for each year. For instance:
http://data.pystock.com/2016/index.txt
From there, you will find all the .tar.gz files for that year. For instance: