Analysis of SEC DERA data
- Python 3 Scrapy spider to retrieve SEC data
  - Scrapyd & scrapyd-client used for distributed crawling
  - Resulting zip files uploaded to an S3 bucket via Scrapy's feed export configuration
- AWS infrastructure
  - Lambda function deployed via CloudFormation
  - Serverless s3-uncompressor SAM repo to unzip files from one S3 bucket into another
  - CloudWatch logging enabled
  - Scrapyd logging enabled
- Incorporate Spark
  - Calculate metrics
    - Year-over-year growth of SEC ledger balances
    - Quarter-over-quarter growth
    - 3-year growth
    - 5-year growth
  - Export calculated results
- Improve scalability
  - Architecture diagram
  - IAM authentication
- Data Location: https://www.sec.gov/dera/data/financial-statement-data-sets.html
- Data Dictionary: https://www.sec.gov/files/aqfs.pdf
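The S3 upload described above goes through Scrapy's feed exports; a minimal sketch of what such a `settings.py` entry could look like (the bucket name is a placeholder, and the repo's actual feed options may differ):

```python
# Sketch of a Scrapy feed export writing crawl output to S3.
# "my-sec-dera-bucket" is a placeholder; credentials are taken from the
# AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY environment variables.
FEEDS = {
    "s3://my-sec-dera-bucket/%(name)s/%(time)s.json": {
        "format": "json",   # one of Scrapy's built-in serialization formats
        "overwrite": True,  # replace the object if it already exists
    },
}
```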
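The growth metrics listed above all reduce to the same arithmetic over per-period balances; a pure-Python sketch of that calculation (the actual job would run in Spark, and the helper name is hypothetical):

```python
def periodic_growth(balances, span=1):
    """Growth of a ledger balance over `span` periods.

    `balances` maps a period key (a year, or any sortable quarter key)
    to a balance; returns {period: fractional growth vs. the period
    `span` steps earlier}. Hypothetical helper illustrating the
    year-over-year / quarter-over-quarter / 3-year / 5-year metrics
    listed above.
    """
    periods = sorted(balances)
    return {
        cur: (balances[cur] - balances[prev]) / balances[prev]
        for prev, cur in zip(periods, periods[span:])
    }

# Year-over-year growth (span=1) on hypothetical balances
yoy = periodic_growth({2010: 100.0, 2011: 120.0, 2012: 150.0})
# 3-year and 5-year growth would use span=3 and span=5
```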
```shell
git clone https://github.com/phillipsk/Financial-Statement-Data-Sets-ETL.git
cd Financial-Statement-Data-Sets-ETL
git checkout fork-aws-lambda
```
- Follow README.md instructions (this is a copy of Piotr's forked repo)
- From the AWS Lambda console:
  - Lambda > Create Function > Browse Serverless App Repository > Search: "s3-uncompressor"
  - Configure the Source & Destination buckets
  - Adjust LambdaFunctionMemorySize and/or LambdaFunctionTimeout as needed
```shell
git checkout master
virtualenv venv
source venv/bin/activate
pip install -r requirements.txt
export AWS_ACCESS_KEY_ID=[xxxxxxxxxxxxxxxxxxx]
export AWS_SECRET_ACCESS_KEY=[xxxxxxxxxxxxxxxxxxxxxxxx]
scrapy crawl sec_table
scrapy crawl args_spider -a year=2011
```
- Launch a separate EC2 instance
```shell
export AWS_ACCESS_KEY_ID=[xxxxxxxxxxxxxxxxxxx]
export AWS_SECRET_ACCESS_KEY=[xxxxxxxxxxxxxxxxxxxxxxxx]
git clone https://github.com/phillipsk/Financial-Statement-Data-Sets-ETL.git
cd Financial-Statement-Data-Sets-ETL
pip install -r requirements.txt
pip install scrapyd
scrapyd
git checkout feature-aws-distributed
```
- Under the `[deploy]` section of the project's `scrapy.cfg`, set the URL to the EC2 instance's IP
- Note the project name
- By default, Scrapyd listens on port 6800
```ini
[deploy]
url = http://x.x.x.x:6800/
project = secScrap
```
```shell
pip install scrapyd-client
scrapyd-deploy default -p secScrap
curl http://x.x.x.x:6800/schedule.json -d project=secScrap -d spider=sec_table
```
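The `curl` call above can also be issued from Python; a stdlib-only sketch using the host and project names configured above:

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

def schedule_spider(host, project, spider, **spider_args):
    """Build a Scrapyd schedule.json request (equivalent to the curl
    call above); extra kwargs are forwarded as spider arguments."""
    data = urlencode({"project": project, "spider": spider, **spider_args})
    return Request(
        f"http://{host}:6800/schedule.json",  # Scrapyd's default port
        data=data.encode(),
        method="POST",
    )

req = schedule_spider("x.x.x.x", "secScrap", "sec_table")
# urlopen(req) would submit the job; Scrapyd replies with JSON such as
# {"status": "ok", "jobid": "..."}
```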