
Financial Statement Data ETL

Scalable Scrapy/Scrapyd feed into an AWS data pipeline for analysis of SEC DERA (Division of Economic and Risk Analysis) Financial Statement Data Sets, with a Spark ETL stage.

Overview

  • Python 3 Scrapy spider to retrieve SEC Financial Statement Data Sets
  • Scrapyd & scrapyd-client used for distributed crawling
    • Resulting ZIP files are uploaded to an S3 bucket via the Scrapy feed export configuration (a sketch follows this list)
  • AWS infrastructure
    • Lambda function deployed via CloudFormation
    • Serverless s3-uncompressor SAM repo to unzip files from one S3 bucket into another
  • CloudWatch logging enabled
  • Scrapyd logging enabled
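
A minimal sketch of the kind of Scrapy feed export settings involved; the bucket name, feed path, and format below are assumptions, not the repo's actual configuration:

# settings.py -- sketch only; bucket, path, and format are assumptions
import os

# Credentials picked up from the environment (see "Set environment variables" below)
AWS_ACCESS_KEY_ID = os.environ.get("AWS_ACCESS_KEY_ID")
AWS_SECRET_ACCESS_KEY = os.environ.get("AWS_SECRET_ACCESS_KEY")

# Scrapy FEEDS setting: write exported items straight to S3
FEEDS = {
    "s3://example-sec-dera-bucket/feeds/%(name)s-%(time)s.json": {
        "format": "json",
    },
}

# If the ZIP binaries are stored via a files pipeline instead, the analogous
# setting is FILES_STORE = "s3://example-sec-dera-bucket/zips/"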

TODO

  • Incorporate Spark (a sketch follows this list)
    • Calculate metrics
      • Year-over-year growth of SEC ledger balances
      • Quarter-over-quarter growth
      • 3-year growth
      • 5-year growth
    • Export calculated results
  • Improve scalability
  • Architecture diagram
  • IAM authentication
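
A rough PySpark sketch of how the planned year-over-year growth metric could be computed. The input path, schema, and column names (tag, ddate, value from the data sets' num.txt) are assumptions, and none of this is implemented yet:

# yoy_growth.py -- illustrative sketch of the planned Spark metric, not working project code
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("sec-yoy-growth").getOrCreate()

# num.txt inside each quarterly data set is tab-delimited
num = spark.read.csv("s3://example-sec-dera-bucket/unzipped/*/num.txt",
                     sep="\t", header=True, inferSchema=True)

# Aggregate reported values per tag and year (ddate is in yyyymmdd form)
yearly = (num.withColumn("year", F.substring(F.col("ddate").cast("string"), 1, 4).cast("int"))
             .groupBy("tag", "year")
             .agg(F.sum("value").alias("total")))

# Year-over-year growth via a lag window per tag
w = Window.partitionBy("tag").orderBy("year")
yoy = (yearly.withColumn("prev_total", F.lag("total").over(w))
             .withColumn("yoy_growth",
                         (F.col("total") - F.col("prev_total")) / F.col("prev_total")))

# Export calculated results
yoy.write.mode("overwrite").parquet("s3://example-sec-dera-bucket/metrics/yoy_growth/")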

Resources

Setup

Clone the repo

git clone https://github.com/phillipsk/Financial-Statement-Data-Sets-ETL.git
cd Financial-Statement-Data-Sets-ETL

Deploy Lambda Function

AWS CLI Deploy

git checkout fork-aws-lambda
  • Follow the README.md instructions on that branch (it is a copy of Piotr's forked repo)

OR

AWS Web Deploy
  • From the AWS Lambda console
  • Lambda > Create Function > Browse Serverless App Repository > Search: "s3-uncompressor"
  • Configure the Source & Destination buckets

Troubleshooting
  • Adjust LambdaFunctionMemorySize and/or LambdaFunctionTimeout if large archives fail to unzip
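
For context, a rough sketch of what an uncompressor Lambda of this kind does; this is not the actual s3-uncompressor source, and the destination-bucket environment variable name is an assumption:

# illustrative S3 unzip handler -- not the s3-uncompressor source code
import io
import os
import zipfile

import boto3

s3 = boto3.client("s3")
DEST_BUCKET = os.environ["DESTINATION_BUCKET"]  # assumed variable name

def handler(event, context):
    # Triggered by ObjectCreated events on the source bucket
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        prefix = key[:-4] if key.endswith(".zip") else key
        with zipfile.ZipFile(io.BytesIO(body)) as archive:
            for name in archive.namelist():
                s3.put_object(Bucket=DEST_BUCKET,
                              Key=f"{prefix}/{name}",
                              Body=archive.read(name))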

Switch back to the master branch

git checkout master

Create a virtualenv

virtualenv venv
source venv/bin/activate

Install dependencies

pip install -r requirements.txt

Set environment variables

export AWS_ACCESS_KEY_ID=[xxxxxxxxxxxxxxxxxxx]
export AWS_SECRET_ACCESS_KEY=[xxxxxxxxxxxxxxxxxxxxxxxx]

Run the Scrapy spider

Crawl all ZIP files listed in the SEC table

scrapy crawl sec_table

OR

Specify a year as an argument (see the spider sketch below)

scrapy crawl args_spider -a year=2011
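
For reference, a minimal sketch of how a Scrapy spider can accept a year passed with -a; the spider name, URL, and parsing logic are illustrative, not the repo's actual args_spider:

# illustrative spider -- not the repo's actual args_spider implementation
import scrapy

class ArgsSpider(scrapy.Spider):
    name = "args_spider_example"
    start_urls = ["https://www.sec.gov/dera/data/financial-statement-data-sets.html"]

    def __init__(self, year=None, *args, **kwargs):
        # -a year=2011 on the command line arrives here as a string
        super().__init__(*args, **kwargs)
        self.year = year

    def parse(self, response):
        for href in response.css("a::attr(href)").getall():
            # keep only quarterly ZIP links for the requested year, e.g. 2011q3.zip
            if href.endswith(".zip") and (self.year is None or self.year in href):
                yield {"file_urls": [response.urljoin(href)]}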

Distributed crawling

  • Launch a separate EC2 instance

On the new instance, set the environment variables

export AWS_ACCESS_KEY_ID=[xxxxxxxxxxxxxxxxxxx]
export AWS_SECRET_ACCESS_KEY=[xxxxxxxxxxxxxxxxxxxxxxxx]

Clone the repo and install the dependencies

git clone https://github.com/phillipsk/Financial-Statement-Data-Sets-ETL.git
cd Financial-Statement-Data-Sets-ETL
pip install -r requirements.txt

Install scrapyd

pip install scrapyd

Run the scrapyd daemon

scrapyd
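
Note: Scrapyd binds to 127.0.0.1 by default, so the daemon may not be reachable from the original instance. A minimal scrapyd.conf along these lines makes it listen on all interfaces (restrict access, e.g. via security groups, before exposing the port):

# scrapyd.conf -- minimal sketch
[scrapyd]
bind_address = 0.0.0.0
http_port    = 6800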

Switch back to the original instance

Check out the distributed branch

git checkout feature-aws-distributed

Configure scrapy.cfg
  • Under the [deploy] section
    • Set the URL to the EC2 instance's IP (by default Scrapyd listens on port 6800)
    • Note the project name

[deploy]
url = http://x.x.x.x:6800/
project = secScrap

Install scrapyd-client

pip install scrapyd-client

Deploy the project and schedule the sec_table spider

scrapyd-deploy default -p secScrap
curl http://x.x.x.x:6800/schedule.json -d project=secScrap -d spider=sec_table
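
To confirm the job was scheduled and check its status, Scrapyd's listjobs.json endpoint can be queried against the same host and project:

curl "http://x.x.x.x:6800/listjobs.json?project=secScrap"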