
Financial Statement Data ETL

Scalable Scrapy/Scrapyd feed into an AWS data pipeline for analysis of SEC DERA (Division of Economic and Risk Analysis) Financial Statement Data Sets, with a Spark ETL stage.

Overview

  • Python 3 Scrapy spider to retrieve SEC Financial Statement Data Sets
  • Scrapyd & scrapyd-client used for distributed crawling
    • Resulting ZIP files are uploaded to an S3 bucket via the Scrapy feed export configuration (a sketch follows this list)
  • AWS infrastructure
    • Lambda function deployed via CloudFormation
    • Serverless s3-uncompressor SAM repo to unzip files from one S3 bucket into another
  • CloudWatch logging enabled
  • Scrapyd logging enabled
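
A minimal sketch of the kind of Scrapy feed export settings involved; the bucket name, feed path, and format below are assumptions, not the repo's actual configuration:

# settings.py -- sketch only; bucket, path, and format are assumptions
import os

# Credentials picked up from the environment (see "Set environment variables" below)
AWS_ACCESS_KEY_ID = os.environ.get("AWS_ACCESS_KEY_ID")
AWS_SECRET_ACCESS_KEY = os.environ.get("AWS_SECRET_ACCESS_KEY")

# Scrapy FEEDS setting: write exported items straight to S3
FEEDS = {
    "s3://example-sec-dera-bucket/feeds/%(name)s-%(time)s.json": {
        "format": "json",
    },
}

# If the ZIP binaries are stored via a files pipeline instead, the analogous
# setting is FILES_STORE = "s3://example-sec-dera-bucket/zips/"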

TODO

  • Incorporate Spark (a sketch follows this list)
    • Calculate metrics
      • Year-over-year growth of SEC ledger balances
      • Quarter-over-quarter growth
      • 3-year growth
      • 5-year growth
    • Export calculated results
  • Improve scalability
  • Architecture diagram
  • IAM authentication
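
A rough PySpark sketch of how the planned year-over-year growth metric could be computed. The input path, schema, and column names (tag, ddate, value from the data sets' num.txt) are assumptions, and none of this is implemented yet:

# yoy_growth.py -- illustrative sketch of the planned Spark metric, not working project code
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("sec-yoy-growth").getOrCreate()

# num.txt inside each quarterly data set is tab-delimited
num = spark.read.csv("s3://example-sec-dera-bucket/unzipped/*/num.txt",
                     sep="\t", header=True, inferSchema=True)

# Aggregate reported values per tag and year (ddate is in yyyymmdd form)
yearly = (num.withColumn("year", F.substring(F.col("ddate").cast("string"), 1, 4).cast("int"))
             .groupBy("tag", "year")
             .agg(F.sum("value").alias("total")))

# Year-over-year growth via a lag window per tag
w = Window.partitionBy("tag").orderBy("year")
yoy = (yearly.withColumn("prev_total", F.lag("total").over(w))
             .withColumn("yoy_growth",
                         (F.col("total") - F.col("prev_total")) / F.col("prev_total")))

# Export calculated results
yoy.write.mode("overwrite").parquet("s3://example-sec-dera-bucket/metrics/yoy_growth/")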

Resources

Setup

Clone the repo

git clone https://github.com/phillipsk/Financial-Statement-Data-Sets-ETL.git
cd Financial-Statement-Data-Sets-ETL

Deploy Lambda Function

AWS CLI Deploy

git checkout fork-aws-lambda
  • Follow the README.md instructions on that branch (it is a copy of Piotr's forked repo)

OR

AWS Web Deploy
  • From the AWS Lambda console
  • Lambda > Create Function > Browse Serverless App Repository > Search: "s3-uncompressor"
  • Configure the Source & Destination buckets

Troubleshooting
  • Adjust LambdaFunctionMemorySize and/or LambdaFunctionTimeout if large archives fail to unzip
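
For context, a rough sketch of what an uncompressor Lambda of this kind does; this is not the actual s3-uncompressor source, and the destination-bucket environment variable name is an assumption:

# illustrative S3 unzip handler -- not the s3-uncompressor source code
import io
import os
import zipfile

import boto3

s3 = boto3.client("s3")
DEST_BUCKET = os.environ["DESTINATION_BUCKET"]  # assumed variable name

def handler(event, context):
    # Triggered by ObjectCreated events on the source bucket
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        prefix = key[:-4] if key.endswith(".zip") else key
        with zipfile.ZipFile(io.BytesIO(body)) as archive:
            for name in archive.namelist():
                s3.put_object(Bucket=DEST_BUCKET,
                              Key=f"{prefix}/{name}",
                              Body=archive.read(name))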

Switch back to the master branch

git checkout master

Create a virtualenv

virtualenv venv
source venv/bin/activate

Install dependencies

pip install -r requirements.txt

Set environment variables

export AWS_ACCESS_KEY_ID=[xxxxxxxxxxxxxxxxxxx]
export AWS_SECRET_ACCESS_KEY=[xxxxxxxxxxxxxxxxxxxxxxxx]

Run the Scrapy spider

Crawl all ZIP files listed in the SEC table

scrapy crawl sec_table

OR

Specify a year as an argument (see the spider sketch below)

scrapy crawl args_spider -a year=2011
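
For reference, a minimal sketch of how a Scrapy spider can accept a year passed with -a; the spider name, URL, and parsing logic are illustrative, not the repo's actual args_spider:

# illustrative spider -- not the repo's actual args_spider implementation
import scrapy

class ArgsSpider(scrapy.Spider):
    name = "args_spider_example"
    start_urls = ["https://www.sec.gov/dera/data/financial-statement-data-sets.html"]

    def __init__(self, year=None, *args, **kwargs):
        # -a year=2011 on the command line arrives here as a string
        super().__init__(*args, **kwargs)
        self.year = year

    def parse(self, response):
        for href in response.css("a::attr(href)").getall():
            # keep only quarterly ZIP links for the requested year, e.g. 2011q3.zip
            if href.endswith(".zip") and (self.year is None or self.year in href):
                yield {"file_urls": [response.urljoin(href)]}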

Distributed crawling

  • Launch a separate EC2 instance

On the new instance, set the environment variables

export AWS_ACCESS_KEY_ID=[xxxxxxxxxxxxxxxxxxx]
export AWS_SECRET_ACCESS_KEY=[xxxxxxxxxxxxxxxxxxxxxxxx]

Clone the repo and install the dependencies

git clone https://github.com/phillipsk/Financial-Statement-Data-Sets-ETL.git
cd Financial-Statement-Data-Sets-ETL
pip install -r requirements.txt

Install scrapyd

pip install scrapyd

Run the scrapyd daemon

scrapyd
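
Note: Scrapyd binds to 127.0.0.1 by default, so the daemon may not be reachable from the original instance. A minimal scrapyd.conf along these lines makes it listen on all interfaces (restrict access, e.g. via security groups, before exposing the port):

# scrapyd.conf -- minimal sketch
[scrapyd]
bind_address = 0.0.0.0
http_port    = 6800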

Switch back to the original instance

Check out the distributed branch

git checkout feature-aws-distributed

Configure scrapy.cfg
  • Under the [deploy] section
    • Set the URL to the EC2 instance's IP (by default Scrapyd listens on port 6800)
    • Note the project name

[deploy]
url = http://x.x.x.x:6800/
project = secScrap

Install scrapyd-client

pip install scrapyd-client

Deploy the project and schedule the sec_table spider

scrapyd-deploy default -p secScrap
curl http://x.x.x.x:6800/schedule.json -d project=secScrap -d spider=sec_table
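
To confirm the job was scheduled and check its status, Scrapyd's listjobs.json endpoint can be queried against the same host and project:

curl "http://x.x.x.x:6800/listjobs.json?project=secScrap"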