S3 as a cost-efficient database.
- Database storage is (relatively) expensive.
- S3 is really cheap!
- I can load all the data for a dataset in under 30 minutes using AWS Lambda (and that's a conservative estimate). Previous naive automated loads into PostgreSQL took half a day at best! Earlier manual attempts took about a week. (See the lookup sketch just after this list.)
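The payoff of treating S3 as the database is that a "query" is just an object fetch. Here's a minimal sketch of the lookup pattern using the aws-sdk v2 client; the bucket name and key layout are illustrative, not the repo's actual values:

```js
// Minimal sketch: each slice of a dataset lives under a predictable key,
// so a lookup is a single GetObject call -- no database server involved.
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

// Hypothetical key layout: year/table/geoid.
async function getRecords(year, table, geoid) {
  const resp = await s3
    .getObject({ Bucket: 'my-acs-data', Key: `${year}/${table}/${geoid}.json` })
    .promise();
  return JSON.parse(resp.Body.toString('utf8'));
}
```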
Data is processed with a (mostly) serverless pipeline.
The first Lambda (dataupload.js, controlled by upload-control.js) loads raw data into a staging bucket in the cloud. The second Lambda (dataparse.js, controlled by parse-control.js) parses that data and writes it to an S3 bucket.
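For orientation, here's a hypothetical skeleton of the second stage. The real dataparse.js will differ; the bucket names, event shape, and transform below are placeholders:

```js
// Hypothetical dataparse-style handler: read a raw file from the staging
// bucket, transform it, and write the result to the destination bucket.
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

// Stand-in for the real parsing logic (here: a naive CSV split).
const transform = text => text.split('\n').map(line => line.split(','));

exports.handler = async event => {
  const raw = await s3
    .getObject({ Bucket: 'staging-bucket', Key: event.key })
    .promise();

  const parsed = transform(raw.Body.toString('utf8'));

  await s3
    .putObject({
      Bucket: 'data-bucket',
      Key: event.key.replace(/\.csv$/, '.json'),
      Body: JSON.stringify(parsed),
    })
    .promise();
};
```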
This will blow through your Lambda free-tier credits and probably rack up a few dollars in charges (under $5). But running hundreds of concurrent processes is fun, so it might be worth it to you.
Assumes Node.js 8+ and NPM. Deployment is done via the Serverless Framework.
```
git clone https://github.com/royhobbstn/s3-db.git
cd s3-db
npm install
serverless deploy
```
If you're not using an Amazon cloud machine of some sort, you may need to set up an aws_key.json file in the same format as aws_key.example.js. You'll then need to uncomment the lines in parse-control.js and upload-control.js marked CREDENTIALS.
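For reference, the key file presumably looks something like the snippet below, and loading it is the standard aws-sdk v2 mechanism. The repo's actual CREDENTIALS lines may differ; check aws_key.example.js for the authoritative format:

```js
// Assumed shape of aws_key.json (mirror aws_key.example.js):
// { "accessKeyId": "...", "secretAccessKey": "...", "region": "us-east-1" }

const AWS = require('aws-sdk');
// Standard aws-sdk v2 call for file-based credentials -- presumably
// what the commented-out CREDENTIALS lines enable.
AWS.config.loadFromPath('./aws_key.json');
```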
Populate the metadata bucket (a prerequisite for loading data; see the sketch after these commands):
```
node parse-acs-geofiles.js $year
node parse-acs-schemas.js $year
```
where $year is one of 2014, 2015, or 2016.
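Under the hood, populating the metadata bucket amounts to deriving lookup tables from the Census geofiles and schemas and writing them to S3 as JSON. A rough sketch, with a hypothetical bucket name and key layout:

```js
// Rough sketch only: persist a derived schema/geography lookup as JSON.
// The real scripts hardcode their own bucket names and key layout.
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

async function saveMetadata(year, lookup) {
  await s3
    .putObject({
      Bucket: 'my-acs-metadata', // hypothetical name
      Key: `schemas/${year}.json`,
      Body: JSON.stringify(lookup),
      ContentType: 'application/json',
    })
    .promise();
}
```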
Step one is to upload Census data into a cloud staging bucket.
```
node upload-control $year
```
Step two is to parse that data into the desired format.
```
node parse-control $year
```
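The hundreds of concurrent processes mentioned above come from the control scripts fanning work out as asynchronous Lambda invocations. A sketch of that pattern; the function name (the Serverless Framework deploys functions as service-stage-function) and payload shape are assumptions:

```js
// Fan-out sketch: fire one async ('Event') invocation per input file,
// so many parses run concurrently instead of in sequence.
const AWS = require('aws-sdk');
const lambda = new AWS.Lambda();

async function fanOut(files) {
  await Promise.all(
    files.map(file =>
      lambda
        .invoke({
          FunctionName: 's3-db-dev-dataparse', // assumed service-stage-function name
          InvocationType: 'Event', // async; returns as soon as the invoke is queued
          Payload: JSON.stringify({ file }),
        })
        .promise()
    )
  );
}
```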
Bucket names are hardcoded (sorry).