Making Data, the DataMade Way
This is DataMade's guide to extracting, transforming and loading (ETL) data using Make, a common command line utility.
ETL refers to the general process of:
- taking raw source data (Extract)
- reshaping and cleaning the data, possibly by way of intermediate derived files (Transform)
- and ultimately ending up with final output in a usable form (Load), ready for whatever consumes the data, be it an app, a system, a visualization, etc.
For enthralling insights on how to get from source data to final output, all while minimizing future headaches, read on!
Principles
- Treat inputs as immutable - don't modify source data directly
- Be able to deterministically produce the final data with one command
- Write as little custom code as possible
- Use standard tools whenever possible
- Source data should be under version control
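As a rough illustration of how these principles can fit together, here is a minimal Makefile sketch. It is not taken from a DataMade project: the file names (raw/districts.csv, finished/districts.csv) and the cleanup step are hypothetical, and it assumes the raw file is already checked into version control. Recipe lines must be indented with a tab.

```make
# Hypothetical layout: raw/ holds version-controlled source data,
# finished/ holds derived output that can always be regenerated.
.PHONY: all clean

# Running plain `make` rebuilds the final data with one command.
all: finished/districts.csv

# Transform step: read the raw file and write a new one.
# The raw input is never edited in place.
# tr is a standard tool; here it strips Windows carriage returns.
finished/districts.csv: raw/districts.csv
	mkdir -p $(dir $@)
	tr -d '\r' < $< > $@

# Remove derived files only; raw/ is left untouched.
clean:
	rm -rf finished
```

With this sketch, `make` produces the finished file from scratch and `make clean` discards it. Because the output depends only on the raw input and the recipe, the same command should yield the same result every time.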
The Guide
- Make & Makefile Overview
- Why Use Make/Makefiles?
- Makefile 101
- Makefile 201 - Some Fancy Things Built Into Make
- ETL Styleguide
- Makefile Best Practices
- Variables
- Processors
- Standard Toolkit
- ETL Workflow Directory Structure
Code examples
- Some Annotated ETL Code Examples with Make
- EITC Works - adding data attributes to Illinois House and Senate district shapefiles and outputting as GeoJSON