A framework that lets you easily create Spark ETL jobs using simple configuration files (YAML/JSON).
Prerequisite
Requires JDK 8
How to create a new ETL job using dataflow?
To create a new ETL job, define two YAML configuration files.
The first configuration file consists of the following information:
- Input sources (e.g. CSV, JSON, Parquet, Azure, Kafka, JDBC)
- Transformation definition file
- Output sinks (e.g. CSV, JSON, Parquet, Redshift, JDBC, Kafka, GCS, Azure)
For example, a configuration that reads from an Azure Blob Storage input source, defines a transformation file, and writes to an output sink looks as follows:
```yaml
appName: CovidCaseTracker
inputs:
  covid:
    azure:
      storageType: AzureBlobStorage
      containerName: public
      storageAccountName: pandemicdatalake
      blob_sas_token: ""
      blob_relative_path: curated/covid-19/ecdc_cases/latest/ecdc_cases.csv
      format: csv
      options:
        quoteAll: false
transformations:
  - /config/covid_tracking_transformations.yaml
output:
  file:
    dir: gs://dataflow_covid_data
```
The second configuration file (referenced above) defines the transformations you want to apply to your data. Here we apply Spark SQL transformations to the covid dataframe defined in the first config file, and the resulting dataframes are written to a GCPBigQuery sink:
```yaml
steps:
- dataFrameName: casesInUS
  sql:
    SELECT *
    FROM covid
    WHERE COUNTRIES_AND_TERRITORIES='United_States_of_America'
- dataFrameName: casesInCanada
  sql:
    SELECT *
    FROM covid
    WHERE COUNTRIES_AND_TERRITORIES='Canada'
output:
- dataFrameName: casesInUS
  outputType: GCPBigQuery
  outputOptions:
    saveMode: Append
    dbTable: dataflow.casesInUS
- dataFrameName: casesInCanada
  outputType: GCPBigQuery
  outputOptions:
    saveMode: Append
    dbTable: dataflow.casesInCanada
```
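For intuition, the two config files above correspond roughly to the hand-written Spark job sketched below. This is only a conceptual sketch of what the framework automates, not its actual code: the staged input path, CSV options, and BigQuery connector settings (the spark-bigquery connector typically also needs a temporary GCS bucket or direct-write option) are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession

// Conceptual sketch of the job described by the two config files above.
// Paths, options, and the BigQuery connector setup are assumptions for illustration.
object CovidCaseTrackerSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("CovidCaseTracker").getOrCreate()

    // Input: register the CSV as the "covid" view (the framework wires up the
    // Azure Blob Storage source for you; here we just read a staged copy).
    spark.read
      .format("csv")
      .option("header", "true")
      .load("ecdc_cases.csv")
      .createOrReplaceTempView("covid")

    // Steps: each dataFrameName becomes a dataframe produced by its SQL statement.
    val casesInUS = spark.sql(
      "SELECT * FROM covid WHERE COUNTRIES_AND_TERRITORIES = 'United_States_of_America'")
    val casesInCanada = spark.sql(
      "SELECT * FROM covid WHERE COUNTRIES_AND_TERRITORIES = 'Canada'")

    // Output: each output entry becomes a write to the configured sink
    // (assumes the spark-bigquery connector is on the classpath).
    casesInUS.write.format("bigquery")
      .option("table", "dataflow.casesInUS")
      .mode("append")
      .save()
    casesInCanada.write.format("bigquery")
      .option("table", "dataflow.casesInCanada")
      .mode("append")
      .save()

    spark.stop()
  }
}
```

With dataflow, none of this boilerplate needs to be written or maintained; the two YAML files fully describe the job.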
Running the test cases:
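Assuming the project uses the same sbt-based Scala build as metorikku (an assumption about the build tool), the test suite can typically be run from the project root with `sbt test`.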
For more samples, check out the test configs.
Acknowledgement
This project is built on top of a lightweight version of the amazing open-source Spark ETL framework metorikku. Check it out if you require more advanced features like instrumentation.