This is a PoC about using link:https://docs.delta.io/latest/index.html[Delta IO] in an AWS Glue ETL job. I’m using the Scala approach, and my goal is to experiment with Delta on S3 and see how it fits with AWS Athena and AWS Redshift Spectrum.
To build the project, Maven is required. Compile and get the jar binary with `mvn compile`.
Finally, you can run it using the following command:

[source,shell]
----
mvn exec:java -Dexec.mainClass="GlueApp" \
    -Dexec.args="--JOB-NAME delta-play-job --RAW_DATA_LOCATION <raw data path> --DELTA_LOCATION <delta table path>"
----
Try it locally, for example, with the sample data. Download it, set its local path as `RAW_DATA_LOCATION`, and point `DELTA_LOCATION` wherever you want the Delta storage to live.

[source,shell]
----
mvn exec:java -Dexec.mainClass="GlueApp" \
    -Dexec.args="--JOB-NAME delta-play-job --RAW_DATA_LOCATION /sample-data/raw/user --DELTA_LOCATION /sample-data/delta/users"
----
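The job receives its parameters as `--KEY value` pairs (on Glue itself these are resolved by Glue's `GlueArgParser`). As a rough, standalone illustration of that pairing logic, here is a minimal sketch; the object and method names are hypothetical and not part of the project code:

```scala
// Hypothetical sketch: resolve "--KEY value" pairs into a Map,
// mirroring what Glue's GlueArgParser does for the job arguments.
object ArgSketch {
  def resolve(args: Array[String]): Map[String, String] =
    args.sliding(2, 2).collect {
      // Keep only well-formed "--KEY value" pairs.
      case Array(key, value) if key.startsWith("--") =>
        key.stripPrefix("--") -> value
    }.toMap

  def main(args: Array[String]): Unit = {
    val opts = resolve(Array(
      "--JOB-NAME", "delta-play-job",
      "--RAW_DATA_LOCATION", "/sample-data/raw/user",
      "--DELTA_LOCATION", "/sample-data/delta/users"))
    println(opts("DELTA_LOCATION")) // prints /sample-data/delta/users
  }
}
```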
It is also possible to play around using Zeppelin:

[source,shell]
----
export REPO_PATH=<repo clone location>
export ZEPPELIN_DATA=~/Downloads/zeppelin-data
export SPARK_HOME=~/Downloads/aws-spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8

docker run --net host --rm -v $REPO_PATH/notebooks:/notebook \
    -v $SPARK_HOME:/spark -v $ZEPPELIN_DATA:/data \
    -e ZEPPELIN_NOTEBOOK_DIR='/notebook' \
    --user 1000 --name zeppelin apache/zeppelin:0.9.0
----
Enjoy it!
Upload `./src/main/scala/DeltaPlayGlueApp.scala` to the Glue job script configuration. Add Glue job parameters for the `RAW_DATA_LOCATION` and `DELTA_LOCATION` mentioned above.
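The core of such a job can be sketched roughly as follows, assuming the raw sample data is Parquet and the Delta Lake connector is on the job's classpath. The object and method names here are illustrative, not the project's actual code:

```scala
// Hypothetical sketch of the job's core logic: read the raw data
// and rewrite it as a Delta table. Paths would come from the
// RAW_DATA_LOCATION / DELTA_LOCATION job parameters.
import org.apache.spark.sql.SparkSession

object DeltaPlaySketch {
  def run(rawPath: String, deltaPath: String): Unit = {
    val spark = SparkSession.builder()
      .appName("delta-play-job")
      .getOrCreate()

    // Read the raw user data (assumed Parquet) and write it out
    // in Delta format at the target location.
    spark.read.parquet(rawPath)
      .write
      .format("delta")
      .mode("overwrite")
      .save(deltaPath)
  }
}
```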
This code has been tested on Glue 1.0; see the AWS Glue documentation for its specification.
The following command creates an external table (for the sample data mentioned above) using Athena, visible from the Glue Data Catalog:

[source,shell]
----
aws athena --profile pa start-query-execution --query-string "
CREATE EXTERNAL TABLE default.users_delta (
  registration_dttm TIMESTAMP,
  id int,
  first_name string,
  last_name string,
  email string,
  gender string,
  ip_address string,
  cc string,
  country string,
  birthdate string,
  salary double,
  title string,
  comments string
)
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION 's3://<bucket-name>/data/delta/users/_symlink_format_manifest/'
TBLPROPERTIES ('parquet.compress'='SNAPPY');" \
--result-configuration "OutputLocation=s3://<bucket-name>/athena/output/"
----
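Note that the `_symlink_format_manifest/` directory the table points at is not created automatically: Delta Lake generates it on demand via `DeltaTable.generate("symlink_format_manifest")`. A minimal sketch of generating it, assuming the `delta-core` library is on the classpath (the object name is hypothetical):

```scala
// Hypothetical sketch: generate the symlink manifest that lets
// Athena / Redshift Spectrum read the Delta table through
// SymlinkTextInputFormat.
import io.delta.tables.DeltaTable
import org.apache.spark.sql.SparkSession

object GenerateManifest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("generate-manifest")
      .getOrCreate()

    val table = DeltaTable.forPath(spark, "s3://<bucket-name>/data/delta/users")
    // Writes manifest files under <table path>/_symlink_format_manifest/
    table.generate("symlink_format_manifest")
  }
}
```

The manifest must be regenerated (or configured for automatic updates) after the Delta table changes, otherwise Athena keeps reading the old snapshot.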