/delta-glue-app

PoC using io.delta in an AWS Glue App

Primary LanguageScala

DeltaApp AWS-Glue Application

badge

Abstract

This is a PoC about using linkhttps://docs.delta.io/latest/index.html[Delta IO] in an AWS Glue ETL. I’m using the Scala approach, and my goal is to experiment with Delta on S3, and see how it fits with AWS Athena and AWS Redshift Spectrum.

Build

In order to build the project maven`is required. Compile and get the jar binary with `mvn compile.

Finally you can run it using the following command:

mvn exec:java -Dexec.mainClass="GlueApp" \
-Dexec.args="--JOB-NAME delta-play-job --RAW_DATA_LOCATION <raw data path> --DELTA_LOCATION <delta table path>"

Try-It

Try it locally, for example, using sample data. Download it, and set the local path location as RAW_DATA_LOCATION and point DELTA_LOCATION anywhere you want to get the Delta storage.

mvn exec:java -Dexec.mainClass="GlueApp" -Dexec.args="--JOB-NAME delta-play-job --RAW_DATA_LOCATION /sample-data/raw/user --DELTA_LOCATION /sample-data/delta/users"

It is possible to play around using Zeppelin:

export REPO_PATH=<repo clone location>
export ZEPPELIN_DATA=~/Downloads/zeppelin-data
export SPARK_HOME=~/Downloads/aws-spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8

docker run --net host --rm -v $REPO_PATH/notebooks:/notebook \
-v $SPARK_HOME:/spark -v $ZEPPELIN_DATA:/data \
-e ZEPPELIN_NOTEBOOK_DIR='/notebook' \
--user 1000 --name zeppelin apache/zeppelin:0.9.0

Enjoy It!

AWS-Glue execution

Upload ./src/main/scala/DeltaPlayGlueApp.scala to Glue Job script config. Add Glue Job parameters for the mentioned RAW_DATA_LOCATION and DELTA_LOCATION.

AWS Glue version

This code has been tested on Glue 1.0. Here you would find about it specification.

AWS Data Catalog

The following command creates a external table(for the mentioned sample data) using athena visible from Glue Data Catalog:

aws athena --profile pa start-query-execution --query-string "
 CREATE EXTERNAL TABLE default.users_delta (
    registration_dttm TIMESTAMP,
    id int
    first_name string,
    last_name string,
    email string,
    gender string,
    ip_address string,
    cc string,
    country string,
    birthdate string,
    salary double,
    title string,
    comments string
  )
  STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
  LOCATION 's3://<bucket-name>/data/delta/users/_symlink_format_manifest/'
  tblproperties ('parquet.compress'='SNAPPY');"  \
  --result-configuration "OutputLocation=s3://<bucket-name>/athena/output/"