AWS Glue versions 3.0 and later natively bundle the dependencies required to run a version of Apache Iceberg. To use the version of Iceberg that is bundled with Glue, refer to the Using the Iceberg framework in AWS Glue documentation.
Additionally, AWS Glue offers AWS Marketplace connectors for Apache Iceberg, which give you access to different versions of Iceberg.
If you do not want to use the natively bundled version of Iceberg or the Marketplace connectors, you can configure a Glue job to use Iceberg from JAR files downloaded directly from the Apache Iceberg release page. This method offers the most flexibility in choosing an Iceberg version but requires some additional configuration.
The steps below show how to use Iceberg in your Glue job via dependent JAR files.
There are two ways to run the example: deploy a CloudFormation stack, or go step by step in the AWS console.
Click on the button below to deploy a CloudFormation stack.
The stack creates an S3 bucket with the required JAR files and a Glue job. After the CloudFormation stack is deployed, open the Glue console, click on the job, and run it.
- Identify and download the Iceberg JAR file
The release page in the Iceberg documentation lets you download the JAR files for Iceberg. Each version of Iceberg has multiple JAR files available for download, so it is important to download the correct ones.
Download the JAR files associated with Spark and ignore the JAR files for Flink and Hive. For Spark, Iceberg publishes different JARs depending on the Spark version. The version of Spark that Glue runs is determined by the Glue version you choose. Use the AWS Glue version documentation to determine which version of Spark the Glue job will use, then download the corresponding Iceberg JAR file.
For example, the AWS documentation states that Glue 4.0 uses Spark 3.3, so with Glue 4.0 you would download the Iceberg JAR for Spark 3.3.
- Identify and download the AWS bundle JAR file
The release page in the Iceberg documentation also offers an aws-bundle JAR file. Download the aws-bundle JAR file that corresponds with the version of the Iceberg JAR you downloaded.
For example, working with Iceberg version 1.5.2 and AWS Glue 4.0, you would download the following JARs: iceberg-spark-runtime-3.3_2.12-1.5.2.jar and iceberg-aws-bundle-1.5.2.jar.
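The naming pattern behind these files can be sketched in a few lines of Python. This is only an illustration: the helper name `iceberg_jar_names` and the version table are assumptions, the Scala suffix `2.12` is taken from the file names used in this guide, and only the Glue 4.0 entry comes from the text above. Not every Iceberg release ships a runtime for every Spark version, so check the release page before relying on any other combination.

```python
from typing import List

# Glue 4.0 -> Spark 3.3 is stated in this guide; the other entries are
# assumptions to verify against the AWS Glue version documentation.
GLUE_TO_SPARK = {
    "3.0": "3.1",
    "4.0": "3.3",
    "5.0": "3.5",
}

SCALA_VERSION = "2.12"  # taken from the JAR file names used in this guide


def iceberg_jar_names(glue_version: str, iceberg_version: str) -> List[str]:
    """Return the Spark runtime and aws-bundle JAR names for a Glue version."""
    try:
        spark_version = GLUE_TO_SPARK[glue_version]
    except KeyError:
        raise ValueError(f"Unknown Glue version: {glue_version!r}") from None
    return [
        f"iceberg-spark-runtime-{spark_version}_{SCALA_VERSION}-{iceberg_version}.jar",
        f"iceberg-aws-bundle-{iceberg_version}.jar",
    ]


print(iceberg_jar_names("4.0", "1.5.2"))
```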
- Upload the JAR files to S3
Upload both JAR files to an S3 bucket.
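If you prefer to script the upload instead of using the console, a minimal sketch with boto3 might look like the following. The helper name `upload_jars` and the `jars/` key prefix are assumptions (the prefix mirrors the example path used later in this guide); `upload_file` is the standard boto3 S3 client call.

```python
from pathlib import Path
from typing import List


def upload_jars(s3_client, bucket: str, jar_paths: List[str],
                prefix: str = "jars/") -> List[str]:
    """Upload each local JAR and return the resulting s3:// URIs."""
    uris = []
    for jar_path in jar_paths:
        key = f"{prefix}{Path(jar_path).name}"
        # upload_file(Filename, Bucket, Key) copies a local file to S3.
        s3_client.upload_file(str(jar_path), bucket, key)
        uris.append(f"s3://{bucket}/{key}")
    return uris


# Example call (requires boto3 and AWS credentials):
#   import boto3
#   upload_jars(boto3.client("s3"), "<example-bucket>",
#               ["iceberg-aws-bundle-1.5.2.jar",
#                "iceberg-spark-runtime-3.3_2.12-1.5.2.jar"])
```

Passing the client in as an argument keeps the helper easy to exercise without touching a real bucket.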
- Create a Glue Data Catalog database
Navigate to the AWS Glue home page in the AWS console. Under the Glue Data Catalog section, select Databases and click Add database.
Name the database iceberg.
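The same database can also be created programmatically. The sketch below assumes a boto3 Glue client is passed in; the helper name `ensure_database` is hypothetical, while `get_database`, `create_database`, and `EntityNotFoundException` are existing boto3 Glue client members.

```python
def ensure_database(glue_client, name: str = "iceberg") -> bool:
    """Create the Glue database if it is missing; return True if it was created."""
    try:
        glue_client.get_database(Name=name)
        return False  # database already exists, nothing to do
    except glue_client.exceptions.EntityNotFoundException:
        glue_client.create_database(DatabaseInput={"Name": name})
        return True


# Example call (requires boto3 and AWS credentials):
#   import boto3
#   ensure_database(boto3.client("glue"))
```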
- Create and configure a Glue job
Navigate to Glue Studio and create a new Spark job via the script editor.
After configuring the standard aspects of a Glue job, such as choosing an IAM role and renaming and saving the job, open the Job details tab, expand the Advanced properties section, and navigate to the Libraries sub-section.
Update the Dependent JARs path field with the URI of each JAR file, separating the two URIs with a comma and no spaces.
For example, if the JAR files are in an S3 bucket named <example-bucket>, you would enter the following into the Dependent JARs path field of the Glue job details:
s3://<example-bucket>/jars/iceberg-aws-bundle-1.5.2.jar,s3://<example-bucket>/jars/iceberg-spark-runtime-3.3_2.12-1.5.2.jar
The example configuration is pictured below
Additionally, add a job parameter --s3_bucket_name set to the name of the S3 bucket that you want the job to write the sample Iceberg data to.
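For jobs defined through the API rather than the console, the two settings above correspond to job arguments. The sketch below is an illustration under assumptions: `build_job_arguments` is a hypothetical helper, and it relies on `--extra-jars` being the Glue job parameter behind the Dependent JARs path field, which you should confirm in the AWS Glue job parameters documentation.

```python
from typing import Dict, List


def build_job_arguments(jar_uris: List[str], bucket_name: str) -> Dict[str, str]:
    """Build Glue job arguments for the dependent JARs and the output bucket."""
    for uri in jar_uris:
        # Each URI must be a single S3 path; spaces or commas would break the list.
        if not uri.startswith("s3://") or " " in uri or "," in uri:
            raise ValueError(f"Not a usable S3 URI: {uri!r}")
    return {
        # Comma-separated, no spaces, as described in the step above.
        "--extra-jars": ",".join(jar_uris),
        # Custom parameter read by the job script.
        "--s3_bucket_name": bucket_name,
    }


print(build_job_arguments(
    ["s3://<example-bucket>/jars/iceberg-aws-bundle-1.5.2.jar",
     "s3://<example-bucket>/jars/iceberg-spark-runtime-3.3_2.12-1.5.2.jar"],
    "<example-bucket>",
))
```

The resulting dictionary has the shape expected by the `DefaultArguments` field of a Glue job definition.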
- Add sample code to the Glue job
Copy and paste the code from sample_job.py into the Glue script section.
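For orientation, a job that uses Iceberg through the Glue Data Catalog typically needs Spark settings along the following lines. This is not the contents of sample_job.py; it is a sketch under assumptions: the catalog name `glue_catalog`, the `warehouse/` prefix, and both helper names are illustrative, while the class names are the ones Iceberg publishes for its Spark and AWS integrations.

```python
from typing import Dict


def iceberg_spark_conf(bucket_name: str, catalog: str = "glue_catalog") -> Dict[str, str]:
    """Return Spark settings that register an Iceberg catalog backed by Glue."""
    return {
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        f"spark.sql.catalog.{catalog}": "org.apache.iceberg.spark.SparkCatalog",
        f"spark.sql.catalog.{catalog}.catalog-impl":
            "org.apache.iceberg.aws.glue.GlueCatalog",
        f"spark.sql.catalog.{catalog}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        # Assumed layout: table data is written under s3://<bucket>/warehouse/.
        f"spark.sql.catalog.{catalog}.warehouse": f"s3://{bucket_name}/warehouse/",
    }


def build_session(conf: Dict[str, str]):
    """Create a SparkSession with the given settings (needs pyspark at runtime)."""
    from pyspark.sql import SparkSession  # imported lazily; available inside Glue

    builder = SparkSession.builder
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Inside a Glue job, the bucket name would come from the job parameter:
#   import sys
#   from awsglue.utils import getResolvedOptions
#   args = getResolvedOptions(sys.argv, ["s3_bucket_name"])
#   spark = build_session(iceberg_spark_conf(args["s3_bucket_name"]))
```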
- Save and run the Glue job
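The run can also be started and monitored from a script. The following is a minimal sketch assuming a boto3 Glue client; the helper name `run_job` and the polling interval are assumptions, while `start_job_run` and `get_job_run` are existing boto3 Glue client calls.

```python
import time

# Job run states after which the run will not change any further.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"}


def run_job(glue_client, job_name: str, poll_seconds: float = 15.0) -> str:
    """Start the job, wait for a terminal state, and return that state."""
    run_id = glue_client.start_job_run(JobName=job_name)["JobRunId"]
    while True:
        run = glue_client.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
        state = run["JobRunState"]
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)


# Example call (requires boto3 and AWS credentials; the job name is hypothetical):
#   import boto3
#   print(run_job(boto3.client("glue"), "my-iceberg-job"))
```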