
Docker image for running Spark 3 on Kubernetes on AWS

spark-k8s-aws

Build for a Kubernetes-ready Apache Spark Docker image configured with notable AWS dependencies, such as the AWS Glue Data Catalog Client for Hive.

Build the Docker Image

Builds are managed with Earthly (https://earthly.dev). To build the image, run:

earthly --use-inline-cache +build-spark-image

To use the image as a base in your own Earthfile build:

my-image:
  FROM github.com/viaduct-ai/docker-spark-k8s-aws+build-spark-image
  # ...
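
As a rough sketch of what such a target might look like once filled in, the fragment below copies an application jar into the image and saves the result. The target name, `my-job.jar`, the `/opt/spark/jars/` destination, and the image tag are all placeholders chosen for illustration, not names taken from this project. It also assumes the base image follows the conventional `/opt/spark` layout, which is worth verifying against the built image.

```earthfile
# Hypothetical target in your own Earthfile; every name below is a placeholder.
my-spark-job:
  # Remote target reference: <repository>+<target>
  FROM github.com/viaduct-ai/docker-spark-k8s-aws+build-spark-image
  # Assumption: Spark lives under /opt/spark in the base image.
  COPY my-job.jar /opt/spark/jars/
  # Tag the resulting image locally under a name of your choosing.
  SAVE IMAGE my-org/my-spark-job:latest
```

With a target like this in place, `earthly +my-spark-job` would build it from the directory containing the Earthfile.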

Why?

If you've ever tried building a Spark distribution or image with the AWS Glue Data Catalog Client for Hive, you know it's a PITA.

This project aims to open-source a working Docker image, built with the amazing Earthly tool, to democratize a more integrated Apache Spark on Kubernetes on AWS experience. It is meant as a stopgap until someone develops a Spark DataSourceV2 API-compliant Glue Data Catalog implementation, which would replace the current absolute hack of patching Hive and building Spark from source.

Many thanks to @bbenzikry for open-sourcing their solution for building Spark 3 + Glue compatible Docker images. This project builds on their work.