SageMaker Spark Container

Spark Overview

Apache Spark™ is a unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for stream processing.
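
For example, a minimal PySpark script exercising the DataFrame and Spark SQL APIs might look like the following sketch (the application name, data, and column names are illustrative):

```python
from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session.
spark = SparkSession.builder.appName("example").getOrCreate()

# Build a small DataFrame in memory.
df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)

# Query it with the DataFrame API ...
df.filter(df.age > 30).show()

# ... or with Spark SQL.
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()

spark.stop()
```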

SageMaker Spark Container

The SageMaker Spark Container is a Docker image used to run batch data processing workloads on Amazon SageMaker using the Apache Spark framework. This repository contains the sources used to build the pre-built container images that the SageMaker Python SDK uses when running Spark jobs on Amazon SageMaker. The pre-built images are available in Amazon Elastic Container Registry (Amazon ECR), and this repository also serves as a reference for anyone who wants to build a customized Spark container for use on Amazon SageMaker.

For the list of available Spark images, see Available SageMaker Spark Images.
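
If you build a customized image from this repository and push it to your own Amazon ECR repository, you can point the SageMaker Python SDK at it through the image_uri parameter of PySparkProcessor (usage of the SDK is covered in the section below). A minimal sketch, in which the account ID, Region, repository name, tag, and IAM role are placeholders:

```python
from sagemaker.spark.processing import PySparkProcessor

# Placeholder URI for a customized Spark image pushed to your own ECR repository.
custom_image_uri = "123456789012.dkr.ecr.us-west-2.amazonaws.com/my-sagemaker-spark:latest"

spark_processor = PySparkProcessor(
    base_job_name="custom-spark",
    image_uri=custom_image_uri,  # use the customized image instead of a pre-built one
    role="arn:aws:iam::123456789012:role/MySageMakerRole",  # placeholder IAM role
    instance_count=2,
    instance_type="ml.m5.xlarge",
)
```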

License

This project is licensed under the Apache-2.0 License.

Usage in the SageMaker Python SDK

The simplest way to get started with the SageMaker Spark Container is to use the pre-built images via the SageMaker Python SDK.

For details on running Spark jobs with the SDK, see the Amazon SageMaker Processing section of the SageMaker Python SDK documentation.
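
As a starting point, a PySpark application can be submitted as a SageMaker Processing job with the PySparkProcessor class. The sketch below assumes placeholder values for the IAM role, S3 URIs, script name, and framework version; replace them with values from your own account:

```python
from sagemaker.spark.processing import PySparkProcessor

# Placeholder IAM role -- replace with a role from your own account.
role = "arn:aws:iam::123456789012:role/MySageMakerRole"

spark_processor = PySparkProcessor(
    base_job_name="sm-spark",
    framework_version="3.1",  # selects one of the pre-built Spark images
    role=role,
    instance_count=2,
    instance_type="ml.m5.xlarge",
    max_runtime_in_seconds=1200,
)

# Submit a PySpark script (hypothetical local file) as a Processing job.
spark_processor.run(
    submit_app="./preprocess.py",
    arguments=[
        "--input", "s3://my-bucket/input/",    # placeholder S3 input
        "--output", "s3://my-bucket/output/",  # placeholder S3 output
    ],
    spark_event_logs_s3_uri="s3://my-bucket/spark-event-logs",  # optional: persist Spark event logs
)
```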

Getting Started With Development

To get started building and testing the SageMaker Spark Container, you will need to set up a local development environment.

See the instructions in DEVELOPMENT.md.

Contributing

To contribute to this project, please read through CONTRIBUTING.md.