This is a repository of examples, templates, and informative links that guide SageMaker Studio users in managing clusters and running EMR workloads in conjunction with SageMaker ML tasks. Typically, these workloads use Apache Spark, but we'll also show integrations with other analytic libraries like PyHive and Presto.
SageMaker Studio provides users the ability to visually browse and connect to Amazon EMR clusters right from the Studio notebook. Additionally, you can now create, stop, and manage EMR clusters directly from Studio.
For more information and examples, see the Example EMR Templates' README.
SageMaker Studio supports interactive EMR processing through a graphical and programmatic way of connecting to existing EMR clusters. Several kernels include the SageMaker Studio Analytics Extension for seamless EMR connectivity and generating pre-signed SparkUI links for debugging.
Users can leverage the SparkMagic kernels to work interactively with remote Spark clusters through Livy. Alternatively, libraries such as PyHive can be used once a connection to the cluster has been established.
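As a rough illustration of the PyHive route, the sketch below lists the tables visible through HiveServer2 on a cluster's primary node. It is only a sketch under stated assumptions: the host name is a placeholder, port 10000 and the `hadoop` user are common defaults rather than values taken from this repository, and `hive_connection_kwargs` and `list_tables` are hypothetical helper names.

```python
def hive_connection_kwargs(host, port=10000, username="hadoop", database="default"):
    """Collect the keyword arguments passed to pyhive.hive.connect.

    Assumptions: 10000 is the usual HiveServer2 port and "hadoop" is a
    commonly used EMR user; adjust both for your cluster.
    """
    return {"host": host, "port": port, "username": username, "database": database}


def list_tables(host, **overrides):
    """Return the table names in the selected Hive database."""
    # Imported lazily so this module can be loaded without PyHive installed.
    from pyhive import hive

    conn = hive.connect(**hive_connection_kwargs(host, **overrides))
    try:
        cursor = conn.cursor()
        cursor.execute("SHOW TABLES")
        # Each row holds a single column with the table name.
        return [row[0] for row in cursor.fetchall()]
    finally:
        conn.close()
```

In a notebook this might be called as `list_tables("<emr-primary-node-dns>")`, where the host is the placeholder for your cluster's primary node address and the notebook must have network access to it.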
Lastly, we show examples of running Spark locally within SageMaker Studio notebooks, since this is often done while prototyping, prior to standing up an EMR cluster.
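A minimal sketch of local prototyping with PySpark might look like the following. It assumes `pyspark` and a Java runtime are available in the notebook's kernel image; the helper names, the application name, and the sample data are illustrative and not taken from this repository.

```python
def local_master_url(cores=None):
    """Build a Spark master URL for local mode.

    None means "use all available cores" (local[*]); an integer pins the
    number of worker threads (local[N]).
    """
    if cores is None:
        return "local[*]"
    if cores < 1:
        raise ValueError("cores must be >= 1")
    return f"local[{cores}]"


def create_local_spark(app_name="studio-local-prototype", cores=None):
    """Start (or reuse) a SparkSession that runs inside the notebook's container."""
    # Imported lazily so the helper above works without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master(local_master_url(cores))
        .appName(app_name)
        .getOrCreate()
    )


def run_demo():
    """Create a tiny DataFrame locally, display it, and stop the session."""
    spark = create_local_spark(cores=2)
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
    df.show()
    spark.stop()
```

Calling `run_demo()` in a notebook cell runs everything in the notebook's own container, so no EMR cluster is involved. Once the logic works on a small sample, the same DataFrame code can generally be pointed at a cluster-backed session instead.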
For more information and examples, see the Interactive Spark directory's README.
For more information and examples, see the Submitting Spark Job directory's README.
We've created a guided workshop for users to become familiar with SageMaker Studio's EMR integration.
For more information, see the Workshop directory's README.
- What is the sagemaker-spark repository and how does it relate?
- How can I run local spark testing within SageMaker Studio notebooks?
- How can I interact with AWS Glue from SageMaker Studio?
We use black to format .py and .ipynb files in this repository.
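One way to check formatting before opening a pull request is sketched below. It assumes black is installed with its Jupyter extra (`pip install "black[jupyter]"`) so that notebooks are handled as well; the helper names are hypothetical, and this repository may define its own tooling for the same purpose.

```python
import subprocess
import sys


def black_command(paths, check_only=True):
    """Build the command line for running black as a module.

    With check_only=True the --check flag is passed, so files are reported
    but not rewritten.
    """
    cmd = [sys.executable, "-m", "black"]
    if check_only:
        cmd.append("--check")
    return cmd + list(paths)


def run_black(paths, check_only=True):
    """Run black on the given paths and return its exit code."""
    return subprocess.run(black_command(paths, check_only)).returncode
```

For example, `run_black(["."])` reports files that would be reformatted without changing them, and `run_black(["."], check_only=False)` applies the formatting in place.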