Serverless Spark Hands-On Workshop

Apache Spark is often used for interactive queries, machine learning, and real-time workloads.

Spark developers typically spend only 40% of their time writing code and the remaining 60% tuning infrastructure and managing clusters.

Google Cloud customers have used our auto-scaling, serverless Spark to boost productivity and reduce infrastructure costs.

This repository contains Serverless Spark on GCP hands-on labs built around common use cases. By doing these labs, data engineers and data scientists with Apache Spark experience will ramp up faster on Serverless Spark on GCP.

Check out this repository for ready-to-use, config-driven Dataproc Serverless Spark templates that solve simple but large in-cloud data tasks, including data import, export, backup, and restore, as well as bulk API operations.

Feedback From Serverless Spark Users

  • "Serverless Spark is so much easier than traditional cluster based products."
    ~ Director of Data Science at business management corporation

  • "Anytime we can go the serverless route we will. Just so much simpler and eliminates the management of the infrastructure."
    ~ Director of Data Engineering at business management corporation

  • “Serverless Spark enables us to only use the compute resources we need when we need them and all with a single click. The Spark Workshop is a great way to get hands on experience with the tools.”
    ~ Principal Data Scientist at multinational retail corporation

  • “We ran a compute-intensive Serverless Spark query in 19 mins. That same Spark query took 90 mins on a traditional cluster based product. It's ~80% faster on Serverless Spark.”
    ~ Principal Architect at multinational retail corporation

What's Covered?

| # | Module | Focus | Feature |
|---|--------|-------|---------|
| 1 | Lab 1 - Cell Tower Anomaly Detection | Data Engineering | Serverless Spark Batch from CLI with Cloud Composer orchestration |
| 2 | Lab 2 - Wikipedia Page View Analysis | Data Analysis | Serverless Spark Batch from BigQuery UI |
| 3 | Lab 3 - Chicago Crimes Analysis | Data Analysis | Serverless Spark Interactive from Vertex AI managed notebook |
| 4 | Lab 4 - Retail Store Analytics | Data Analysis | Serverless Spark Batch from CLI with Cloud Composer orchestration and Dataproc Metastore |
| 5 | Lab 5 - Serverless Spark Streaming | Data Analysis | Serverless Spark Dataproc Batches |
| 6 | Lab 6 - Timeseries Forecasting | Data Analysis | Vertex AI notebooks with Serverless Spark session |
| 7 | Lab 7 - COVID-19 Economic Impact | Data Analysis | Vertex AI notebooks with Serverless Spark session |
| 8 | Lab 8 - Malware Detection | Data Analysis | Serverless Spark Batch from CLI with Cloud Composer orchestration |
| 9 | Lab 9 - Social Media Data Analytics | Data Analysis | Vertex AI notebooks with Serverless Spark session |
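
Several of the labs above submit a Serverless Spark batch from the CLI. As a rough illustration of what such a submission looks like, the sketch below assembles a `gcloud dataproc batches submit pyspark` command line in Python without running it. The helper name `build_batch_submit_command` and every value passed to it (script path, project, region, batch ID, bucket) are hypothetical placeholders, not taken from the labs; follow each lab's own instructions for the actual commands and flags.

```python
import shlex


def build_batch_submit_command(main_file, project, region, batch_id,
                               deps_bucket, job_args=()):
    """Return the argv list for a `gcloud dataproc batches submit pyspark` call.

    This helper is hypothetical and only builds the command; it does not
    execute anything or contact Google Cloud.
    """
    command = [
        "gcloud", "dataproc", "batches", "submit", "pyspark", main_file,
        f"--project={project}",
        f"--region={region}",
        f"--batch={batch_id}",
        f"--deps-bucket={deps_bucket}",
    ]
    if job_args:
        # Everything after "--" is passed through to the PySpark script itself.
        command += ["--", *job_args]
    return command


if __name__ == "__main__":
    # All values below are placeholders chosen for illustration only.
    argv = build_batch_submit_command(
        main_file="gs://my-code-bucket/jobs/example_job.py",
        project="my-project-id",
        region="us-central1",
        batch_id="example-batch-001",
        deps_bucket="gs://my-deps-bucket",
        job_args=["--input", "gs://my-data-bucket/input/"],
    )
    # Print a shell-safe version of the command for review before running it.
    print(shlex.join(argv))
```

Building the command as a list keeps each argument separate, so it can be reviewed or handed to `subprocess.run` without shell quoting issues.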

Credits

Some of the labs were contributed by Google Cloud partners or by Googlers.
Lab 1 - TEKsystems
Lab 2 - TEKsystems
Lab 3 - Anagha Khanolkar (@anagha-google)
Lab 4 - TEKsystems
Lab 5 - TEKsystems
Lab 6 - TEKsystems
Lab 7 - TEKsystems
Lab 8 - TEKsystems
Lab 9 - TEKsystems

Contributing

See the contributing instructions to start contributing.

License

All solutions within this repository are provided under the Apache 2.0 license. Please see the LICENSE file for more detailed terms and conditions.

Disclaimer

This repository and its contents are not an official Google product.

Contact

Interested in doing a guided, hands-on Spark Workshop? Please fill out this form.