Practical Machine Learning at scale with Serverless Spark on GCP and Vertex AI

1. About

This repo is a hands-on lab for scalable machine learning with Spark MLlib on Google Cloud. It is powered by Dataproc Serverless Spark and showcases integration with the Vertex AI platform. The focus is on demystifying the products and their integration rather than on building a perfect model, and the lab features a minimum viable end-to-end machine learning use case.


2. Format & Duration

The lab is fully scripted (no research needed), with fully automated environment setup and all data, code, commands, notebooks, orchestration, and configuration provided. Clone the repo and follow the step-by-step instructions for an end-to-end MLOps experience.

Expect to spend ~8 hours to fully understand and execute the lab if you are new to GCP and the services, and at least ~6 hours otherwise.


3. Level

L300 - framework (Spark), services/products, integration


4. Audience

The intended audience is anyone with access to Google Cloud and an interest in the use case, products, and features showcased.


5. Prerequisites

Knowledge of Apache Spark, machine learning, and GCP products is beneficial but not strictly required, given the format of the lab. Access to Google Cloud is required unless you only want to read the content.


6. Goal

Simplify your learning and adoption journey of our product stack for scalable data science with:

  1. Just enough product knowledge of Dataproc Serverless Spark & Vertex AI integration for machine learning at scale on Google Cloud
  2. Quick start code for ML at scale with Spark that can be repurposed for your data and ML experiments
  3. Terraform for provisioning a variety of Google Cloud data services in the Spark ML context, that can be repurposed for your use case

7. Use case covered

Telco customer churn prediction with a Kaggle dataset and a Spark MLlib Random Forest classifier.
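To make the use case concrete, here is a minimal sketch of what a Spark MLlib Random Forest pipeline for churn could look like. It is not the lab's actual code: the function names, the `_idx` suffix, the hyperparameter defaults, and the `Churn` label column are illustrative assumptions, and the numeric columns are assumed to be already cast to numeric types with no nulls.

```python
from typing import List, Sequence


def indexed_names(columns: Sequence[str], suffix: str = "_idx") -> List[str]:
    """Return the StringIndexer output column name for each categorical column."""
    return [f"{c}{suffix}" for c in columns]


def build_churn_pipeline(
    categorical_cols: Sequence[str],
    numeric_cols: Sequence[str],
    label_col: str = "Churn",  # assumed label column name
    num_trees: int = 50,       # illustrative defaults, not tuned values
    max_depth: int = 5,
    seed: int = 42,
):
    """Build an unfitted Pipeline: index strings, assemble features, train RF."""
    # PySpark is imported lazily so the helper above can be used without Spark.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.feature import StringIndexer, VectorAssembler

    cat_idx = indexed_names(categorical_cols)
    stages = [
        # Map each categorical string column to a numeric index column.
        StringIndexer(
            inputCols=list(categorical_cols),
            outputCols=cat_idx,
            handleInvalid="keep",
        ),
        # Map the string label (for example "Yes"/"No") to a numeric label.
        StringIndexer(inputCol=label_col, outputCol="label"),
        # Combine indexed categorical and numeric columns into one vector.
        VectorAssembler(inputCols=cat_idx + list(numeric_cols), outputCol="features"),
        RandomForestClassifier(
            labelCol="label",
            featuresCol="features",
            numTrees=num_trees,
            maxDepth=max_depth,
            seed=seed,
        ),
    ]
    return Pipeline(stages=stages)
```

With a Spark DataFrame `train_df` in hand, usage would be along the lines of `model = build_churn_pipeline(["Contract", "InternetService"], ["tenure", "MonthlyCharges"]).fit(train_df)`, where the column names are examples only.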


8. Solution Architecture

8.1. Experimenting with Spark model training, tuning and batch scoring



About Dataproc Serverless Spark Interactive: Fully managed, autoscalable, secure Spark infrastructure as a service for use with Jupyter notebooks on Vertex AI Workbench managed notebooks. Use as an interactive Spark IDE, for accelerating development and speed to production.

8.2. Operationalizing Spark Model Training



About Dataproc Serverless Spark Batches: Fully managed, autoscalable, secure Spark jobs as a service that eliminates administration overhead and resource contention, simplifies development and accelerates speed to production. Learn more about the service here.

  • Find templates that accelerate speed to production here
  • To request free Serverless Spark training from Google Cloud, reach out to us here
  • Try out our other Serverless Spark centric hands on labs here
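As a rough illustration of what submitting a Serverless Spark batch involves, the sketch below assembles a `gcloud dataproc batches submit pyspark` command as an argument list. The helper name, the bucket path, the batch ID, and the choice of optional flags are assumptions made for illustration; the lab's own scripts may submit batches differently.

```python
import shlex
from typing import List, Optional, Sequence


def batch_submit_command(
    main_py_uri: str,
    batch_id: str,
    region: str,
    subnet: Optional[str] = None,
    service_account: Optional[str] = None,
    container_image: Optional[str] = None,
    job_args: Sequence[str] = (),
) -> List[str]:
    """Build the argv list for a Dataproc Serverless PySpark batch submission."""
    cmd = [
        "gcloud", "dataproc", "batches", "submit", "pyspark", main_py_uri,
        f"--batch={batch_id}",
        f"--region={region}",
    ]
    if subnet:
        cmd.append(f"--subnet={subnet}")
    if service_account:
        cmd.append(f"--service-account={service_account}")
    if container_image:
        cmd.append(f"--container-image={container_image}")
    if job_args:
        # Everything after "--" is passed through to the PySpark script.
        cmd.append("--")
        cmd.extend(job_args)
    return cmd


if __name__ == "__main__":
    # Hypothetical values; print the command instead of running it.
    argv = batch_submit_command(
        "gs://my-bucket/train.py",
        "churn-train-001",
        "us-central1",
        job_args=["--mode", "train"],
    )
    print(shlex.join(argv))
```

Returning a list rather than a single string keeps the command safe to pass to `subprocess.run` without shell quoting concerns.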

8.3. Operationalizing Spark Batch Scoring

There are multiple options.

8.3.1. Directly from within Spark

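Scoring directly from within Spark usually means loading the persisted pipeline model and applying it to a new batch of records. The sketch below shows that pattern under several assumptions: the model was saved as a Spark ML `PipelineModel`, input and output are Parquet, and `customerID` is the record key. The function names and URIs are hypothetical.

```python
from typing import List


def output_columns(id_col: str = "customerID") -> List[str]:
    """Columns kept in the scored output: the key plus the model outputs."""
    return [id_col, "prediction", "probability"]


def score_batch(spark, model_uri: str, input_uri: str, output_uri: str,
                id_col: str = "customerID"):
    """Load a saved PipelineModel, score a batch, and write the results."""
    # Lazy import so output_columns() can be used without a Spark install.
    from pyspark.ml import PipelineModel

    model = PipelineModel.load(model_uri)        # fitted pipeline saved earlier
    customers = spark.read.parquet(input_uri)    # assumed Parquet input
    scored = model.transform(customers).select(*output_columns(id_col))
    scored.write.mode("overwrite").parquet(output_uri)
    return scored
```

Because the fitted pipeline carries its own feature transformations, the scoring job only needs raw input columns that match those used at training time.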


8.3.2. Through Vertex AI serving

Vertex AI supports operationalizing batch serving of Spark ML models in conjunction with MLeap.
ARCHITECTURE DIAGRAM TO BE ADDED
CODE MODULE - Work in progress


9. Flow of the lab



For your convenience, all the code is pre-authored, so you can focus on understanding product features and integration.


10. The lab modules

Complete the lab modules in sequence. For a better lab experience, read through all the modules before you start working on them.

#  | Module                                                                              | Duration
---|-------------------------------------------------------------------------------------|-----------
01 | Terraform for environment provisioning                                              | 1 hour
02 | Tutorial on Dataproc Serverless Spark Interactive Sessions for authoring Spark code | 15 minutes
03 | Author PySpark ML experiments with Serverless Spark Interactive notebooks           | 1 hour
04 | Author PySpark ML scripts in preparation for authoring a model training pipeline    | 1 hour
05 | Author a Vertex AI model training pipeline                                          | 1 hour
06 | Author a Cloud Function that calls your Vertex AI model training pipeline           | 15 minutes
07 | Create a Cloud Scheduler job that invokes the Cloud Function you created            | 15 minutes
08 | Author a Cloud Composer Airflow DAG for batch scoring and schedule it               | 15 minutes
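Modules 05 to 07 chain a scheduler, a Cloud Function, and a Vertex AI pipeline. As a hedged sketch of the middle step, the code below shows how a function body might submit a compiled pipeline template with the Vertex AI Python SDK. The helper names, the display-name prefix, and all URIs are assumptions for illustration and are not taken from the lab's code.

```python
from typing import Any, Dict, Optional


def pipeline_job_kwargs(
    run_id: str,
    template_uri: str,
    pipeline_root: str,
    parameters: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Assemble keyword arguments for aiplatform.PipelineJob."""
    return {
        "display_name": f"churn-training-{run_id}",  # hypothetical naming scheme
        "template_path": template_uri,               # compiled pipeline JSON
        "pipeline_root": pipeline_root,              # GCS folder for artifacts
        "parameter_values": dict(parameters or {}),
        "enable_caching": False,                     # rerun every step each time
    }


def trigger_training_pipeline(project: str, region: str, service_account: str,
                              **job_kwargs: Any):
    """Submit the pipeline run without waiting for it to finish."""
    # Lazy import so the helper above works without the SDK installed.
    from google.cloud import aiplatform

    aiplatform.init(project=project, location=region)
    job = aiplatform.PipelineJob(**pipeline_job_kwargs(**job_kwargs))
    job.submit(service_account=service_account)
    return job
```

A Cloud Function handler could call `trigger_training_pipeline(...)` with values read from environment variables, and a Cloud Scheduler job would then invoke that function on a schedule.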

The lab includes custom container image creation and usage.

11. Don't forget to

Shut down/delete resources when done to avoid unnecessary billing.


12. Credits

#  | Google Cloud Collaborators      | Contribution
---|---------------------------------|-------------------------------------------------------
1. | Anagha Khanolkar                | Creator
2. | Dr. Thomas Abraham, Brian Kang  | ML consultation, testing, best practices and feedback
3. | Rob Vogelbacher, Proshanta Saha | ML consultation
4. | Ivan Nardini, Win Woo           | ML consultation, inspiration through samples and blogs

The source code was evolved by the creator from a base developed by a partner for Google Cloud.


13. Contributions welcome

Community contributions to improve the lab are very much appreciated.


14. Getting help

If you have any questions or find any problems with this repository, please report them through GitHub issues.


15. Release History

Date     | Details
---------|------------------------------------------------------------
20220930 | Added serializing model to MLeap bundle.
         | Affects:
         | 1. Terraform main.tf
         | 2. Hyperparameter tuning notebook
         | 3. Hyperparameter tuning PySpark script
         | 4. Vertex AI pipeline notebook
         | 5. Vertex AI pipeline JSON template
20221202 | Added a Python SDK (notebook) sample for preprocessing