
merlin-on-vertex

scaling deep retrieval workloads using NVIDIA's Merlin & NVTabular on Google Cloud's Vertex AI platform

Vertex AI is Google Cloud's unified machine learning platform, built to help data scientists and ML engineers increase the pace of experimentation, deploy faster, and manage models with confidence.

NVIDIA Merlin is an open-source framework for building large-scale deep learning recommender systems.


See this repo for a sample development workflow


Repo structure

  • launch a Vertex pipeline to orchestrate GPU-based data preprocessing with NVTabular
  • build a two-tower encoder with Merlin Models
  • prepare the training application (container image) with Cloud Build
  • scale training with Vertex AI and A100 GPUs
  • use candidate embeddings generated by the Vertex training job to create a Matching Engine serving index
  • create and deploy ANN and brute-force indexes
  • compute recall and retrieval latency
  • orchestrate end-to-end model training and deployment (notebooks 02 and 03)
  • create a candidate index and deploy it to an index endpoint with Vertex AI Matching Engine
  • use the trained towers and deployed Matching Engine index to generate playlist recommendations for your own (or any public) Spotify playlist(s)
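The brute-force retrieval and recall steps above can be sketched in miniature with plain Python. Real embeddings come from the trained towers and the index is served by Matching Engine; the toy vectors and track ids here are invented:

```python
def brute_force_top_k(query, candidates, k=3):
    """Rank candidate embeddings by dot-product score against a query embedding."""
    scores = [
        (cid, sum(q * c for q, c in zip(query, emb)))
        for cid, emb in candidates.items()
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [cid for cid, _ in scores[:k]]

def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant items that appear in the top-k retrieved list."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

# Toy candidate index: item id -> embedding.
candidates = {
    "track_a": [0.9, 0.1],
    "track_b": [0.2, 0.8],
    "track_c": [0.7, 0.3],
}
query = [1.0, 0.0]
top = brute_force_top_k(query, candidates, k=2)
print(top)                                  # ['track_a', 'track_c']
print(recall_at_k(top, ["track_a"], k=2))   # 1.0
```

An ANN index trades a small amount of this exact-ranking recall for much lower retrieval latency, which is why the notebooks measure both against the brute-force baseline.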

The Python modules are in the src folder:

  • src/preprocessor - data preprocessing utility functions and classes
  • src/process_pipes - Vertex pipeline components for orchestrating data preprocessing
  • src/serving/app - deployment and serving utility functions and classes
  • src/train_pipes - Vertex pipeline components for orchestrating the training and deployment pipeline
  • src/trainer - model definitions and training application

Objectives

Deploying trained models and serving predictions with Vertex Prediction (Triton server coming soon)
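Vertex Prediction endpoints accept a JSON body with an `instances` list on their `:predict` route. A minimal sketch of building such a payload for a deployed query tower (the feature names below are hypothetical; the real ones depend on the trained model's input schema):

```python
import json

# Hypothetical playlist features for a query-tower prediction request; the
# actual feature names come from the trained model's input signature.
request_body = {
    "instances": [
        {"pl_name_src": "workout jams", "num_pl_songs": 42},
    ]
}

# Serialize for the endpoint's :predict route.
payload = json.dumps(request_body)
print(payload)
```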

TODOs


The dataset

Spotify's Million Playlist Dataset (MPD) - see here for downloading and preparing the dataset in BigQuery

Ching-Wei Chen, Paul Lamere, Markus Schedl, and Hamed Zamani. Recsys Challenge 2018: Automatic Music Playlist Continuation. In Proceedings of the 12th ACM Conference on Recommender Systems (RecSys ’18), 2018.


Data preprocessing pipeline

[Data preprocessing pipeline diagram]

  • Create and save training and validation splits
  • Store data split statistics and schema (NVTabular's workflow)
  • Transform data splits and prepare them for training and serving tasks
  • Orchestrate these NVTabular pipelines with Vertex Managed Pipelines
  • Scale pipeline processing tasks with single or multiple GPU configurations

With 4 Tesla T4 GPUs per processing component, the pipeline processes the Spotify MPD in ~27 minutes
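In miniature, the split-and-profile steps above look like the following pure-Python sketch. The repo does this at scale with NVTabular on GPUs, and NVTabular's workflow records the statistics and schema for you; the row layout and column names here are invented:

```python
import random

def split_rows(rows, valid_frac=0.2, seed=42):
    """Shuffle deterministically, then split into training / validation sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_valid = int(len(rows) * valid_frac)
    return rows[n_valid:], rows[:n_valid]

def column_stats(rows, column):
    """Per-column statistics of the kind a workflow schema records."""
    values = [r[column] for r in rows]
    return {
        "min": min(values),
        "max": max(values),
        "cardinality": len(set(values)),
    }

rows = [{"track_id": i % 5, "duration_ms": 180_000 + i * 1_000} for i in range(100)]
train, valid = split_rows(rows)
print(len(train), len(valid))             # 80 20
print(column_stats(train, "track_id"))    # min/max/cardinality for the split
```

Fitting statistics on the training split only, then applying the same transformation to both splits, is what keeps the serving-time feature transforms consistent with training.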


Training -> Deployment pipeline

[Training and deployment pipeline diagram]

  • Build custom containers for training and serving
  • Train the Merlin retrieval model
  • Import the Query and Candidate Towers into the pipeline DAG
  • Register and deploy the Query Tower with Vertex AI
  • Create Matching Engine indexes and index endpoints
  • Deploy indexes to index endpoints
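Candidate embeddings destined for a dot-product index are commonly L2-normalized first, so that dot-product search ranks by cosine similarity. A small sketch of that preprocessing step (not the repo's actual code):

```python
import math

def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / max(norm, eps) for x in vector]

candidate = [3.0, 4.0]   # toy candidate embedding
unit = l2_normalize(candidate)
print(unit)              # [0.6, 0.8]
```

With every candidate on the unit sphere, the relative ranking produced by a dot-product ANN index matches cosine-similarity ranking, which makes recall numbers comparable between the ANN and brute-force indexes.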