Recommendations at "Reasonable Scale": joining dataOps with deep learning recSys with Merlin and Metaflow
July 2022: this is a WIP, come back often for updates, a blog post and my NVIDIA talk (FORTHCOMING)!
This project is a collaboration with the Outerbounds, NVIDIA Merlin, and Comet teams, in an effort to release as open source a realistic data and ML pipeline for cutting-edge recommender systems that "just works". Anyone can do great ML, not just Big Tech, if you know how to pick and choose your tools.
TL;DR: (after setup) a single ML person is able to train a cutting edge deep learning model (actually, several versions of it in parallel), test it and deploy it without any explicit infrastructure work, without talking to any DevOps person, without using anything that is not Python or SQL.
As a use case, we pick a popular RecSys challenge, user-item recommendations for the fashion industry: given the past purchases of a shopper, can we train a model to predict what they will buy next? In the current V1.0, we target a typical offline training, cached predictions setup: we prepare the top-k recommendations for our users in advance, and store them in a fast cache to be served when shoppers go online.
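For intuition, the cached-predictions pattern boils down to the sketch below: compute top-k items per user offline, then push them to a key-value store for fast online lookups. This is a minimal, hypothetical example using boto3 and DynamoDB; the table and field names are illustrative, not the ones used by the flow.

```python
# A minimal sketch of the "cached predictions" pattern (illustrative names, not the flow's code):
# compute top-k items per user offline, then push them to a key-value store for fast lookups.
import boto3

def cache_recommendations(user_to_topk: dict, table_name: str = "userItemRecs"):
    """user_to_topk maps a user id to a ranked list of item ids computed offline."""
    table = boto3.resource("dynamodb").Table(table_name)  # hypothetical table name
    with table.batch_writer() as batch:
        for user_id, items in user_to_topk.items():
            batch.put_item(Item={"userId": str(user_id), "recommendations": items})

# Usage: cache_recommendations({"42": ["item_1", "item_7", "item_9"]})
```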
Our goal is to build a pipeline with all the necessary real-world ingredients:
- dataOps with Snowflake and dbt;
- training Merlin models on GPUs, in parallel, leveraging Metaflow;
- advanced testing with RecList (FORTHCOMING);
- serving cached predictions through FaaS and SaaS (AWS Lambda, DynamoDB, the serverless framework).
At a quick glance, this is what we are building:
For an in-depth explanation of the philosophy behind the approach, please check the companion blog post (FORTHCOMING).
If you like this project, please add a star on GitHub here, and check out / share / star the RecList package.
This project builds on our open roadmap for "MLOps at Reasonable Scale", automated documentation of pipelines, and rounded evaluation for RecSys:
- NEW: Upcoming CIKM RecSys Evaluation Challenge;
- RecList (project website);
- You don't need a bigger boat (repo, paper, talk);
- Post-Modern Stack (repo);
- DAG Cards are the new model cards (NeurIPS paper).
The code is a self-contained, end-to-end recommender project; however, since we leverage best-in-class tools, some preliminary (one time) setup is required. Please make sure the requirements are satisfied, depending on what you wish to run and on what you are already using - roughly in order of ascending complexity:
The basics: Metaflow, Snowflake and dbt
A Snowflake account is needed to host the data, and a working Metaflow setup is needed to run the flow on AWS GPUs if you wish to do so:
- Snowflake account: sign-up for a free trial.
- AWS account: sign-up for a free AWS account.
- Metaflow on AWS: follow the setup guide - in theory, the pipeline should also work with a local setup (i.e. no additional work after installing the `requirements`), if you don't need cloud computing. However, we strongly recommend a fully AWS-compatible setup. The current flow has been tested with Metaflow out-of-the-box (no config, all local), Metaflow with AWS data store but all local computing, and Metaflow with AWS data store and AWS Batch with GPU computing.
- dbt core setup: on top of installing the package in `requirements.txt`, you need to properly configure your `dbt_profile`.
Adding experiment tracking
- Comet ML: sign-up for free and get an API key. If you don't want experiment tracking, make sure to comment out the Comet-specific parts in the `train_model` step.
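For reference, this is roughly what experiment tracking looks like inside a training step. It is a minimal, hypothetical sketch, assuming `COMET_API_KEY` is set in your environment; the project name and the logged values are illustrative, not the exact code in `train_model`.

```python
# A minimal, hypothetical sketch of Comet tracking inside a training step - not the exact
# code in train_model, but the same idea: log hyper-parameters and metrics per run.
import os
from comet_ml import Experiment  # import comet_ml before your deep learning framework

def log_training_run(params: dict, metrics: dict) -> str:
    exp = Experiment(
        api_key=os.environ["COMET_API_KEY"],
        project_name="merlin-recs",  # hypothetical project name
    )
    exp.log_parameters(params)   # e.g. learning rate, batch size, embedding sizes
    exp.log_metrics(metrics)     # e.g. validation loss, recall@k
    exp.end()
    return exp.get_key()

# Usage, e.g. at the end of the training step:
# run_key = log_training_run({"lr": 1e-3, "batch_size": 1024}, {"val_recall_at_10": 0.21})
```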
Adding FaaS deployment
- AWS Lambda setup: if the env variable `SAVE_TO_CACHE` is set to `1`, the Metaflow pipeline will try to cache the recommendations for the users in the test set in DynamoDB. Those recommendations can be served through an endpoint using AWS Lambda. If you wish to serve your recommendations, you need to run the serverless project in the `serverless` folder before running the flow: the project will create both a DynamoDB table and a working GET endpoint. To do so: first, install the serverless framework and connect it with your AWS account; second, `cd` into the `serverless` folder and run `AWS_PROFILE=tooso serverless deploy` (where `AWS_PROFILE` selects a specific AWS config with permission to run the framework, and can be omitted if you use your default). If all goes well, the CLI will create the relevant resources and print out the URL for your public rec API, e.g. `endpoint: GET - https://xafacoa313.execute-api.us-west-2.amazonaws.com/dev/itemRecs`: you can verify the endpoint is working by pasting the URL in the browser (the response will be empty, as you still need to run the flow to populate DynamoDB). Make sure the region of deployment in the `serverless.yml` file is the same as the one in the Metaflow pipeline. Note that while we use the serverless framework for convenience, the same setup can be done manually, if preferred.
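For illustration, the GET endpoint amounts to a small Lambda function that looks up pre-computed recommendations in DynamoDB. This is a hedged sketch, not the exact handler in the `serverless` folder; the table, key, and field names are illustrative.

```python
# A hypothetical sketch of the GET endpoint: an AWS Lambda handler that looks up
# pre-computed recommendations in DynamoDB (table / key / field names are illustrative).
import json
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("userItemRecs")  # illustrative - check serverless.yml for the real name

def lambda_handler(event, context):
    # the user id arrives as a query string parameter, e.g. ?userId=123
    user_id = (event.get("queryStringParameters") or {}).get("userId", "")
    response = table.get_item(Key={"userId": user_id})
    recs = response.get("Item", {}).get("recommendations", [])
    return {
        "statusCode": 200,
        "body": json.dumps({"userId": user_id, "recommendations": recs}),
    }
```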
A note on containers
At the moment of writing, Merlin does not have an official ECR image, so we pulled the following image: `nvcr.io/nvidia/merlin/merlin-tensorflow-training:22.05` and slightly changed the entry point to work with Metaflow. The `docker` folder contains the relevant files - the current flow uses a public ECR repository we prepared on our AWS account (`public.ecr.aws/b3x2d2n0/metaflow_merlin`) when running training on AWS Batch; if you wish to use your own ECR, or the repo above becomes unavailable for whatever reason, you can just change the relevant `image` parameter in the flow.
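In Metaflow, the container used by AWS Batch is set through the `@batch` decorator, so swapping images is a one-line change. The flow below is purely illustrative, not the actual `my_merlin_flow.py`.

```python
# Illustrative only - not the actual my_merlin_flow.py. In Metaflow, the container used by
# AWS Batch is set through the @batch decorator, so swapping images is a one-line change.
from metaflow import FlowSpec, step, batch

MERLIN_IMAGE = "public.ecr.aws/b3x2d2n0/metaflow_merlin"  # replace with your own ECR image

class MerlinTrainingSketch(FlowSpec):

    @step
    def start(self):
        self.next(self.train_model)

    # in the real project, Batch execution is toggled via the EN_BATCH variable
    @batch(gpu=1, memory=24000, image=MERLIN_IMAGE)
    @step
    def train_model(self):
        # Merlin training code runs here, inside the container
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    MerlinTrainingSketch()
```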
We recommend using Python 3.8 for this project.
Set up a virtual environment with the project dependencies:
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Note that if you never plan on running Merlin's training locally, but only through AWS Batch, you can skip installing the Merlin and TensorFlow libraries to run the flow.
Inside `src`, create a version of the `local.env` file named only `.env` (do not commit it!), and fill in its values:
| VARIABLE | TYPE (DEFAULT) | MEANING |
|---|---|---|
| SF_USER | string | Snowflake user name |
| SF_PWD | string | Snowflake password |
| SF_ACCOUNT | string | Snowflake account |
| SF_DB | string | Snowflake database |
| SF_ROLE | string | Snowflake role to run SQL |
| EN_BATCH | 0-1 (0) | Enable cloud computing for Metaflow |
| COMET_API_KEY | string | Comet ML API key |
| SAVE_TO_CACHE | 0-1 (0) | Enable storing predictions in an external cache for serving. If 1, you need to deploy the AWS Lambda (see above) before running the flow |
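Once the `.env` file is in place, the flow can pick these values up at startup; below is a minimal sketch, assuming `python-dotenv` is used (the names match the table above, but the exact loading code in the repo may differ).

```python
# A minimal sketch (assuming python-dotenv) of how the flow can pick up the .env values;
# the actual loading code in the repo may differ.
import os
from dotenv import load_dotenv

load_dotenv(".env")  # run from inside src/, where the .env file lives

SF_CONFIG = {
    "user": os.environ["SF_USER"],
    "password": os.environ["SF_PWD"],
    "account": os.environ["SF_ACCOUNT"],
    "database": os.environ["SF_DB"],
    "role": os.environ["SF_ROLE"],
}
EN_BATCH = os.getenv("EN_BATCH", "0") == "1"            # cloud computing off by default
SAVE_TO_CACHE = os.getenv("SAVE_TO_CACHE", "0") == "1"  # DynamoDB caching off by default
```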
The original dataset is from the H&M data challenge.
- Download the files `articles.csv`, `customers.csv`, and `transactions_train.csv`, and put them in the `src/data` folder.
- Run `upload_to_snowflake.py` as a one-off script: the program will dump the dataset to Snowflake, using a typical modern data stack pattern. This allows us to use dbt and Metaflow to run realistic ELT and ML code.
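For intuition, the upload boils down to reading the CSVs with pandas and pushing them to Snowflake. This is a hedged sketch using the Snowflake connector's `write_pandas`; the target table names and paths are illustrative, not necessarily the ones used by `upload_to_snowflake.py`.

```python
# A hedged sketch of the one-off upload: read the H&M csv files with pandas and push them
# to Snowflake with write_pandas. Table names and paths are illustrative.
import os
import pandas as pd
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

conn = snowflake.connector.connect(
    user=os.environ["SF_USER"],
    password=os.environ["SF_PWD"],
    account=os.environ["SF_ACCOUNT"],
    database=os.environ["SF_DB"],
    role=os.environ["SF_ROLE"],
)

for csv_file, table in [
    ("data/articles.csv", "ARTICLES"),
    ("data/customers.csv", "CUSTOMERS"),
    ("data/transactions_train.csv", "TRANSACTIONS_TRAIN"),
]:
    df = pd.read_csv(csv_file)
    # depending on your connection defaults, you may need to pass database/schema explicitly
    write_pandas(conn, df, table_name=table, auto_create_table=True)
```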
Once you run the script, check your Snowflake for the new tables:
After the data is loaded, we use dbt as our transformation tool of choice. While you can run dbt code as part of a Metaflow pipeline, we keep the dbt part separate in this project to simplify the runtime component: it will be trivial (as shown here for example) to orchestrate the SQL code within Metaflow if you wish to do so. After the data is loaded in Snowflake:
- `cd` into the `dbt` folder;
- run `dbt run`.
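As mentioned above, folding the dbt run into the Metaflow pipeline is straightforward if you prefer a single entry point. The flow below is purely illustrative (it is not part of the current project): a step can simply shell out to dbt.

```python
# Purely illustrative - the current project runs dbt separately. A Metaflow step could
# simply shell out to dbt if you prefer a single entry point for the whole pipeline.
import subprocess
from metaflow import FlowSpec, step

class DataPrepSketch(FlowSpec):

    @step
    def start(self):
        self.next(self.run_dbt)

    @step
    def run_dbt(self):
        # run the SQL transformations defined in the dbt folder
        subprocess.run(["dbt", "run"], cwd="../dbt", check=True)
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    DataPrepSketch()
```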
Check your Snowflake for the new tables created by dbt:
In particular, the table `"EXPLORATION_DB"."HM_POST"."FILTERED_DATAFRAME"` represents a dataframe in which user, article, and transaction data are all joined together - the Metaflow pipeline will read from this table, leveraging the pre-processing done at scale through dbt and Snowflake.
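For reference, reading the dbt output back into the pipeline is a single query. Below is a minimal sketch of how a `get_dataset`-style step might pull the table with the Snowflake connector, using the credentials from the `.env` variables above; the actual flow code may differ.

```python
# A minimal sketch of how a get_dataset-style step might read the dbt output; the actual
# flow code may differ (e.g. it may use dask_cudf instead of pandas).
import os
import snowflake.connector

def fetch_filtered_dataframe(limit: int = None):
    """Pull the joined user / article / transaction rows prepared by dbt."""
    conn = snowflake.connector.connect(
        user=os.environ["SF_USER"],
        password=os.environ["SF_PWD"],
        account=os.environ["SF_ACCOUNT"],
        role=os.environ["SF_ROLE"],
    )
    query = 'SELECT * FROM "EXPLORATION_DB"."HM_POST"."FILTERED_DATAFRAME"'
    if limit:
        query += f" LIMIT {limit}"
    cursor = conn.cursor()
    cursor.execute(query)
    return cursor.fetch_pandas_all()  # returns a pandas DataFrame
```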
Once the above setup steps are completed, you can run the flow:
- `cd` into the `src` folder;
- run the flow with `METAFLOW_PROFILE=metaflow AWS_PROFILE=tooso AWS_DEFAULT_REGION=us-west-2 python my_merlin_flow.py run --max-workers 4 --with card`, where `METAFLOW_PROFILE` is needed to select a specific Metaflow config (you can omit it if you're using the default), `AWS_PROFILE` is needed to select a specific AWS config that runs the flow and its related AWS infrastructure (you can omit it if you're using the default), and `AWS_DEFAULT_REGION` is needed to specify the target AWS region (you can omit it if it's already specified in your local AWS profile and you do not wish to change it).
At the end of the flow, you can inspect the default DAG Card with `METAFLOW_PROFILE=metaflow AWS_PROFILE=tooso AWS_DEFAULT_REGION=us-west-2 python my_merlin_flow.py card view get_dataset`:
If you run the flow with the full setup, you will end up with:
- versioned datasets and model artifacts, accessible through the standard Metaflow client API;
- a dashboard for experiment tracking, including a quick panel to inspect predicted items for selected shoppers;
- an automated, versioned documentation for your pipeline, in the form of Metaflow cards;
- a live, scalable endpoint serving batched predictions using AWS Lambda and DynamoDB.
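As a quick example of the first point, artifacts from any run can be retrieved programmatically with the Metaflow client API. This is a minimal sketch; the flow name and artifact names are illustrative - inspect `run.data` to see what the flow actually stores.

```python
# A minimal sketch of the Metaflow client API; the flow name and artifact names are
# illustrative - inspect run.data to see what the flow actually stores.
from metaflow import Flow

run = Flow("myMerlinFlow").latest_successful_run  # name must match the class in my_merlin_flow.py
print("Latest successful run:", run.id, "finished at:", run.finished_at)

# artifacts are exposed as attributes of run.data, e.g.:
# predictions = run.data.predictions
# model_config = run.data.model_config
```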
There is still plenty to improve; among the items on our roadmap:

- we are now running predictions for all models in parallel over our target set of shoppers. This is wasteful, as we should run predictions only for the winning model, after tests confirm model quality - for now, it lets us sidestep the issue of serializing and restoring the Merlin model;
- improving prediction logging and, more generally, analysis: given our roadmap, improvements will partially come automatically from the RecList Beta roadmap;
- testing the magic folder package to share Merlin folders across steps;
- making sure dependencies are easy to adjust depending on your setup - e.g. dask_cudf vs. pandas;
- supporting other RecSys use cases, possibly with more complex deployment options (e.g. Triton on SageMaker).
TBC
- What if my datasets are not static to begin with, but depend on real interactions? We open-sourced a serverless pipeline that shows how data ingestion could work under the same philosophical principles.
- I want to add tool X, or replace Y with Z: how modular is this pipeline? Our aim is to present a pipeline simple enough to be quickly grasped, yet complex enough to sustain a real deep learning model and industry use case. That said, what worked for us may not work as well for you: e.g. you may wish to change experiment tracking (e.g., an abstraction for Neptune is here), use a different data warehouse solution (e.g. BigQuery), or orchestrate the entire thing differently (check again here for a Prefect-based solution). We start by providing a flow that "just works", but our focus is mainly on the functional pieces, not just the tools: what are the essential computations we need to run a modern RecSys pipeline? If you find other tools work better for you, please go ahead - and let us know, feedback is always useful!
TBC
Main Contributors:
- Jacopo, general design, Metaflow fan boy, prototype;
- the Outerbounds team, in particular Hamel for Metaflow guidance, Valay for AWS Batch support;
- the NVIDIA Merlin team, in particular Gabriel, Ronay, Ben, Even.
Special thanks:
- Dhruv Nair from Comet for double-checking our experiment tracking setup and suggesting improvements.
All the code in this repo is freely available under an MIT License, also included in the project.