This repo is aimed at making it easy to start playing with and learning about MLOps.

My interest in creating this project was ignited after reading Uber's blog post on (:link: Demand and ETR Forecasting at Airports).
Install Python Poetry:

curl -sSL https://install.python-poetry.org | python3 -

cd into the project folder and run:

$ poetry install

Activate the virtual environment you just created with:

$ poetry shell
You work as a data scientist 👨‍🔬👩‍🔬 at a ride-sharing app company 🚗 (e.g. Uber).
Your job is to help the operations team keep the fleet as busy as possible.
Step 1 - Data Validation ✅
Step 2 - Raw data into time-series data
Step 3 - Time-series data into (features, target) data (sketch below)
Step 4 - From raw data to training data
Step 5 - Explore and visualize the final dataset
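To make Step 3 concrete, here is a minimal sketch (not the repo's actual code) of cutting an hourly time series into (features, target) pairs with a sliding window; the function name and the 24-hour window size are assumptions:

```python
import pandas as pd

def make_features_and_target(ts: pd.DataFrame, n_lags: int = 24):
    """Slide a window over one location's hourly ride counts: the previous
    `n_lags` values become the features, the following value the target."""
    values = ts.sort_values("pickup_hour")["no_of_rides"].to_numpy()
    x_rows, y_rows = [], []
    for i in range(n_lags, len(values)):
        x_rows.append(values[i - n_lags:i])  # last n_lags hourly counts
        y_rows.append(values[i])             # count for the following hour
    features = pd.DataFrame(
        x_rows, columns=[f"rides_t-{n_lags - j}" for j in range(n_lags)]
    )
    target = pd.Series(y_rows, name="rides_next_hour")
    return features, target
```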
- An ML pipeline is a sequence of compute and storage steps that maps recent data to predictions that can be used by the business
Step 1 - Prepare data
- First pipeline: the Data Preparation pipeline, a.k.a. the Feature pipeline
- This component runs every hour; e.g., every hour we extract raw data from an external service, such as a data warehouse or wherever the recent data lives
- Once we fetch the raw data, we create a tabular dataset with features and a target and store it in the feature store (a minimal sketch follows)
- This is the Data Ingestion pipeline
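A minimal sketch of that hourly feature pipeline, assuming hypothetical helpers `fetch_raw_rides` and `transform_to_time_series`, and the Hopsworks `feature_group` shown further below:

```python
from datetime import datetime, timedelta

def run_feature_pipeline(now: datetime) -> None:
    # 1. Extract: pull the last hour of raw ride events from wherever recent data lives
    raw = fetch_raw_rides(from_ts=now - timedelta(hours=1), to_ts=now)  # hypothetical helper

    # 2. Transform: aggregate raw events into one row per (pickup_hour, pickup_location_id)
    ts_data = transform_to_time_series(raw)  # hypothetical helper

    # 3. Load: append the new rows to the feature group in the feature store
    feature_group.insert(ts_data)  # `feature_group` is defined in the Hopsworks section below
```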
Step 2 - Train ML Model
- Second pipeline: the Model Training pipeline
- Retrains the model, since ML models in real-world systems are retrained regularly
- In this project it's on-demand: whenever I want to train the model, I trigger this pipeline, and it automatically trains, generates a new model, and saves it back to the model registry (sketch below)
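A hedged sketch of that on-demand training pipeline; the LightGBM model choice and the helper names are assumptions, not necessarily what the repo uses:

```python
import joblib
import lightgbm as lgb

def run_training_pipeline() -> None:
    # 1. Read the latest (features, target) training data from the feature store
    features, target = load_training_data()  # hypothetical helper

    # 2. Train a fresh model on all the data available so far
    model = lgb.LGBMRegressor()
    model.fit(features, target)

    # 3. Save the artifact and register it as the new production model
    joblib.dump(model, "model.pkl")
    push_to_model_registry("model.pkl")  # hypothetical helper
```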
Step 3 - Generate predictions on recent data
- Third pipeline: the Prediction pipeline
- Uses the most recent features and the current production model to generate predictions (sketch below)
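And a matching sketch of the prediction pipeline (helper names again hypothetical):

```python
import joblib
import pandas as pd

def run_prediction_pipeline() -> pd.DataFrame:
    # 1. Load the current production model from the model registry
    model = joblib.load(download_model_from_registry())  # hypothetical helper

    # 2. Fetch the most recent features from the feature store
    features = load_latest_features()  # hypothetical helper

    # 3. Predict demand for the next hour, one row per pickup location
    predictions = features[["pickup_location_id"]].copy()
    predictions["predicted_rides_next_hour"] = model.predict(
        features.drop(columns=["pickup_location_id"])
    )
    return predictions
```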
Serverless MLOps tools
Hopsworks as our feature store
- It's a serverless platform that provides the infrastructure to manage and run the feature store automatically
- It's easy to manage, unlike GCP or Azure, where we have to set up different components first
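For orientation, connecting to Hopsworks and writing features with its Python client looks roughly like this (the feature group name is made up; the rest follows the public `hopsworks` API):

```python
import hopsworks

# Log in with an API key from your Hopsworks account (e.g. stored as a secret)
project = hopsworks.login(api_key_value="YOUR_HOPSWORKS_API_KEY")
feature_store = project.get_feature_store()

# Create (or retrieve) the feature group holding the hourly ride counts
feature_group = feature_store.get_or_create_feature_group(
    name="time_series_hourly_rides",  # made-up name
    version=1,
    description="Hourly ride counts per pickup location",
    primary_key=["pickup_location_id", "pickup_hour"],
    event_time="pickup_hour",
)

# Append a DataFrame with the columns listed in the Feature Store section below
feature_group.insert(ts_data)
```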
GitHub Actions to schedule and run jobs
- We automate the feature pipeline so that it ingests data every hour
- The notebook runs automatically every hour: it fetches a batch of recent data, transforms it, and saves it to the feature store
- Created a configuration YAML file under .github/workflows (sketch below)
- The cron job runs every hour
- The command below triggers the notebook execution from the command line:

poetry run jupyter nbconvert --to notebook --execute notebooks/12_feature_pipeline.ipynb
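A hedged sketch of what that workflow file might contain (file name, job name, and action/Python versions are assumptions):

```yaml
# .github/workflows/feature_pipeline.yml (hypothetical file name)
name: feature-pipeline

on:
  schedule:
    - cron: '0 * * * *'  # minute 0 of every hour
  workflow_dispatch:     # also allow manual runs

jobs:
  feature-pipeline:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: |
          pip install poetry
          poetry install
      - name: Run the feature pipeline notebook
        run: poetry run jupyter nbconvert --to notebook --execute notebooks/12_feature_pipeline.ipynb
```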
Feature Store
- The feature store is used to store features.
- These features can be used either to train the models or to make predictions.
- Features saved in the feature store are:
  - pickup_hour
  - no_of_rides
  - pickup_location_id
Backfill the Feature Store
- Fetch raw data files for the year 2022
- Transform the raw data into time-series data
- Dump it into the feature store
- Repeat for the year 2023, and so on (a minimal sketch follows)
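A minimal sketch of that backfill loop, assuming monthly raw data files and reusing the hypothetical `transform_to_time_series` helper and the `feature_group` from the Hopsworks section:

```python
import pandas as pd

def backfill_feature_store(years: list[int]) -> None:
    for year in years:
        for month in range(1, 13):
            # 1. Fetch one month of raw rides (the file path pattern is an assumption)
            raw = pd.read_parquet(f"data/raw/rides_{year}-{month:02d}.parquet")

            # 2. Transform the raw rides into hourly time-series rows
            ts_data = transform_to_time_series(raw)  # hypothetical helper

            # 3. Dump the month into the feature store
            feature_group.insert(ts_data)

backfill_feature_store(years=[2022, 2023])
```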
- Work in progress