
Software Engineering for Data Intensive Applications

Notes on MLOps

ML Engineering = Machine learning systems design + Data Engineering + ML Ops

  • We present an iterative framework for designing real-world machine learning systems.
  • We take a system-level view and architect a solution from business requirements; the end goal of this framework is a system that is deployable, reliable, and scalable.
  • Enterprise-grade ML, a term used in a paper from Microsoft, refers to ML applications that face a high level of scrutiny for data handling, model fairness, user privacy, and debuggability. Toy problems that data scientists solve on laptops with a CSV dataset can be intellectually challenging, but they are not enterprise-grade machine learning problems.
  • In deployment (via containers or Spark applications, for example), governance becomes paramount, especially in regulated environments. Data lineage, data versioning, model versioning, model explainability, and model monitoring all become central concerns.
  • Examples of System Design/Data Engineering tasks include:
    • Ingest data from a data source
    • Build and maintain a data warehouse
    • Create a data pipeline
    • Create an analytics table for a specific use case
    • Migrate data to the cloud
    • Schedule and automate pipelines
    • Backfill data
    • Debug data quality issues
    • Optimize queries
    • Design a database

Overall, ML engineering entails the following core activities:

Each entry lists the task or topic, its sub-tasks, and, where available, selected tools, theory/notes, and example code.
Frame the problem and acquire data
  1. Identify areas of the business that can benefit from machine learning
  2. Translate the business problem into a machine learning problem, e.g. supervised learning
  3. Pick a success criterion: how will performance be measured?
Data Storage and Modeling (revise)
  1. Acquire relevant data: estimate the storage space and engineering effort required, and set up a data version control system
  2. Create a data model to store the data and make it accessible to other team members
  3. Set up cloud data warehouses (Kimball methodology)
  4. Design a database: relational data models (PostgreSQL)
  5. Document model: NoSQL data models
  Selected tools: PostgreSQL, MongoDB, Google BigQuery, AWS S3
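The database design and Kimball items above can be sketched as a small star schema: one fact table that references two dimension tables. This is only an illustration. It uses Python's built-in `sqlite3` as a stand-in for PostgreSQL or a cloud warehouse, and every table name, column name, and row is hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL or a cloud warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    country     TEXT NOT NULL
);
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY,
    title      TEXT NOT NULL,
    category   TEXT NOT NULL
);
-- The fact table holds measurements plus foreign keys to the dimensions.
CREATE TABLE fact_sales (
    sale_id     INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES dim_customer(customer_id),
    product_id  INTEGER NOT NULL REFERENCES dim_product(product_id),
    quantity    INTEGER NOT NULL,
    amount      REAL NOT NULL
);
""")

# Hypothetical sample rows.
conn.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)",
                 [(1, "Ada", "UK"), (2, "Linus", "FI")])
conn.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                 [(10, "Keyboard", "hardware"), (11, "Editor licence", "software")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?, ?)",
                 [(100, 1, 10, 2, 80.0), (101, 2, 10, 1, 40.0), (102, 1, 11, 1, 25.0)])


def revenue_by_category(connection):
    """Join the fact table to a dimension and aggregate, a typical analytics query."""
    rows = connection.execute("""
        SELECT p.category, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_id = f.product_id
        GROUP BY p.category
        ORDER BY p.category
    """).fetchall()
    return dict(rows)
```

Keeping descriptive attributes in the dimension tables and numeric measures in the fact table is what lets analytics queries stay as simple joins plus aggregations.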
Data Acquisition
  • Ingest data from a data source, e.g. by querying a database (SQL or NoSQL) or calling the S3 API
  Selected tools: PostgreSQL, MongoDB, AWS S3
Data Exploration
  • Which features are categorical and which are numerical?
  • Which features contain blank, null, or empty values?
  • What are the data types of the various features?
  • What is the distribution of numerical feature values across the samples?
  • What is the distribution of categorical features?
  • Study the correlation between a given target variable and all other variables
  • Visual data analysis
  • Applying dimensionality reduction to a dataset to facilitate model training or gather insights
  Selected tools: Pandas, Matplotlib
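The exploration questions above can be answered on a tiny dataset with plain Python. In practice Pandas would do this in a line or two; the standard library is used here only so the sketch runs without dependencies. The sample rows, column names, and helper functions are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical sample rows; None marks a missing value.
rows = [
    {"age": 25, "city": "Oslo", "income": 30.0},
    {"age": 35, "city": "Rome", "income": 50.0},
    {"age": None, "city": "Oslo", "income": 40.0},
    {"age": 45, "city": None, "income": 70.0},
]


def missing_counts(records):
    """Count None values per column."""
    counts = Counter()
    for record in records:
        for column, value in record.items():
            counts[column] += value is None
    return dict(counts)


def column_kinds(records):
    """Label a column numerical if all its non-missing values are int or float."""
    kinds = {}
    for column in records[0]:
        values = [r[column] for r in records if r[column] is not None]
        numeric = all(isinstance(v, (int, float)) for v in values)
        kinds[column] = "numerical" if numeric else "categorical"
    return kinds


def pearson(xs, ys):
    """Pearson correlation between two equally long numeric sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Correlate a feature (age) with the target (income) where both are present.
pairs = [(r["age"], r["income"]) for r in rows
         if r["age"] is not None and r["income"] is not None]
age_income_corr = pearson([a for a, _ in pairs], [i for _, i in pairs])

# Distribution of a categorical feature.
city_distribution = Counter(r["city"] for r in rows if r["city"] is not None)
```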
Data Cleaning
  • Handle missing values
  • Handle outliers and erroneous data
  • Get the data into tidy form
  Selected tools: Pandas, Apache Spark
  Theory/Notes: Spark Notes
  Example code: ML_Course
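Two of the cleaning steps above, handling missing values and handling outliers, can be sketched with the standard library. Median imputation and clipping to Tukey's fences are just one common choice each, not the only options, and the function names are hypothetical.

```python
import statistics


def impute_median(values):
    """Replace None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def clip_outliers_iqr(values, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]
```

Clipping keeps the row and only limits the extreme value; dropping the row instead is equally valid when the value is known to be erroneous.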
Data Preparation/Feature Engineering
  • Feature selection
  • Feature encoding
  • Add promising new transformations of features
  • Aggregate features into promising new features
  Theory/Notes: ML_Course
  Example code: ML_Course
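Feature encoding, one of the preparation steps above, can be illustrated with a hand-written one-hot encoder. This is a minimal sketch; scikit-learn and Pandas ship their own encoders, and the function name here is hypothetical.

```python
def one_hot_encode(values):
    """Map each category to a 0/1 vector; columns follow the sorted category set."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        vector = [0] * len(categories)
        vector[index[value]] = 1
        encoded.append(vector)
    return categories, encoded
```

Returning the category list alongside the vectors matters at inference time, because the same column order has to be reused for new data.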
Training models
  • Using one of the following methods: Linear Regression, Logistic Regression, Decision Trees, Random Forest, XGBoost, Support Vector Machines, K-means, K-Nearest Neighbors, Neural Networks, Principal Component Analysis, Naive Bayes Classifier, Lasso/Ridge regression, etc.
  • Implementing evaluation metrics such as accuracy, precision, recall, intersection over union, or mean average precision (mAP)
  • Grid search and cross-validation
  Selected tools: scikit-learn
  Theory/Notes: ML_Course
  Example code: ML_Course
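Some of the evaluation metrics named above, and the index split behind k-fold cross-validation, are short enough to write by hand. The sketch below is for illustration only; scikit-learn provides tested versions of these, and the function names and box format `(x1, y1, x2, y2)` are assumptions.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def iou(box_a, box_b):
    """Intersection over union for axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for contiguous k-fold splits."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, validation
        start = stop
```

Grid search then amounts to looping over candidate hyperparameter values and averaging a metric across the folds for each candidate.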
Training Deep Learning Models
  • Using deep learning for a domain-specific application such as fraud detection, text summarization, machine translation, speech recognition, or object classification, detection, or segmentation
  • Tuning the hyperparameters involved in neural network optimization
  • Organizing experiments to get results in the shortest time
  • Setting up hyperparameter search experiments using tools such as AutoML
  Selected tools: TensorFlow, PyTorch
Data Pipelines
  • Building and maintaining the organization's data pipeline systems
  • Implementing ETL (or ELT) best practices at scale, e.g. building an ETL pipeline that extracts data from S3, stages it in Redshift, and transforms it into a set of dimensional tables for the analytics team
  • Designing an ETL system
  Selected tools: Airflow
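The extract, stage, and transform flow described above can be sketched end to end with the standard library. This is not Airflow code: `graphlib.TopologicalSorter` stands in for the scheduler's dependency ordering, an in-memory CSV string stands in for a file in S3, and SQLite stands in for Redshift. All table names, task names, and data are hypothetical.

```python
import csv
import io
import sqlite3
from graphlib import TopologicalSorter

# Stand-in for a raw file in object storage (hypothetical data).
RAW_CSV = "user_id,song,plays\n1,Intro,3\n2,Intro,1\n1,Outro,2\n"


def extract(conn):
    """Stage the raw rows untouched, as text, in a staging table."""
    conn.execute("CREATE TABLE staging_plays (user_id TEXT, song TEXT, plays TEXT)")
    reader = csv.DictReader(io.StringIO(RAW_CSV))
    rows = [(r["user_id"], r["song"], r["plays"]) for r in reader]
    conn.executemany("INSERT INTO staging_plays VALUES (?, ?, ?)", rows)


def transform_dim_song(conn):
    """Build a dimension table of distinct songs from the staged data."""
    conn.execute(
        "CREATE TABLE dim_song AS SELECT DISTINCT song AS title FROM staging_plays"
    )


def transform_fact_plays(conn):
    """Build a typed fact table from the staged data."""
    conn.execute("""
        CREATE TABLE fact_plays AS
        SELECT CAST(user_id AS INTEGER) AS user_id,
               song AS title,
               CAST(plays AS INTEGER) AS plays
        FROM staging_plays
    """)


# Each task maps to the set of tasks that must finish first, like edges in a DAG.
DEPENDENCIES = {
    extract: set(),
    transform_dim_song: {extract},
    transform_fact_plays: {extract},
}


def run_pipeline():
    """Run every task once, in an order that respects the dependencies."""
    conn = sqlite3.connect(":memory:")
    for task in TopologicalSorter(DEPENDENCIES).static_order():
        task(conn)
    return conn
```

Declaring dependencies as data rather than calling tasks in a fixed order is the same idea a scheduler relies on to retry, backfill, or parallelize individual steps.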
Distributed or accelerated training
  • Setting up code to train a model on multiple machines in parallel
Stream Processing
  • Converting a continuous feature into a categorical feature using bucketing
  Selected tools: Spark Streaming, Kafka, AWS Kinesis (real-time streaming)
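Bucketing a continuous feature, as described above, reduces to a binary search over sorted boundaries. The sketch below applies it lazily with a generator, which loosely mimics processing one event at a time. It is not Spark Streaming or Kafka code, and the boundaries and labels are hypothetical.

```python
from bisect import bisect_right

# Hypothetical bucket boundaries and labels for an "age" feature.
AGE_BOUNDARIES = [18, 35, 65]
AGE_LABELS = ["child", "young_adult", "adult", "senior"]


def bucketize(value, boundaries, labels):
    """Return the label of the bucket containing value.

    boundaries must be sorted and len(labels) must equal len(boundaries) + 1.
    A value equal to a boundary falls into the bucket to its right.
    """
    return labels[bisect_right(boundaries, value)]


def bucketize_stream(values):
    """Bucket values lazily, one event at a time."""
    for value in values:
        yield bucketize(value, AGE_BOUNDARIES, AGE_LABELS)
```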
Setting up a cloud environment to deploy the model
  • Converting prototyped code into production code
Mastering cloud tools and infrastructure
  • Preparing files (usually model architecture and parameters) for deployment
  • Encrypting files that store model parameters, architecture, and data
  • Setting up load-balancing requirements with the engineers in charge of AI infrastructure
  • Pruning or quantizing a model so that it fits memory requirements
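Quantizing a model to fit memory requirements can be illustrated on a plain list of weights. The sketch below uses symmetric linear quantization to the integer range [-127, 127], which is one simple scheme among several; frameworks such as TensorFlow and PyTorch provide their own quantization tooling. The function names are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric linear quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]
```

Each weight then needs one byte instead of four (for float32), at the cost of a rounding error of at most half the scale per weight.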
Present / Launch Solution
  • Building APIs for an application to use a model: setting up HTTP RESTful API services to productionize it
  • Setting up authorization and authentication to access the API
  Selected tools: Flask, etc.
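A prediction endpoint with a simple token check, as described above, can be sketched as follows. Flask is the tool these notes name; the standard library's `http.server` is swapped in here only so the example runs without dependencies. The token value, the `/predict` path, the placeholder model, and all function names are hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler

API_TOKEN = "change-me"  # hypothetical; load from a secret store in practice


def predict(features):
    """Placeholder model: a fixed linear score with hypothetical weights."""
    weights = [0.5, -0.25]
    return sum(w * x for w, x in zip(weights, features))


def handle_predict(auth_header, body):
    """Return (status_code, response_dict) for a POST /predict request."""
    if auth_header != f"Bearer {API_TOKEN}":
        return 401, {"error": "unauthorized"}
    try:
        features = json.loads(body)["features"]
        score = predict([float(x) for x in features])
    except (ValueError, KeyError, TypeError):
        return 400, {"error": "expected a JSON object with a 'features' list"}
    return 200, {"score": score}


class PredictHandler(BaseHTTPRequestHandler):
    """Thin HTTP wrapper; the request logic lives in handle_predict."""

    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        status, payload = handle_predict(self.headers.get("Authorization"), body)
        data = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

# To serve locally (not run here):
#   from http.server import HTTPServer
#   HTTPServer(("127.0.0.1", 8000), PredictHandler).serve_forever()
```

Keeping authentication and validation in a plain function makes the behaviour testable without starting a server, and the same function could sit behind a Flask route.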

Nice to have skills:

Task: Description
Containers: Kubernetes, Docker
Create data lakes with Spark: data wrangling with Spark; setting up a Spark cluster on AWS; debugging and optimisation; introduction to data lakes
Feature store: a kind of in-memory database that keeps model features readily available for real-time inference
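The idea of a feature store as an in-memory lookup can be sketched with a dictionary keyed by entity ID. This is a toy illustration, not the design of any particular feature store product; the class name, the time-to-live behaviour, and the injectable clock are assumptions made for the example.

```python
import time


class InMemoryFeatureStore:
    """Toy key-value feature store with a time-to-live per entry."""

    def __init__(self, ttl_seconds=3600.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._rows = {}

    def put(self, entity_id, features):
        """Store a copy of the features together with the write time."""
        self._rows[entity_id] = (self._clock(), dict(features))

    def get(self, entity_id, default=None):
        """Return fresh features for entity_id, or default if missing or expired."""
        entry = self._rows.get(entity_id)
        if entry is None:
            return default
        written_at, features = entry
        if self._clock() - written_at > self._ttl:
            del self._rows[entity_id]
            return default
        return features
```

Expiring stale entries is one way to avoid serving features at inference time that no longer reflect the entity's current state.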
Speeding up model prediction time: applying techniques such as pruning, quantization, or compression to reduce memory requirements; running inference speed vs. accuracy experiments on a model
Primer on distributed systems: replication, partitioning, transactions, consistency and consensus
Deal with constantly shifting distributions:
  • Data drift: real-world data does not always keep the same distribution. For example, the way a person shops in spring differs from the way they shop in winter. A model trained on spring data and then deployed has never been tested on winter behaviour, so when winter comes its inputs have drifted away from the distribution it was trained on. This is something to keep an eye on.
  • Model drift: once a model is deployed and making online (real-time) predictions, its performance degrades over time because of data drift, so you need to keep track of those changes. You then retrain the model on the latest data and redeploy it.
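One way to keep an eye on data drift is to compare the distribution of a feature in production against the distribution seen at training time. The sketch below computes the Population Stability Index (PSI), which is one common choice and is not prescribed by these notes. The function name, the bin count, and the small `eps` floor for empty bins are assumptions.

```python
import math


def psi(expected, actual, n_bins=5, eps=1e-6):
    """Population Stability Index between a reference sample and a current sample."""
    low, high = min(expected), max(expected)
    width = (high - low) / n_bins or 1.0  # fall back to 1.0 if all values are equal

    def proportions(sample):
        counts = [0] * n_bins
        for x in sample:
            # Values outside the reference range are clamped into the edge bins.
            i = min(max(int((x - low) / width), 0), n_bins - 1)
            counts[i] += 1
        # Floor empty bins at eps so the logarithm below is defined.
        return [max(c / len(sample), eps) for c in counts]

    p, q = proportions(expected), proportions(actual)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))
```

A PSI of zero means the binned distributions match; values above roughly 0.25 are often treated as a significant shift, though that threshold is a rule of thumb and should be tuned per feature. A sustained high value is a reasonable trigger for retraining on recent data.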