ML-ops: A Python repository from asjad99

Notes on MLOPS

ML Engineering = Machine learning systems design + Data Engineering + ML Ops

We present an iterative framework for designing real-world machine learning systems.
We take a system-level view and architect a solution from business requirements. The end goal of the framework is a system that is deployable, reliable, and scalable.
Enterprise-grade ML, a term used in a paper from Microsoft, refers to ML applications that face a high level of scrutiny for data handling, model fairness, user privacy, and debuggability. While toy problems that data scientists solve on laptops using a CSV dataset can be intellectually challenging, they are not enterprise-grade machine learning problems.
In deployment (via containers or Spark applications, for example), governance becomes paramount, especially in regulated environments. Data lineage, data versioning, model versioning, model explainability, and model monitoring are all front and center.
Examples of System Design/Data Engineering tasks include:
- Ingest data from a data source
- Build and maintain a data warehouse
- Create a data pipeline
- Create an analytics table for a specific use case
- Migrate data to the cloud
- Schedule and automate pipelines
- Backfill data
- Debug data quality issues
- Optimize queries
- Design a database
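
As a rough illustration of a few of these tasks (ingesting from a source, creating an analytics table, and making a load safe to re-run for backfills), here is a minimal sketch using only Python's standard library. The `orders` and `customer_totals` tables, the inline CSV, and the function names are hypothetical and not part of this repository; SQLite stands in for a real warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this might come from a file, an API, or S3.
RAW_CSV = """order_id,customer,amount
1,alice,30.0
2,bob,12.5
3,alice,7.5
"""


def ingest(conn, raw_csv):
    """Load raw CSV rows into the orders table and return the number of rows read."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    rows = [
        (int(r["order_id"]), r["customer"], float(r["amount"]))
        for r in csv.DictReader(io.StringIO(raw_csv))
    ]
    # INSERT OR REPLACE keeps the load idempotent, so a backfill can be re-run safely.
    conn.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def build_customer_totals(conn):
    """Rebuild an analytics table with one row per customer."""
    conn.execute("DROP TABLE IF EXISTS customer_totals")
    conn.execute(
        "CREATE TABLE customer_totals AS "
        "SELECT customer, COUNT(*) AS n_orders, SUM(amount) AS total_amount "
        "FROM orders GROUP BY customer"
    )
    conn.commit()
    return conn.execute(
        "SELECT customer, n_orders, total_amount "
        "FROM customer_totals ORDER BY customer"
    ).fetchall()


conn = sqlite3.connect(":memory:")
n_loaded = ingest(conn, RAW_CSV)
totals = build_customer_totals(conn)
print(n_loaded, totals)
```

Keying the load on `order_id` is a design choice for this sketch: re-running `ingest` on the same export leaves the table unchanged instead of duplicating rows.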

Overall, ML Engineering entails the following core activities:

Task/Topic	Description of sub-tasks/Topics	Selected Tools	Theory/Notes	Example Code
Frame the problem and Acquire data	1. Identify areas of the business that can benefit from machine learning 2. Translate a business problem into a machine learning problem, e.g. supervised learning 3. Pick a success criterion - how will performance be measured?
Data Storage and Modeling (revise)	1. Acquire relevant data - estimate space and engineering effort - set up a data version control system 2. Create a data model to store data and facilitate access by other team members 3. Set up cloud data warehouses - Kimball methodology 4. Design a database - relational data models (Postgres) 5. Document model - NoSQL data models	PostgreSQL, MongoDB, Google BigQuery, AWS S3
Data Acquisition	Ingest data from a data source, e.g. query a database (SQL or NoSQL) or call the S3 API	PostgreSQL, MongoDB, AWS S3
Data Exploration	- Which features are categorical or numerical? - Which features contain blank, null, or empty values? - What are the data types of the features? - What is the distribution of numerical feature values across the samples? - What is the distribution of categorical features? - Study the correlation between a given target variable and all other variables - Visual data analysis: apply dimensionality reduction to a dataset to facilitate model training or gather insights	Pandas, Matplotlib
Data Cleaning	Handle missing values; handle outliers and erroneous data; reshape into tidy data	Pandas, Apache Spark	Spark Notes	ML_Course
Data Preparation/Feature Engineering	Feature selection; feature encoding; add promising new transformations of features; aggregate features into promising new features		ML_Course	ML_Course
Training models	Using one of the following methods: Linear Regression, Logistic Regression, Decision Trees, Random Forest, XGBoost, Support Vector Machines, K-means, K-Nearest Neighbors, Neural Networks, Principal Component Analysis, Naive Bayes Classifier, Lasso/Ridge regression, etc.; implementing evaluation metrics such as accuracy, precision, recall, intersection over union, or mean average precision (mAP); Grid Search and Cross Validation	scikit-learn	ML_Course	ML_Course
Training Deep Learning Models	Using deep learning for a domain-specific application such as fraud detection, text summarization, machine translation, speech recognition, or object classification, detection, or segmentation; tuning hyperparameters involved in neural network optimization; organizing experiments to get results in the shortest time period; setting up hyperparameter search experiments using tools such as AutoML	TensorFlow, PyTorch
Data Pipelines	Building and maintaining the organization’s data pipeline systems; implementing ETL (or ELT) best practices at scale, e.g. building an ETL pipeline that extracts data from S3, stages it in Redshift, and transforms it into a set of dimensional tables for the analytics team; designing an ETL system	Airflow
Distributed or Accelerated Training	Setting up code to train a model on multiple machines in parallel
Stream Processing	Converting a continuous feature into a categorical feature using bucketing	Spark Streaming, Kafka, AWS Kinesis (real-time streaming)
Setting up a cloud environment to deploy the model	Converting prototyped code into production code; mastering cloud tools and infrastructure; preparing files (usually model architecture and parameters) for deployment; encrypting files that store model parameters, architecture, and data; setting up load-balancing requirements with engineers in charge of AI infrastructure; pruning or quantizing a model so it fits memory requirements	AWS
Present / Launch Solution	Building APIs for an application to use a model - setting up RESTful HTTP API services to productionize the model; setting up authorization and authentication to access the API	Flask, etc.
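
The evaluation metrics named in the Training models row can be sketched in plain Python. This is a minimal illustration, assuming binary labels and axis-aligned boxes given as `(x_min, y_min, x_max, y_max)`; the function names and sample data are made up for the example, and in practice a library such as scikit-learn would likely be used instead.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for binary labels."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, and recall; undefined ratios fall back to 0.0."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


# Made-up labels and boxes, purely for illustration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
print(metrics, overlap)
```

Which metric to use as the success criterion depends on the problem framing in the first row: precision and recall matter more than accuracy when classes are imbalanced, and IoU applies to detection or segmentation outputs.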

Nice to have skills:

Task	Description
Containers	Kubernetes, Docker
Create Data Lakes with Spark	Data wrangling with Spark; setting up a Spark cluster with AWS; debugging and optimisation; intro to data lakes
Feature Store	A kind of in-memory database that keeps model features readily available for real-time inference
Speeding up model prediction time	- Applying techniques such as pruning, quantization, or compression to reduce memory requirements - Running inference speed vs. accuracy experiments on a model
Primer on distributed systems	Replication, Partitioning, Transactions, Consistency and Consensus
Deal with constantly shifting distributions	Data drift: real-world data will not always follow the same distribution. For example, the way a person shops in spring differs from the way they shop in winter. A model trained on spring data and then deployed has never been tested on winter data, so when winter comes its inputs have drifted away from the training distribution. This is something to keep an eye on. Model drift: once the model is deployed and making online (real-time) predictions, its performance will degrade over time because of data drift, so you need to track those changes, retrain the model on the latest data, and redeploy it.
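
One way to keep an eye on data drift is to compare the distribution of a feature at training time with its distribution in production. The sketch below uses the Population Stability Index (PSI), which is one possible choice and not something these notes prescribe. The synthetic "spring" and "winter" spend values, the bin count, and the 0.25 alert threshold (a commonly cited rule of thumb) are all assumptions.

```python
import math
import random


def bin_proportions(values, edges, eps=1e-6):
    """Share of values per bin; values outside the edges are clamped to the end bins."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        idx = 0
        while idx < len(counts) - 1 and v >= edges[idx + 1]:
            idx += 1
        counts[idx] += 1
    # eps avoids log(0) and division by zero when a bin is empty.
    return [max(c / len(values), eps) for c in counts]


def population_stability_index(reference, current, n_bins=10):
    """PSI between a reference sample and a current sample of one numeric feature."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins
    # Equal-width bins over the reference range (an assumption; quantile bins also work).
    edges = [lo + i * width for i in range(n_bins + 1)]
    ref_p = bin_proportions(reference, edges)
    cur_p = bin_proportions(current, edges)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


# Synthetic data: basket value in the training season vs. later seasons.
rng = random.Random(0)
spring = [rng.gauss(50, 10) for _ in range(2000)]
spring_again = [rng.gauss(50, 10) for _ in range(2000)]
winter = [rng.gauss(65, 10) for _ in range(2000)]

psi_stable = population_stability_index(spring, spring_again)
psi_drifted = population_stability_index(spring, winter)

ALERT_THRESHOLD = 0.25  # assumed rule-of-thumb cutoff, tune per feature
needs_retraining = psi_drifted > ALERT_THRESHOLD
print(psi_stable, psi_drifted, needs_retraining)
```

A check like this could run on a schedule against recent production inputs, with a breach of the threshold triggering the retrain-and-redeploy step described above.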
