/meow-learning

MeowLearning: Train your team on Production AI - quickly

This is a pick-your-problem style guide I created to educate everyone from my leadership team to my ML engineers on how to work with AI in production settings. This is the stuff you won't learn in most ML/AI courses.

Previously CopyCat's internal AI Guidelines

All reading topics are listed in reading order.

Fundamentals

Readings covering a fundamental overview of how Machine Learning Systems are made

A Taxonomy of ML and AI for those who are unfamiliar with the field

Machine Learning Systems Design - Part 1

Machine Learning Systems Design - Part 2

If you're tackling a specific system design stage and want to dive deeper, I'd recommend perusing specific parts of:

and/or its more formal book version

Rules of Machine Learning: | Google Developers

Large Language Models | Full Stack Deep Learning

My Golden Rule above everything else

Focus on the speed at which you can run valid experiments. It is the only way to find a viable model for your problem.

What experience has taught me

  • Your problems will often lie more in the data than in the modelling. The fundamentals of data science (Shit In, Shit Out) always apply, so focus on figuring out how to refine your data quality and use a system built around that.

Product Leadership team can stop here

AutoML: Before anything else - because the Golden Rule is key!

Recent experiences have shown me that AutoML has come a long way in the past five to seven years, especially for Tabular Machine Learning. So my latest recommendation is to use it. Use it first.

  • Get your data in order ~ clean | preprocess | shrink/project if needed
  • Use AutoML
  • See what baselines it gives - if it works out of the box, I'm happy for you but very jealous! :p
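
For example, here's a minimal AutoGluon baseline sketch for the checklist above. It's a sketch, not a recipe: it assumes `pip install autogluon` and a CSV with a column named `label` (the file names and column name are placeholders).

```python
# A minimal AutoGluon baseline sketch; file and column names are placeholders.
from autogluon.tabular import TabularDataset, TabularPredictor

train_data = TabularDataset("train.csv")   # your cleaned/preprocessed data
test_data = TabularDataset("test.csv")     # held out before any training

predictor = TabularPredictor(label="label").fit(train_data, time_limit=600)

# Leaderboard of every model AutoGluon tried, scored on your held-out data.
print(predictor.leaderboard(test_data))
```

If the leaderboard numbers already clear your baseline, you may be done; otherwise you at least know what you have to beat.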

When not to use AutoML right away?

  • Sometimes you've got a really complex problem, and/or no one has solved something similar before, and/or a "lot" of data. In these cases AutoML will probably be inefficient compared to taking a first stab at research to narrow down which architecture and preprocessing to use.

AutoML tools

Tip: As a quick-and-dirty experiment, you can use a foundation model like CLIP-ViT or GPTx as a pre-processor to turn any data into structured data (embeddings) for your task.

*Structured ~ Tabular ~ Embedding ~ Preprocessed
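
As a concrete (and hedged) version of the tip above, here's a sketch that turns images into embeddings with CLIP via Hugging Face transformers; the image paths are hypothetical and it assumes `pip install transformers torch pillow`.

```python
# Embed images with CLIP so tabular AutoML tools can treat them as structured features.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

images = [Image.open(p) for p in ["button_01.png", "button_02.png"]]  # hypothetical paths
inputs = processor(images=images, return_tensors="pt")

with torch.no_grad():
    embeddings = model.get_image_features(**inputs)  # one fixed-length vector per image

features = embeddings.numpy()  # feed these rows into any "Structured" tool below
```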

  1. Lazy Predict - Structured
  2. AutoGluon - Structured | Image | Text | Multimodal | Time series
  3. H2O - Tabular possibly Structured
  4. MLJar - Structured - Has auto-feature engineering
  5. AutoPytorch - Structured
  6. AutoSklearn - Structured
  7. TPOT - Structured - Has auto-feature engineering
  8. TPOT2 - Structured - Has auto-feature engineering
  9. AutoKeras - Structured | Image | Text | Time Series | Multimodal
  10. FLAML - Structured
  11. PyCaret - Structured | Time Series | Text
  12. AutoGen - LLMs
  13. TransmogrifyAI - Structured
  14. Model Search by Google - Structured | Image | Text | Audio | Time Series - Use with care; this is compute-expensive

A number of these (FLAML, AutoGluon, AutoKeras) can also be extended with your own custom models, which don't have to be tabular.

If you have the ($)_($)

  15. GCP Vertex AutoML
  16. AWS SageMaker AutoML

Theoretically, you can also use any model hub for "AutoML" if you combine it with a sweep agent.

E.g. HuggingFace AutoTrain + Weights and Biases Sweeps - technically not AutoML, but with so many models available it's very easy to do (see the sketch below).
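
Here's a rough sketch of that hub-plus-sweep idea using the Weights & Biases Python API; the model names, metric, and project name are all illustrative, and the training body is left as a stub you'd replace with a real Hugging Face fine-tuning loop (assumes `pip install wandb`).

```python
import wandb

# Sweep over which checkpoint to fine-tune, treating the model hub as a search space.
sweep_config = {
    "method": "grid",
    "metric": {"name": "eval_accuracy", "goal": "maximize"},
    "parameters": {
        "model_name": {"values": ["distilbert-base-uncased", "roberta-base"]},
        "learning_rate": {"values": [2e-5, 5e-5]},
    },
}

def train():
    with wandb.init() as run:
        cfg = wandb.config
        # ...fine-tune cfg.model_name at cfg.learning_rate here...
        run.log({"eval_accuracy": 0.0})  # stub: log your real evaluation metric

sweep_id = wandb.sweep(sweep_config, project="model-hub-sweep")
wandb.agent(sweep_id, function=train, count=4)
```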

Research

Where to look for models/techniques and the like?

Please note that the availability and content of these model zoos may vary, so it's always best to refer to the official documentation provided by each platform.

Where not to look for models?

  • Towards Data Science and other unmoderated blogs (unless they link to one of the above)
  • Kaggle
  • Github Snippets

Rules to figure out what is and is not promising

  • Look at whether implementations already exist

  • If not, then I highly recommend finding another architecture. Implementing one yourself can be excruciating and time-consuming, but if you do have to:

  • If yes then

    • Look at code cleanliness first and foremost. Bad AI code is a major pain. Ask me for stories about PyTorch's FasterRCNN being broken and how we wasted a month on it.
      • Look at Git repository popularity. More people using something often means bugs have been caught or addressed.
      • Look at open issues in the git repo.
    • Look at whether people have already used these kinds of models in production
    • Look at whether you can find these models being used on data of similar complexity and scale to your use case
    • Understand the math and the process. Is it actually going to do something meaningful to your data - especially in the case of self-supervised learning. E.g. random cropping images of buttons to learn features in an auto-encoder won’t make sense but doing it for a whole UI image might.
  • See if the dataset the model is trained and tested on is publicly available and feasible to download - if not, don't fret too much over this step, since the goal is to make it work for your data and your problem.

  • Test your implementation on that dataset and see if you can reproduce the results within the ballpark (~2-5% error difference is fine)

  • See the Debugging AI models section below

More is covered in Planning below.

Planning

Finding the best way to collect data + Finding the right metric

Understanding and Planning LLMs

Blazing through these lectures by Full Stack Deep Learning will give you pretty much all you need to know about LLMs.

So simple question - Should I train and/or invest in working with an LLM?

  • I think the answer is "it depends" - but know that it's probably expensive. So see if you can get close with prompt tuning; if you can't, then fine-tune it on your own data; then consider model distillation; and finally think about full training. A tiny prompt-only sketch follows below.
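
As a sense of scale for the cheapest rung on that ladder, here's a prompt-only sketch with a locally run Hugging Face model; `gpt2` is just a small stand-in (a real project would use an instruction-tuned model), and it assumes `pip install transformers torch`.

```python
# Prompt-only: no training, just see how far a prompt gets you.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # stand-in model, not a recommendation
prompt = (
    "Classify the sentiment of the review as Positive or Negative.\n"
    "Review: The checkout flow kept crashing.\n"
    "Sentiment:"
)
print(generator(prompt, max_new_tokens=3)[0]["generated_text"])
```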

Baseline Testing

Also covered further in pre-requisite readings

Make your test dataset before you train anything

Setting up an appropriate baseline is an important step that many teams forget. There are three different baselines that you should think about (a quick code sketch follows the list):

  • Random baseline: if your model predicts everything randomly, what's the expected performance?
  • Human baseline: how well would humans perform on this task?
  • Simple heuristic: for example, for the task of recommending the app to use next on your phone, the simplest model would be to recommend your most frequently used app. If this simple heuristic can predict the next app accurately 70% of the time, any model you build has to outperform it significantly to justify the added complexity.
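
A quick sketch of the random and simple-heuristic (most-frequent) baselines using scikit-learn's DummyClassifier; the toy dataset just stands in for your real data (assumes `pip install scikit-learn`).

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy, imbalanced stand-in for your real dataset.
X, y = make_classification(n_samples=1000, weights=[0.7, 0.3], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for strategy in ("uniform", "most_frequent"):  # random baseline vs. simple heuristic
    baseline = DummyClassifier(strategy=strategy, random_state=0)
    baseline.fit(X_train, y_train)
    print(strategy, accuracy_score(y_test, baseline.predict(X_test)))
```

Any real model then has to beat the better of these by a clear margin to justify its complexity.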

Testing LLMs is hard

But here's the best we've been able to figure out - research here is always progressing

Data Tagging

It's extremely hard to find good advice or a one-size-fits-all solution for data annotation and what works well, but here are a few resources I've been able to find.

Tagging Guidelines

Labelling Guidelines by Eugene Yan

How to Develop Annotation Guidelines by Prof. Dr. Nils Reiter

Data pipeline integrity

  • Great Expectations: Helps data teams eliminate pipeline debt through data testing, documentation, and profiling. - Your new best friend as a Data Scientist
  • Soda Core: Data profiling, testing, and monitoring for SQL-accessible data. - Your kinda sorta-best friend
  • ydata-quality: Data Quality assessment with one line of code. - is cool but inflexible
  • Pandas Profiling: Extends the pandas DataFrame with df.profile_report() for quick data analysis. (See the sketch after this list.)
  • DataProfiler: A Python library designed to make data analysis, monitoring and sensitive data detection easy. - Bit tough to use
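
For instance, here's a minimal sketch of the Pandas Profiling workflow mentioned above; the CSV path is hypothetical, and newer releases of the library ship under the name `ydata-profiling` (assumes `pip install pandas-profiling`).

```python
import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_csv("your_data.csv")  # hypothetical path to the data you want to vet

report = ProfileReport(df, title="Quick data quality report")
report.to_file("data_report.html")  # open in a browser and skim for surprises
```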

Data tagging platforms

I love using Scale AI for tagging, but if you're looking for something free then LabelStudio is a good start.

Diagnosing Data

Training

Section not essential for anyone but MLEs

Understanding common data challenges in training

Understanding training, model selection, and other processes

Debugging AI models

Section not essential for anyone but MLEs

Google Model Tuning Playbook

Full Stack Deep Learning - Lecture 7: Troubleshooting Deep Neural Networks

Prototyping

It's often very useful to set up an internal prototyping/testing interface for any AI model (and its data) that you plan to deploy; a minimal sketch follows the two options below.

Gradio

Streamlit
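
As an example, here's a minimal Gradio sketch for such an internal interface; `classify_image` is a hypothetical stand-in for whatever model you actually plan to deploy (assumes `pip install gradio`).

```python
import gradio as gr

def classify_image(image):
    # Call your real model here; the fixed scores just keep the sketch runnable.
    return {"button": 0.7, "not_button": 0.3}

demo = gr.Interface(
    fn=classify_image,
    inputs=gr.Image(type="pil"),
    outputs=gr.Label(num_top_classes=2),
    title="Internal model playground",
)

if __name__ == "__main__":
    demo.launch()  # share=True gives teammates a temporary public link
```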

Testing and explainability

Infrastructure challenges and considerations

Full Stack Deep Learning - Lecture 10: Testing & Explainability

Tools

  • ZenoML - Data and model result explainability - very new but simple and great for computer vision

  • Netron: Visualizer for neural network, deep learning, and machine learning models.

  • Deepchecks: Test Suites for Validating ML Models & Data. Deepchecks is a Python package for comprehensively validating your machine learning models and data with minimal effort.

  • Evidently: Interactive reports to analyze ML models during validation or production monitoring.

Deployment and Scaling beyond your own machine

Deployment checklist

GitHub - modzy/model-deployment-checklist: An efficient, to-the-point, and easy-to-use checklist to follow when deploying an ML model into production.

Understanding infrastructure in general

This moves very fast and gets crazier by the month. Just take a look at the MAD: Machine Learning, Artificial Intelligence & Data Landscape for 2023.

But here are the essentials you need to make AI happen smoothly:

  • Code versioning
  • Model and Artifact versioning
  • Data versioning
  • Data storage and collection/collation pipeline
  • Model training infra - GPU machines, and preferably platforms like KubeFlow or SageMaker
  • Experiment Tracking - My go-to is Weights and Biases or Tensorboard, but if you're looking for a more packaged solution MLFlow is great
  • Monitoring/Logging
  • Inference Deployment Mechanism (more below)

There are also a bunch of all-in-one platforms that do all or most of these things like MLFlow, Neptune, Sagemaker, Vertex, or Polyaxon.
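
For a taste of what experiment tracking looks like in code, here's a minimal MLflow logging sketch; the parameter and metric names are made up, and runs land in a local ./mlruns folder by default (assumes `pip install mlflow`).

```python
import mlflow

with mlflow.start_run(run_name="baseline-logreg"):
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("max_epochs", 10)

    for epoch in range(10):
        # Replace with your real training loop and validation metric.
        mlflow.log_metric("val_accuracy", 0.5 + 0.04 * epoch, step=epoch)

# Browse runs locally with: mlflow ui
```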

Full Stack Deep Learning - Lecture 6: MLOps Infrastructure & Tooling

Deployment

The deployment “stack” - this also keeps moving quite fast but the basic principles remain the same. A quick Google Search doesn't hurt though.

Here's the last thing I saw that showed the latest changes in the landscape: A Shift in ML Deployment by James Detweiler (Felicis VC)

  1. Know your use-case's deployment platform e.g. Mobile, Web, Edge, etc.
  2. Find the stack/toolkit/library that works on your platform and with company requirements, e.g. Tensorflow Lite (Mobile/Edge), Google's Vertex AI prediction (SaaS), Torchserve/TFX (Backend and enterprise grade), and BentoML (Backend but simpler) - see the TFLite sketch after this list
  3. Understand your use-case's constraints e.g. real-time for video or batch for recommendation engines
  4. Optimize for time/cost/performance/hardware
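
As one hedged example of steps 2 and 4 for a mobile/edge target, here's a TensorFlow Lite conversion sketch; the saved-model path is hypothetical (assumes `pip install tensorflow` and a trained SavedModel on disk).

```python
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("exported_model/")  # hypothetical path
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # trades a little accuracy for size/latency

tflite_model = converter.convert()
with open("model.tflite", "wb") as f:
    f.write(tflite_model)  # ship this file to the mobile/edge runtime
```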

Full Stack Deep Learning - Lecture 11: Deployment & Monitoring

UPDATE: Lecture 5: Deployment

Specific Deployment System Examples which are short but good

Neptune AI Blog has some good examples too

LLM - Large Language Models by popular request

I've found that deployment depends on the model's needs, but Hugging Face has done a great job providing an API interface that "just works".
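
For example, here's a hedged sketch of calling a hosted model through the Hugging Face Inference API over plain HTTP; the model ID is illustrative, and it assumes a valid token in the HF_TOKEN environment variable (plus `pip install requests`).

```python
import os
import requests

MODEL_ID = "gpt2"  # illustrative; swap in the model you actually deploy
url = f"https://api-inference.huggingface.co/models/{MODEL_ID}"
headers = {"Authorization": f"Bearer {os.environ['HF_TOKEN']}"}

response = requests.post(url, headers=headers, json={"inputs": "Production ML is"})
response.raise_for_status()
print(response.json())
```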

Monitoring

How to make sure you know when your ML systems fail and can actually see it happening

Tools

  • Aporia: Observability with customized monitoring and explainability for ML models.
  • Gantry: ML Observability platform with analytics, alerting, and human feedback
  • Arize: An end-to-end ML observability and model monitoring platform.
  • WhyLabs: AI Observability platform - they also have open-source components
  • Fiddler: Monitor, explain, and analyze your AI in production.
  • Superwise: Fully automated, enterprise-grade model observability in a self-service SaaS platform.

If these platforms don't work for you, I recommend building your own pipeline using one of the following:

  • ELK Stack
  • Grafana's Stack
  • Manifold: A model-agnostic visual debugging tool for machine learning.
  • Your own logger + Your own data stores + Your own BI (Metabase, Superset, etc.) - not recommended

I'd also highly recommend some kind of hardware usage monitoring to see if models are actually efficient, e.g. RAM, CPU, and GPU % utilization - most if not all cloud platforms have this.
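
If you want a bare-bones DIY version, here's a sketch that samples CPU, RAM, and GPU utilization with psutil and NVML; it assumes an NVIDIA GPU and `pip install psutil nvidia-ml-py`.

```python
import time

import psutil
from pynvml import nvmlDeviceGetHandleByIndex, nvmlDeviceGetUtilizationRates, nvmlInit

nvmlInit()
gpu = nvmlDeviceGetHandleByIndex(0)

for _ in range(10):  # sample roughly once a second while your model is running
    util = nvmlDeviceGetUtilizationRates(gpu)
    print(
        f"cpu={psutil.cpu_percent()}% "
        f"ram={psutil.virtual_memory().percent}% "
        f"gpu={util.gpu}% gpu_mem_util={util.memory}%"
    )
    time.sleep(1)
```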

How do I keep up with the AI, Data Science and ML world?

  1. I follow a few newsletters like The Gradient, TLDR AI and The Batch, then augment them with the RSS feeds below
  2. I sometimes look at Arxiv Sanity
  3. I look at popular topics on Twitter and the common hashtags.
  4. I loosely follow the RSS feeds of the following blogs (I've uploaded the OPML file for this in this repo):

Acknowledgements and references

Stanford CS329S Course by Chip Huyen - CS 329S | Syllabus

Full Stack Deep Learning by Josh Tobin and Sergey Karayev

Rules of Machine Learning: | Google Developers

Google Model Tuning Playbook

Labelling Guidelines by Eugene Yan

How to Develop Annotation Guidelines by Prof. Dr. Nils Reiter