Helpful notes for machine learning system design interview preparation, gathered from various resources.
Learn how to design Machine Learning systems and prepare for an interview.
Facebook Field Guide to Machine Learning
CS 329S: Machine Learning Systems Design, Stanford, Winter 2022
ML Systems Design Interview Guide
ML System Design interview example
Feel free to submit pull requests to help:
- Fix errors
- Improve sections
- Add new sections and cases
- Ask questions
- Tell pros and cons of different solutions
- Start with a simple solution as a baseline
After listening to the case description, ask the interviewer clarifying questions. Repeat the main points to make sure you understand everything correctly. Throughout the interview, keep asking questions and state your assumptions.
Understand requirements:
- Number of users/samples
- Peak number of requests
- Batch or online predictions
- Edge or server computations
Define a proxy machine learning metric for the business goal.
- Define the business goal.
- Define ML task type. Classification/regression/other
- Split the task into subtasks. Example: maximize user engagement while minimizing the spread of extreme views and misinformation.
- Data source and data type
- Where does the data come from? Is it all in the same format, or do we need to transform and join it?
- One sample of data: what is X (features) and what is Y (labels)?
- Labeling
- Are the labels known? Is there natural labelling? Do we need to label some data ourselves?
- Sampling
- Data recency and distribution drift.
- Offline evaluation
- Data split
- Random split, or should we split by date, user, or product to prevent data leakage?
- Metric
- Choose a metric that is interpretable and sensitive to the task. Think about which errors are most harmful: FP or FN for classification, over- or under-predicting for regression.
- Baseline
- Mention a non-ML baseline solution. You will compare your machine learning models against this baseline.
- Online evaluation
- Online-offline gap
- Online comparison
- A/B randomised test
- A/A test.
- What type of data do we have? Can we encode it?
- Feature representation, data preprocessing.
- Data augmentation.
- Pick the model (a quick baseline-vs-GBDT comparison sketch follows this list).
- Pros and cons of the model.
- Architecture overview at a glance.
- Linear model
- GBDT
- Embeddings + KNN
- Neural networks
- What loss function will you use?
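A minimal sketch of comparing a simple linear baseline against a GBDT, assuming a generic tabular binary classification task; the synthetic dataset and ROC AUC metric below are placeholders, not part of the original notes.

```python
# Sketch: compare a linear baseline against a GBDT on the same CV splits.
# The synthetic dataset and ROC AUC scoring are assumptions for illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=5000, n_features=20, random_state=42)

models = {
    "linear baseline": LogisticRegression(max_iter=1000),
    "GBDT": GradientBoostingClassifier(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean ROC AUC = {scores.mean():.3f}")
```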
- Deployment
- Experiments
- Monitoring & Continual Learning
- Batch prediction vs Online prediction
- Model compression
- Low-rank factorisation, e.g. MobileNet-style optimizations.
- Pruning
- Knowledge distillation
- Quantization (see the PyTorch quantization sketch after this list)
- Special inference formats (pb, ONNX, TorchScript).
- Edge / Cloud computing
- Monitoring -> Continual Learning
- Detect distribution shift -> adapt with CL
- Check whether more frequent retraining boosts performance
- Eval
- Offline - sanity check
- Online
- Canary - run the old and new models side by side, slowly routing traffic to the new one
- A/B - test
- Interleaved experiments - predictions from the two models are mixed together
- Shadow test - log predictions from the new model alongside the old one, then analyze them
- Internal test on coworkers (only a sanity check)
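A minimal sketch of post-training dynamic quantization in PyTorch, assuming a model dominated by `nn.Linear` layers; the `TwoLayerNet` model below is a made-up example.

```python
# Sketch: post-training dynamic quantization with PyTorch.
# TwoLayerNet is a hypothetical model used only for illustration.
import torch
import torch.nn as nn

class TwoLayerNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 2))

    def forward(self, x):
        return self.net(x)

model = TwoLayerNet().eval()

# Linear weights are stored as int8; activations are quantized on the fly at inference.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
print(quantized)
```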
- Data source
- User-generated - user data
- System-generated - internal data
- Third-party data; when collecting data from all of these sources, mention GDPR
- Data formats
- Row-major or column-major
- Row-major - fast writes
- Column-major - fast column reads (see the Parquet sketch after this list)
- Text or binary
- Data models
- Relational
- NoSQL - "not only SQL"
- Document
- Graph
- Structured/unstructured
- Data storage engines and processing
- ETL /ELT
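A minimal sketch of the row-major vs column-major trade-off with pandas, assuming CSV as the row-oriented text format and Parquet (with pyarrow installed) as the column-oriented binary format; the file names and columns are made up.

```python
# Sketch: row-oriented CSV vs column-oriented Parquet with pandas.
# Assumes pyarrow is installed as the Parquet engine.
import numpy as np
import pandas as pd

n = 100_000
df = pd.DataFrame({
    "user_id": np.arange(n),
    "age": np.random.randint(18, 80, n),
    "country": np.random.choice(["US", "DE", "IN"], n),
})

df.to_csv("events.csv", index=False)   # row-major text: cheap appends/writes
df.to_parquet("events.parquet")        # column-major binary: cheap column scans

# Reading a single column only touches that column in the Parquet file.
ages = pd.read_parquet("events.parquet", columns=["age"])
print(ages["age"].mean())
```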
Sampling - sampling from all possible real-world data to create training data
- Non-probability sampling - introduces selection bias - OK for an initial version
- Convenience - what is available
- Snowball - Example: collect friends of friends on the social network.
- Judgement - expert decision.
- Quota - Example: 30 from each age group
- Probability Sampling
- Simple random sampling
- Stratified
- Weighted
- Importance sampling (RL)
- Reservoir sampling - sampling from a stream.
- Keep k samples in a reservoir; n is the length of the stream so far.
- For the n-th element, generate i = random(1, n); if i <= k, replace the i-th element of the reservoir (see the sketch after this list).
- This gives each element the same probability k/n of being in the reservoir.
- Training sample selection, for example positive/negative sampling in metric learning.
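A minimal sketch of reservoir sampling (Algorithm R), matching the update rule described in the reservoir sampling bullet above.

```python
# Sketch: reservoir sampling (Algorithm R) - keep k uniform samples
# from a stream whose length is not known in advance.
import random

def reservoir_sample(stream, k):
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)       # fill the reservoir first
        else:
            i = random.randint(1, n)     # uniform in 1..n inclusive
            if i <= k:
                reservoir[i - 1] = item  # keep the new item with probability k/n
    return reservoir

print(reservoir_sample(range(1_000_000), k=10))
```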
- Label types
- Hand label
- Measure of annotator agreement - Fleiss' kappa
- Data lineage - preserve the source of each data point, so you can tell whether its labelling worsens the model
- Natural labels, e.g. like/dislike, clicks, etc.
- Handling the lack of hand labels
- Weak supervision - define labelling functions (e.g. regex rules) and use them to label data (see the sketch after this list). Pros: cheap; cons: noisy
- Semi-supervision - train a model on a small labelled subset, then label the rest of the data with it
- Transfer learning - train on one task/domain, replace the final layer, then fine-tune on the target data
- Active learning/query learning - decide what samples to label next.
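A minimal sketch of the weak supervision idea from the bullet above, using hand-written regex labelling functions for a hypothetical spam task; the functions and the majority vote are illustrative, not a specific library API.

```python
# Sketch: weak supervision via hand-written regex labelling functions
# for a hypothetical spam-detection task; labels are cheap but noisy.
import re

SPAM, NOT_SPAM, ABSTAIN = 1, 0, -1

def lf_mentions_prize(text):
    return SPAM if re.search(r"\b(win|prize|free)\b", text, re.I) else ABSTAIN

def lf_personal_greeting(text):
    return NOT_SPAM if re.search(r"^(hi|hello|dear)\b", text, re.I) else ABSTAIN

def weak_label(text, lfs=(lf_mentions_prize, lf_personal_greeting)):
    votes = [v for v in (lf(text) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)  # majority vote over non-abstaining LFs

print(weak_label("WIN a FREE prize now!"))   # -> 1
```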
- Class imbalance
- It is important to use the right metric, based on the cost of each error type.
- Techniques
- Data-level - resample
- Under/oversample
- SMOTE - oversample by synthesizing new points between existing minority samples
- Algorithm-level
- Balanced loss
- Focal loss - up-weights hard examples: FL = CE * (1 - p_t)^gamma (see the sketch after this list)
- Overall.
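A minimal NumPy sketch of the binary focal loss mentioned above, FL = -(1 - p_t)^gamma * log(p_t); the example probabilities are made up.

```python
# Sketch: binary focal loss, which down-weights easy examples
# (p_t close to 1) relative to plain cross-entropy.
import numpy as np

def focal_loss(y_true, p_pred, gamma=2.0, eps=1e-7):
    p_pred = np.clip(p_pred, eps, 1 - eps)
    p_t = np.where(y_true == 1, p_pred, 1 - p_pred)  # probability of the true class
    return -np.mean((1 - p_t) ** gamma * np.log(p_t))

y = np.array([1, 1, 0, 0])
p = np.array([0.9, 0.6, 0.2, 0.4])
print(focal_loss(y, p))  # hard examples dominate the loss as gamma grows
```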
- Handling missing values
- Feature crossing
- Numeric values.
- demeaning
- scaling - e.g. log for skew
- remove outliers
- binning
- quantization
- Categorical values.
- One-hot
- Hashing trick (see the sketch after this list)
- Embedding
- Text.
- Embeddings from BERT-like models
- Averaged fastText vectors (fast)
- Complex.
- Concatenation. Example: Product title + category + other features
- Encode -> Attention. Example: User history
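A minimal sketch of the hashing trick for a high-cardinality categorical feature, hand-rolled here; the bucket count and category names are arbitrary assumptions.

```python
# Sketch: the hashing trick - map any category string into a fixed-size
# vector without maintaining a vocabulary; collisions are possible but bounded.
import hashlib
import numpy as np

def hash_encode(value, n_buckets=32):
    vec = np.zeros(n_buckets)
    h = int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)
    vec[h % n_buckets] = 1.0
    return vec

print(hash_encode("electronics").nonzero())
print(hash_encode("a_brand_new_category").nonzero())  # unseen categories need no encoder retraining
```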
- Offline evaluation
- How to split the data?
- Classic k-fold (train + validation) plus a held-out test set
- If the data is time-sensitive (sorted by time):
- split by time
- splitting by time + margin
- prequential validation
- If the data is user- or product-sensitive:
- split by user/product to prevent data leakage
- If there is a cold-start problem:
- drop some data from user histories
- include some users with empty or minimal history
- Specificity/Sparsity trade-off
- **Choose metric**
- Interpretable: you can say exactly what the metric is showing.
- Sensitive to the task: the metric lets us say, in the context of the task, which model is better.
- Calibration on the test set.
- Slicing. Slice by some features to find model failures.
- Baseline evaluation.
- Random label
- Majority label
- Simple heuristic (e.g. if the text contains a swear word -> toxic message)
- Human label
- Existing solution
- Evaluation methods
- Perturbation test - Check the model on noisy samples.
- Invariance (Fairness)
- Directional expectations - sanity check
- Model calibration. Outputs from ML models are not necessarily probabilities; if you need probabilities, calibrate the model (see the sketch after this list).
- Confidence measure
- Slice based metrics.
- Heuristic
- Error analysis
- Slice finder algorithms (e.g. FreaAI): generate candidate slices with beam search, then validate them.
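A minimal sketch of probability calibration with scikit-learn for the model calibration bullet above, assuming a generic binary classifier whose raw scores are not well calibrated; the dataset and model choice are placeholders.

```python
# Sketch: isotonic calibration of a classifier's probabilities with scikit-learn.
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, random_state=0)
X_tr, X_ts, y_tr, y_ts = train_test_split(X, y, test_size=0.3, random_state=0)

calibrated = CalibratedClassifierCV(GradientBoostingClassifier(), method="isotonic", cv=3)
calibrated.fit(X_tr, y_tr)

# A calibrated model's reliability curve should lie close to the diagonal.
prob_true, prob_pred = calibration_curve(y_ts, calibrated.predict_proba(X_ts)[:, 1], n_bins=10)
print(list(zip(prob_pred.round(2), prob_true.round(2))))
```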
- Handling missing values
- Scaling - e.g. log for skew
- Discretization
- Encoding. Hashing trick
- Feature crossing, e.g. in recsys, to add nonlinearity
- Positional embeddings, e.g. positional encoding in BERT
- Simple label-preserving transformations, e.g. synonym replacement
- Perturbation/adversarial - add hard samples, e.g. by injecting noise (see the sketch after this list)
- Data synthesis - add new samples, e.g. generated with a different model
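A minimal sketch of perturbation-based augmentation for numeric features, adding small Gaussian noise; the noise scale is an assumption and the transformation is label-preserving only if the noise stays small.

```python
# Sketch: perturbation augmentation - copy the data with small Gaussian noise added.
import numpy as np

def augment_with_noise(X, y, n_copies=2, sigma=0.01, seed=0):
    rng = np.random.default_rng(seed)
    X_aug = [X] + [X + rng.normal(0.0, sigma, size=X.shape) for _ in range(n_copies)]
    y_aug = [y] * (n_copies + 1)           # labels are reused (assumed preserved)
    return np.concatenate(X_aug), np.concatenate(y_aug)

X = np.random.rand(100, 5)
y = np.random.randint(0, 2, 100)
X_big, y_big = augment_with_noise(X, y)
print(X_big.shape, y_big.shape)  # (300, 5) (300,)
```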
- Online-offline gap
- Online
- A/B randomised test (see the z-test sketch at the end of these notes)
- Minimise time to the first online test
- Isolate engineering bugs from ML issues
- Run an A/A test before adding the real model to the new system
- Real-world FL
- Backtest + forward test
- More metrics to check the correctness
- Triangulate causes by iterating in atomic steps
- Have a backup plan
- Calibration: the average prediction should equal the average ground-truth label
- If they differ on the training set - underfitting; on the test set - overfitting; online - online/offline gap
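A minimal sketch of a two-proportion z-test for the A/B randomised test mentioned above, applied to a conversion-style metric; the counts are made up, and in practice the significance level and test duration should be fixed before the experiment.

```python
# Sketch: two-sided two-proportion z-test for an A/B experiment.
from math import sqrt
from scipy.stats import norm

def ab_ztest(conv_a, n_a, conv_b, n_b):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - norm.cdf(abs(z)))
    return z, p_value

# Hypothetical numbers: control vs treatment conversions out of 50k users each.
z, p = ab_ztest(conv_a=1_200, n_a=50_000, conv_b=1_320, n_b=50_000)
print(f"z = {z:.2f}, p-value = {p:.4f}")
```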