c17hawke/Data-augmentation-DMLS

Summary of the data augmentation topics from the book "Huyen, C. (2022). Designing Machine Learning Systems"


Data-Augmentation

Data augmentation is a family of techniques used to increase the amount of training data.

The author describes three kinds of data augmentation techniques:

Simple label-preserving transformation
Perturbation
Data synthesis
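As a rough illustration of the first category, the sketch below applies simple label-preserving transformations to a toy image and a toy sentence. It is a minimal standard-library sketch, not code from the book or from the example notebooks: the image is assumed to be a plain list of pixel rows, and the function names and the `synonyms` dictionary are hypothetical. In practice an image or NLP library would normally handle this.

```python
import random


def horizontal_flip(image):
    """Mirror an image (a list of pixel rows) left to right."""
    return [row[::-1] for row in image]


def rotate_90_clockwise(image):
    """Rotate an image (a list of pixel rows) by 90 degrees clockwise."""
    return [list(column) for column in zip(*image[::-1])]


def replace_with_synonyms(sentence, synonyms, prob=0.5, rng=None):
    """Replace each word that has known synonyms with probability `prob`.

    `synonyms` is a hypothetical lookup table: word -> list of synonyms.
    """
    rng = rng or random.Random()
    words = []
    for word in sentence.split():
        options = synonyms.get(word.lower())
        if options and rng.random() < prob:
            words.append(rng.choice(options))
        else:
            words.append(word)
    return " ".join(words)


# The label attached to each sample is reused unchanged for the new sample.
image, label = [[1, 2, 3], [4, 5, 6]], "cat"
augmented_images = [
    (horizontal_flip(image), label),
    (rotate_90_clockwise(image), label),
]

sentence, sentiment = "I am happy to see you", "positive"
augmented_sentence = (
    replace_with_synonyms(sentence, {"happy": ["glad", "pleased"]}),
    sentiment,
)
```

Each transformation produces a new sample while the original label is carried over as is, which is what makes this kind of augmentation cheap to apply.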

	Simple label-preserving transformation	Perturbation	Data synthesis
What?	Randomly modifying the data while preserving its label.	Adding noise to the data while preserving its label.	Generating new synthetic samples, for example with templates or generative models such as GANs. Costly DALL-E-like services can also be used.
Examples in CV	Random flipping, random rotation, etc.	Adding noise patches or changing the value of a single pixel.	Using CycleGAN to synthesize new samples.
Examples in NLP	Replacing words in a sentence with their synonyms.	Adding random symbols or words to a sentence.	Using templates to generate new samples.
Why?	To increase the number of training samples per label/class.	To improve model performance and to evaluate it (i.e., how robust is the model to adversarial attacks?).	To increase the amount of training data with synthetic samples.
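The perturbation and data synthesis columns can be sketched in the same spirit. The code below is a minimal standard-library sketch under stated assumptions, not the book's or the notebooks' implementation: images are assumed to be lists of rows of 0..255 grayscale values, and the function names, the template text, and the slot values are all hypothetical. GAN-based synthesis is out of scope here, so only the template-based variant is shown.

```python
import itertools
import random


def add_gaussian_noise(image, sigma=10.0, rng=None):
    """Add zero-mean Gaussian noise to every pixel, clipped to 0..255."""
    rng = rng or random.Random()
    return [
        [min(255, max(0, round(pixel + rng.gauss(0.0, sigma)))) for pixel in row]
        for row in image
    ]


def change_single_pixel(image, rng=None):
    """Return a copy of the image with one randomly chosen pixel overwritten."""
    rng = rng or random.Random()
    perturbed = [row[:] for row in image]
    r = rng.randrange(len(perturbed))
    c = rng.randrange(len(perturbed[r]))
    perturbed[r][c] = rng.randrange(256)
    return perturbed


def fill_template(template, slots):
    """Generate one sentence for every combination of slot values."""
    names = list(slots)
    combos = itertools.product(*(slots[name] for name in names))
    return [template.format(**dict(zip(names, combo))) for combo in combos]


# Perturbation: the noisy copies keep the label of the original image.
image, label = [[0, 128, 255], [64, 32, 16]], "cat"
noisy_samples = [
    (add_gaussian_noise(image, sigma=5.0), label),
    (change_single_pixel(image), label),
]

# Data synthesis with a template: every generated query shares one intent label.
queries = fill_template(
    "Find me a {cuisine} restaurant within {distance} miles.",
    {"cuisine": ["Thai", "Mexican"], "distance": ["2", "5"]},
)
synthetic_samples = [(query, "restaurant_search") for query in queries]
```

Training on the perturbed copies is one way to make a model less sensitive to noise, and comparing predictions on clean and perturbed inputs gives a rough check of its robustness. The template approach grows quickly: two slots with two values each already yield four new queries.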

Example notebooks -

Example notebooks for CV - link
Example notebooks for NLP - link

References -

[1] Huyen, C. (2022). Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications. O'Reilly Media.