This document attempts to collect the papers which developed important techniques in machine learning. Research is a collaborative process, discoveries are made independently, and the difference between the original version and a precursor can be subtle, but I’ve done my best to select the papers that I think are novel or significant.

Papers preceded by “See also” indicate either additional historical context or else major developments, breakthroughs, or applications.

Association Rule Learning

  • Scalable Algorithms for Association Mining (2000), Zaki, @IEEE 🔒.

  • Mining Frequent Patterns without Candidate Generation (2000), Han, Pei, and Yin, @acm.

  • Mining Association Rules between Sets of Items in Large Databases (1993), Agrawal, Imielinski, and Swami, @CiteSeerX 🏛️.

  • See also: The GUHA method of automatic hypotheses determination (1966), Hájek, Havel, and Chytil, @Springer 🔒 🏛️.


  • The Enron Corpus: A New Dataset for Email Classification Research (2004), Klimt and Yang, @Springer 🔒 / @author 🔑.
  • See also: Introducing the Enron Corpus (2004), Klimt and Yang, @author.
  • ImageNet: A large-scale hierarchical image database (2009), Deng et al., @IEEE 🔒 / @author 🔑.
  • See also: ImageNet Large Scale Visual Recognition Challenge (2015), @Springer 🔒 / @arXiv 🔑 + @author 🌐.

Decision Trees

  • Induction of Decision Trees (1986), Quinlan, @Springer.

Deep Learning

AlexNet (image classification CNN)
  • ImageNet Classification with Deep Convolutional Neural Networks (2012), Krizhevsky, Sutskever, and Hinton, @NIPS.
Convolutional Neural Network
  • Gradient-based learning applied to document recognition (1998), LeCun, Bottou, Bengio, and Haffner, @IEEE 🔒 / @author 🔑.
  • See also: Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position (1980), Fukushima, @Springer 🏛️.
  • See also: Phoneme recognition using time-delay neural networks (1989), Waibel, Hanazawa, Hinton, Shikano, and Lang, @IEEE 🏛️.
  • See also: Fully Convolutional Networks for Semantic Segmentation (2014), Long, Shelhamer, and Darrell, @arXiv.
DeepFace (facial recognition)
  • DeepFace: Closing the Gap to Human-Level Performance in Face Verification (2014), Taigman, Yang, Ranzato, and Wolf, Facebook Research.
Generative Adversarial Network
  • Generative Adversarial Nets (2014), Goodfellow et al., @NIPS + @GitHub 💽.
  • Improving Language Understanding by Generative Pre-Training (2018) aka GPT, Radford, Narasimhan, Salimans, and Sutskever, @OpenAI + @GitHub 💽 + @OpenAI 📔.
  • See also: Language Models are Unsupervised Multitask Learners (2019) aka GPT-2, Radford, Wu, Child, Luan, Amodei, and Sutskever, @OpenAI 🔬 + @GitHub 💽 + @OpenAI 📔.
  • See also: Language Models are Few-Shot Learners (2020) aka GPT-3, Brown et al., @arXiv + @OpenAI 📔.
Inception (classification/detection CNN)
  • Going Deeper with Convolutions (2014), Szegedy et al., @ai.google + @GitHub 💽.
  • See also: Rethinking the Inception Architecture for Computer Vision (2016), Szegedy, Vanhoucke, Ioffe, Shlens, and Wojna, @ai.google 🔬.
  • See also: Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning (2016), Szegedy, Ioffe, Vanhoucke, and Alemi, @ai.google 🔬.
Long Short-Term Memory (LSTM)
  • Long Short-term Memory (1995), Hochreiter and Schmidhuber, @CiteSeerX.
Residual Neural Network (ResNet)
  • Deep Residual Learning for Image Recognition (2015), He, Zhang, Ren, and Sun, @arXiv.
Transformer (sequence to sequence modeling)
  • Attention Is All You Need (2017), Vaswani et al., @NIPS.
U-Net (image segmentation CNN)
  • U-Net: Convolutional Networks for Biomedical Image Segmentation (2015), Ronneberger, Fischer, and Brox, @Springer 🔒 / @arXiv 🔑.
VGG (image recognition CNN)
  • Very Deep Convolutional Networks for Large-Scale Image Recognition (2015), Simonyan and Zisserman, @arXiv + @author 🌐 + @ICLR 📊 + @YouTube 🎥.

Ensemble Methods

  • A Decision-Theoretic Generalization of on-Line Learning and an Application to Boosting (1997; published as abstract in 1995), Freund and Schapire, @CiteSeerX.

  • See also: Experiments with a New Boosting Algorithm (1996), Freund and Schapire, @CiteSeerX 🔬.

  • Bagging Predictors (1996), Breiman, @Springer.
Gradient Boosting
  • Greedy function approximation: A gradient boosting machine (2001), Friedman, @Project Euclid.
  • See also: XGBoost: A Scalable Tree Boosting System (2016), Chen and Guestrin, @arXiv 🔬 + @GitHub 💽.
Random Forest
  • Random Forests (2001), Breiman, @CiteSeerX.


  • Mastering the game of Go with deep neural networks and tree search (2016), Silver et al., @Nature.
Deep Blue
  • IBM's deep blue chess grandmaster chips (1999), Hsu, @IEEE 🔒.
  • See also: Deep Blue (2002), Campbell, Hoane, and Hsu, @ScienceDirect 🔒.


  • Adam: A Method for Stochastic Optimization (2015), Kingma and Ba, @arXiv.
Expectation Maximization
  • Maximum likelihood from incomplete data via the EM algorithm (1977), Dempster, Laird, and Rubin, @CiteSeerX.
Stochastic Gradient Descent
  • Stochastic Estimation of the Maximum of a Regression Function (1952), Kiefer and Wolfowitz, @ProjectEuclid.
  • See also: A Stochastic Approximation Method (1951), Robbins and Monro, @ProjectEuclid 🏛️.


Non-negative Matrix Factorization
  • Learning the parts of objects by non-negative matrix factorization (1999), Lee and Seung, @Nature 🔒.
  • The PageRank Citation Ranking: Bringing Order to the Web (1998), Page, Brin, Motwani, and Winograd, @CiteSeerX.
DeepQA (Watson)
  • Building Watson: An Overview of the DeepQA Project (2010), Ferrucci et al., @AAAI.

Natural Language Processing

Latent Dirichlet Allocation
  • Latent Dirichlet Allocation (2003), Blei, Ng, and Jordan, @JMLR.
Latent Semantic Analysis
  • Indexing by latent semantic analysis (1990), Deerwester, Dumais, Furnas, Landauer, and Harshman, @CiteSeerX.
  • Efficient Estimation of Word Representations in Vector Space (2013), Mikolov, Chen, Corrado, and Dean, @arXiv + @Google Code 💽.

Neural Network Components

  • Autograd: Effortless Gradients in Numpy (2015), @ICML + @ICML 📊 + @GitHub 💽.
  • Learning representations by back-propagating errors (1986), Rumelhart, Hinton, and Williams, @Nature 🔒.
  • See also: Backpropagation Applied to Handwritten Zip Code Recognition (1989), LeCun et al., @IEEE 🔒 🔬 / @author 🔑.
Batch Normalization
  • Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (2015), Ioffe and Szegedy, @ICML via PMLR.
  • Dropout: A Simple Way to Prevent Neural Networks from Overfitting (2014), Srivastava, Hinton, Krizhevsky, Sutskever, and Salakhutdinov, @JMLR.
Gated Recurrent Unit
  • Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation (2014), Cho et al., @arXiv.
  • The Perceptron: A Probabilistic Model for Information Storage and Organization in The Brain (1958), Rosenblatt, @CiteSeerX.

Recommender Systems

Collaborative Filtering
  • Using collaborative filtering to weave an information tapestry (1992), Goldberg, Nichols, Oki, and Terry, @CiteSeerX.
Matrix Factorization
  • Application of Dimensionality Reduction in Recommender System - A Case Study (2000), Sarwar, Karypis, Konstan, and Riedl, @CiteSeerX.
  • See also: Learning Collaborative Information Filters (1998), Billsus and Pazzani, @CiteSeerX 🏛️.
  • See also: Netflix Update: Try This at Home (2006), Funk, @author 📔 🔬.
Implicit Matrix Factorization
  • Collaborative Filtering for Implicit Feedback Datasets (2008), Hu, Koren, and Volinsky, @IEEE 🔒 / @author 🔑.


Elastic Net
  • Regularization and variable selection via the Elastic Net (2005), Zou and Hastie, @CiteSeerX.
  • Regression Shrinkage and Selection Via the Lasso (1994), Tibshirani, @CiteSeerX.
  • See also: Linear Inversion of Band-Limited Reflection Seismograms (1986), Santosa and Symes, @SIAM 🏛️.


  • MapReduce: Simplified Data Processing on Large Clusters (2004), Dean and Ghemawat, @ai.google.
  • TensorFlow: A system for large-scale machine learning (2016), Abadi et al., @ai.google + @author 🌐.
  • Torch: A Modular Machine Learning Software Library (2002), Collobert, Bengio, and Mariéthoz, @Idiap + @author 🌐.
  • See also: Automatic differentiation in PyTorch (2017), Paszke et al., @OpenReview 🔬 + @GitHub 💽.

Supervised Learning

k-Nearest Neighbors
  • Nearest neighbor pattern classification (1967), Cover and Hart, @IEEE 🔒.
  • See also: E. Fix and J.L. Hodges (1951): An Important Contribution to Nonparametric Discriminant Analysis and Density Estimation (1989), Silverman and Jones, @JSTOR 🔒.
Support Vector Machine
  • Support Vector Networks (1995), Cortes and Vapnik, @Springer.


The Bootstrap
  • Bootstrap Methods: Another Look at the Jackknife (1979), Efron, @Project Euclid.
  • See also: Problems in Plane Sampling (1949), Quenouille, @Project Euclid 🏛️.
  • See also: Notes on Bias in Estimation (1956), Quenouille, @JSTOR 🏛️.
  • See also: Bias and Confidence in Not-quite Large Samples (1958), Tukey, @Project Euclid 🔬.


