- I emphasize mathematical/conceptual foundations because implementations of these ideas (e.g. Torch, TensorFlow) will keep evolving, but the underlying theory must be sound. Anybody with an interest in deep learning can and should try to understand why things work.
- I include neuroscience as a useful conceptual foundation for two reasons. First, it may provide inspiration for future models and algorithms. Second, the success of deep learning can contribute useful hypotheses and models to computational neuroscience.
- Information theory is also a very useful foundation, as there is a strong connection between data compression and statistical prediction. In fact, data compressors and machine learning models can both be viewed as approximations to Kolmogorov complexity, which sets the (uncomputable) limit of data compression; a minimal sketch of this connection follows this list.
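To make the compression/prediction connection concrete, here is a minimal, self-contained sketch (the toy string and the empirical unigram "model" are my own illustrative assumptions, not taken from any paper listed below): a model that assigns probability p(x) to a symbol can, in principle, encode it in about -log2 p(x) bits (e.g. via arithmetic coding), so a better predictive model is a better compressor.

```python
# Minimal sketch of the prediction <-> compression link (Shannon/MDL view).
# The text and the unigram "model" below are illustrative assumptions only.
import math
from collections import Counter

text = "abracadabra abracadabra abracadabra"

# Baseline: a uniform code over the observed alphabet (no statistical model).
alphabet = set(text)
uniform_bits = len(text) * math.log2(len(alphabet))

# Crude statistical predictor: empirical unigram frequencies.
counts = Counter(text)
total = len(text)
# Ideal code length under the model: sum of -log2 p(symbol) over the text.
model_bits = -sum(math.log2(counts[c] / total) for c in text)

print(f"uniform code : {uniform_bits:.1f} bits")
print(f"unigram model: {model_bits:.1f} bits")  # fewer bits: better prediction compresses better
```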
You might notice that I haven't emphasized the latest benchmark-beating papers. My reason is that a good theory ought to be scalable: it should be capable of explaining why deep models generalise, and we should be able to bootstrap these explanations to more complex models (e.g. sequences of deep models, a.k.a. RNNs). This is how all good science is done.
For an excellent historical overview of deep learning, I would recommend reading Deep Learning in Neural Networks: An Overview (J. Schmidhuber. 2015.) as well as R. Salakhutdinov's Deep Learning Tutorials.
-
History:
-
Optimisation:
- Learning Internal Representations by Error Propagation (D. Rumelhart et al. 1986. MIT Press.)
- Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (S. Ioffe & C. Szegedy. 2015. ICML.)
- Weight Normalization (Salimans 2016. NIPS.)
- Bayesian Back-Propagation (W. Buntine & A. Weigend 1991.)
- Credit Assignment through Time: Alternatives to Backpropagation (Y. Bengio. 1993. NIPS.)
- Adam: A Method for Stochastic Optimization (D. Kingma & J. Ba. 2015. ICLR.)
- Understanding Synthetic Gradients and Decoupled Neural Interfaces (W. Czarnecki et al. 2017. CoRR.)
- Learning Deep ResNet Blocks Sequentially using Boosting Theory (F. Huang et al. 2017.)
- Failures of Gradient-Based Deep Learning (S. Shalev-Shwartz et al. 2017.)
- On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima (N. Keskar et al. 2017. ICLR.)
-
Regularisation:
- Dropout: A Simple Way to Prevent Neural Networks from Overfitting (N. Srivastava et al. 2014. Journal of Machine Learning Research.)
- Why Does Unsupervised Pre-training Help Deep Learning? (D. Erhan et al. 2010. Journal of Machine Learning Research.)
- Semi-Supervised Learning with Ladder Networks (A. Rasmus et al. 2015. NIPS.)
- Tensor Contraction Layers for Parsimonious Deep Nets (J. Kossaifi et al. 2017.)
-
Inference:
- Uncertainty in Deep Learning (Yarin Gal. 2016. University of Cambridge.)
- Mixture Density Networks (Bishop 1994)
- Dropout as a Bayesian Approximation (Yarin Gal. 2016. ICML.)
- Markov Chain Monte Carlo and Variational Inference: Bridging the Gap (Salimans. 2015. ICML.)
- Auto-Encoding Variational Bayes (D. Kingma & M. Welling. 2014. ICLR.)
- Variational Dropout and the Local Reparameterization Trick (D. Kingma, T. Salimans & M. Welling. 2015. NIPS.)
- Improved Variational Inference with Inverse Autoregressive Flow (D. Kingma, T. Salimans et al. 2016. NIPS.)
- Avoiding pathologies in very deep networks (D. Duvenaud et al. 2014. AISTATS.)
- Stochastic Gradient Hamiltonian Monte Carlo (T. Chen. 2014. ICML.)
- On Sparse Variational Methods and the Kullback-Leibler Divergence between Stochastic Processes (A. Matthews et al. 2016. AISTATS.)
- Scalable Gaussian Process inference using variational methods (A. Matthews. 2016.)
-
Representation Learning:
- Representation Learning: A Review and New Perspectives (Y. Bengio et al. 2013. IEEE Transactions on Pattern Analysis and Machine Intelligence.)
- Deep Learning of Representations for Unsupervised and Transfer Learning (Y. Bengio. 2012. ICML.)
- Learning Invariant Feature Hierarchies (Y. Lecun. 2012. ECCV Workshops.)
- Independently Controllable Features (E. Bengio et al. 2017.)
- On the number of response regions of deep feedforward networks with piecewise linear activation (R. Pascanu, G. Montufar & Y. Bengio. 2013.)
- Towards Principled Unsupervised Learning (Ilya Sutskever et al. 2015. ICLR.)
- [Understanding Representations Learned in Deep Learning (D. Erhan et al. 2010.)](https://github.com/pauli-space/foundations_for_deep_learning/blob/master/deep_learning/representation_learning/understanding_representations_in_deep_networks.pdf)
-
Deep Generative Models:
- Learning Deep Generative Models (R. Salakhutdinov. 2015. Annual Review of Statistics and Its Application.)
- Learning Disentangled Representations with Semi-Supervised Deep Generative Models (N. Siddharth et al. 2017.)
- Generative Adversarial Nets (I. Goodfellow et al. 2014. NIPS.)
- On Unifying Deep Generative Models (Z. Hu et al. 2017.)
- Variational Approaches for Auto-Encoding Generative Adversarial Networks (M. Rosca et al. 2017.)
- Generative Moment Matching Networks (Y. Li et al. 2015.)
-
Continual Learning:
-
Hyperparameter Optimization:
- Taking the Human Out of the Loop: A Review of Bayesian Optimization (B. Shahriari et al. 2016. Proceedings of the IEEE.)
- Convolution by Evolution (C. Fernando et al. 2016. GECCO.)
- Fast Bayesian Optimization of Machine Learning Hyperparameters on Large Datasets (A. Klein et al. 2017.)
- Scalable Bayesian Optimization Using Deep Neural Networks (Jasper Snoek et al. 2015. ICML. )
-
Quantization:
- Bitwise Neural Networks (Minje Kim et al. 2016.)
- Expectation Backpropagation: Parameter-Free Training of Multilayer Neural Networks with Continuous or Discrete Weights (D. Soudry et al. 2014. NIPS.)
- Incremental Network Quantization: Towards Lossless CNNs with Low-Precision Weights (A. Zhou et al. 2017.)
- Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations (Itay Hubara et al. 2016.)
-
Optimisation:
- Simple Explanation of the No-Free-Lunch Theorem and Its Implications (Y. Ho. 2002. Journal of optimization theory and applications.)
- The Loss Surfaces of Multilayer Networks (Y. LeCun et al. 2015. AISTATS.)
- The loss surface of deep and wide neural networks (Q. Nguyen. 2017.)
- Qualitatively Characterizing Neural Network Optimization Problems (I. Goodfellow et al. 2015. ICLR.)
- The Physical Systems behind Optimization (L. Yang et al. 2017.)
- A Differential Equation for Modeling Nesterov’s Accelerated Gradient Method (W. Su. 2016. Journal of Machine Learning Research.)
- Electron-Proton dynamics in deep learning (Zhang 2017. CoRR.)
- Sharp Minima Can Generalize for Deep Nets (L. Dinh et al. 2017. ICML.)
- Deep Learning without Poor Local Minima (K. Kawaguchi. 2016. NIPS.)
- Identifying and attacking the saddle point problem in high-dimensional non-convex optimization (Y. Dauphin et al. 2014. NIPS.)
- Recursive Decomposition for Nonconvex Optimization (A. Friesen and P. Domingos. 2016.)
- Sobolev Training for Neural Networks (W. Czarnecki et al. 2017.)
- Stochastic Gradient Descent as Approximate Bayesian Inference (S. Mandt, M. Hoffman & D. Blei. 2017)
- No bad local minima: Data independent training error guarantees for multilayer neural networks (Daniel Soudry, Yair Carmon. 2016.)
-
Representation Learning:
- A mathematical theory of Deep Convolutional Neural Networks for Feature Extraction (Wiatowski 2016. CoRR.)
- Spectral Representations for Convolutional Neural Networks (O. Rippel et al. 2015. NIPS.)
- Provable bounds for learning some deep representations (Sanjeev Arora et al. 2013.)
- Spectrally-normalized margin bounds for neural networks (Peter Bartlett. 2017.)
- Exploring generalization in deep learning (Behnam Neyshabur et al. 2017.)
-
Learning theory:
- Distribution-Specific Hardness of Learning Neural Networks (Shamir 2017. CoRR.)
- Lessons from the Rademacher Complexity for Deep Learning (Sokolic 2016. ICLR.)
- Principles of Risk Minimization for Learning Theory (V. Vapnik. 1991. NIPS.)
- Dataset Shift (A. Storkey. 2013.)
- On the ability of neural nets to express distributions (H. Lee, R. Ge, T. Ma, A. Risteski & S. Arora, 2017)
- Probably Approximately Correct Learning (R. Schapire. COS 511: Foundations of Machine Learning. 2006.)
- Rademacher Complexity (M. Balcan. CS 8803 - Machine Learning Theory. 2011.)
-
Learning behaviour:
-
Unsupervised Learning:
-
Generalisation:
- Shannon Information and Kolmogorov Complexity (Grunwald 2010)
- Discovering Neural Nets with Low Kolmogorov Complexity (Schmidhuber 1997. Neural Networks.)
- Opening the black box of Deep Neural Networks via Information (R. Shwartz-Ziv & N. Tishby. 2017.)
- On the emergence of invariance and disentangling in deep representations (A. Achille & S. Soatto. 2017.)
- On Learning to Think: Algorithmic Information Theory for Novel Combinations of Reinforcement Learning Controllers and Recurrent Neural World Models (J. Schmidhuber. 2015.)
-
Neuroscience:
- Towards an integration of deep learning and neuroscience (Marblestone 2016. Frontiers in Computational Neuroscience.)
- Equilibrium Propagation (Scellier 2016. Frontiers in Computational Neuroscience.)
- Towards Biologically Plausible Deep Learning (Bengio 2015. CoRR.)
- Random synaptic feedback weights support error backpropagation for deep learning (Lillicrap 2016. Nature Communications.)
- Towards deep learning with spiking neurons (Mesnard 2016. NIPS.)
- Towards deep learning with segregated dendrites (Guerguiev 2017.)
- Variational learning for recurrent spiking networks (Rezende 2011. NIPS.)
- A view of Neural Networks as dynamical systems (Cessac 2009. I. J. Bifurcation and Chaos.)
- Convolutional network layers map the function of the human visual system (M. Eickenberg. 2016. NeuroImage Elsevier.)
- Cortical Algorithms for Perceptual Grouping (P. Roelfsema. 2006. Annual Review of Neuroscience.)
- Temporally Efficient Deep Learning with Spikes (P. O'Connor, E. Gavves & M. Welling. 2017)
- Hierarchical Bayesian Inference in the visual cortex (T. Lee & D. Mumford. 2003.)
- Gradient Descent for Spiking Neural Networks (D. Huh & T. Sejnowski. 2017.)
- How Important Is Weight Symmetry in Backpropagation? (Qianli Liao, Joel Z. Leibo, Tomaso A. Poggio. 2016. AAAI.)
-
Statistical physics:
- Phase Transitions of Neural Networks (W. Kinzel. 1997. Universität Würzburg.)
- Convolutional Neural Networks Arise From Ising Models and Restricted Boltzmann Machines (S. Pai)
- Non-equilibrium statistical mechanics: From a paradigmatic model to biological transport (T. Chou et al. 2011.)
- Replica Theory and Spin Glasses (F. Morone et al. 2014.)
Note 1: There are many who love quoting Richard Feynman and Albert Einstein whenever it suits their purpose. However, Feynman's popular quote, 'What I cannot create, I do not understand', has been taken out of context by many AI researchers. There are many things we can build that we can't understand, and many things we can't build that we understand very well; take any non-constructive proof in mathematical physics, for example. It follows that it's important to create, but essential to understand. In fact, I think it makes more sense to consider the perspective of Marie Curie: "Nothing in life is to be feared, it is only to be understood. Now is the time to understand more, so that we may fear less."
Note 2: This is a work in progress. I have more papers to add.