
A curated list of awesome Deep Learning theories that shed light on the mysteries of DL

Awesome Deep Learning Theory

Tutorials and other resources

Materials: Homepage, Slides

Notes:

  • Covers the roles of convexity, overparameterization, depth, unsupervised learning/GANs, and simpler networks (compression)
  • Presents many important phenomena and simple theories.

2. ICML 2018 Invited Talk: Max Welling - Intelligence per Kilowatt-hour

Materials: YouTube (not from ICML)

  • Various views, not only on intelligence per kilowatt-hour, but also on philosophical questions about a computational view of the world.

3. ICML 2018 Invited Talk: Josh Tenenbaum - Building Machines that Learn and Think Like People

Materials: paper (55 pages), ICML video

  • His view on the importance of probabilistic programming languages (BayesFlow, ProbTorch, WebPPL, ...)
  • Symbolic representations are good for abstract reasoning.
  • The intuitive physics engine in our minds (present even in newborns)
  • The hard problem of learning: learning programs is hard (a difficult loss function to optimize)
  • Big question: how do we connect symbolic representations and neural nets?

Papers at ICML 2018

Materials: pdf, code

  • From Michael Jordan's group
  • Instancewise feature selection as a methodology for model interpretation
  • Learn a function to extract a subset of features that are most informative for each given example
  • Develop an efficient variational approximation to the mutual information
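
A minimal sketch of the instancewise-selection idea, assuming hypothetical explainer/approximator networks and a soft top-k mask in place of the paper's subset sampling; the sizes and toy data are illustrative. Minimizing the approximator's cross-entropy on the masked input maximizes a variational lower bound on the mutual information between the selected features and the label.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, k, hidden = 20, 5, 64   # feature dim, number of selected features, width (illustrative)

explainer = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, d))
approximator = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, 2))
opt = torch.optim.Adam(list(explainer.parameters()) + list(approximator.parameters()), lr=1e-2)

# Toy data: the label depends only on the sign of feature 0,
# so a good explainer should concentrate its mask there.
x = torch.randn(512, d)
y = (x[:, 0] > 0).long()

for _ in range(300):
    scores = explainer(x)                 # per-example feature importance scores
    mask = k * F.softmax(scores, dim=1)   # soft relaxation of "pick the top-k features"
    logits = approximator(x * mask)       # predict the label from the selected features only
    loss = F.cross_entropy(logits, y)     # variational lower bound on I(X_S; Y), up to H(Y)
    opt.zero_grad(); loss.backward(); opt.step()

print(F.softmax(explainer(x), dim=1).mean(0)[:3])  # feature 0 should dominate
```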

Materials: pdf, code

  • A nice combination of theory and practice
  • PAC-Bayes Bound (1999) -> PAC-Bayes Bound for Meta-Learning (2014)
  • Stochastic neural nets whose posteriors and priors are factorized Gaussians, so the KL term has a closed form (see below)
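
For reference, one standard (McAllester-style) form of the single-task PAC-Bayes bound, together with the closed-form KL that makes it tractable when the posterior Q and prior P are factorized Gaussians:

```latex
% With probability at least 1 - \delta over an i.i.d. sample S of size m,
% simultaneously for all posteriors Q (prior P fixed before seeing S):
\mathbb{E}_{h \sim Q}[L(h)] \;\le\; \mathbb{E}_{h \sim Q}[\hat{L}_S(h)]
  + \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{2\sqrt{m}}{\delta}}{2m}}

% Closed-form KL for factorized Gaussians:
\mathrm{KL}(Q \,\|\, P) = \sum_i \left[
  \ln\frac{\sigma_{P,i}}{\sigma_{Q,i}}
  + \frac{\sigma_{Q,i}^2 + (\mu_{Q,i} - \mu_{P,i})^2}{2\,\sigma_{P,i}^2}
  - \frac{1}{2} \right]
```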

Materials: pdf

  • Work by Christoph Lampert, who developed PAC-Bayes bounds for meta-learning.
  • Establish a data-dependent notion of algorithmic stability for SGD
  • Develop novel generalization bounds

Materials: pdf

  • A stable learning algorithm: a small change in the training data does not change the learned model too much
  • Stability -> Generalization
  • Establish novel generalization bounds for learning algorithms that converge to global minima.
  • Derive black-box stability results that only depend on the convergence of a learning algorithm and the geometry around the minimizers of the empirical risk function
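
The underlying notion and its classical in-expectation consequence, roughly in the form of Bousquet and Elisseeff:

```latex
% A is \beta-uniformly stable if, for all datasets S, S' differing in one example
% and every test point z:
|\,\ell(A(S), z) - \ell(A(S'), z)\,| \;\le\; \beta

% Uniform stability controls the expected generalization gap:
\mathbb{E}_S\!\left[ L(A(S)) - \hat{L}_S(A(S)) \right] \;\le\; \beta
```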

Materials: pdf, supp

  • Although SGD is commonly viewed as a fast but less accurate version of gradient descent (GD), in practice it often finds better solutions than GD for modern neural nets.
  • To understand this phenomenon, the paper takes the alternative view that SGD works on a convolved (thus smoothed) version of the loss function.
  • Even if the function f has many bad local minima or saddle points, as long as for every point x the weighted average of the gradients of its neighborhood is one-point convex with respect to the desired solution x∗, SGD will get close to, and then stay around, x∗ with constant probability (stated informally below)
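
Stated informally, the two ingredients of this view are the smoothed loss and one-point convexity around the desired solution:

```latex
% Smoothed (convolved) loss, with \xi the SGD noise:
\tilde{f}(x) \;=\; \mathbb{E}_{\xi}\!\left[ f(x - \xi) \right]

% One-point convexity with respect to the desired solution x^*:
\langle \nabla \tilde{f}(x),\; x - x^* \rangle \;\ge\; c\,\|x - x^*\|^2
\quad \text{for all } x \text{ in a neighborhood of } x^*
```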

Materials: pdf, code

  • One of the biggest challenges in GAN research is assessing the quality of generated samples and detecting various levels of mode collapse
  • Propose a novel measure of performance of a GAN by comparing geometrical properties of the underlying data manifold and the generated one, which provides both qualitative and quantitative means for evaluation.

Materials: pdf, code

  • NNets can be compressed to reduce memory and computational requirements, or to increase accuracy by facilitating the use of a larger base architecture.
  • Focus on pruning individual neurons, which can simultaneously trim model size, FLOPs, and run-time memory.
  • Utilize the information bottleneck principle instantiated via a tractable variational bound
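
A generic sketch of the pruning mechanism (not the paper's exact objective): multiplicative Gaussian gates on each output neuron plus a signal-to-noise-style penalty. `GatedLinear`, the layer sizes, the penalty form, and its weight are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GatedLinear(nn.Module):
    """Linear layer whose outputs are scaled by stochastic gates z_i ~ N(mu_i, sigma_i^2).

    Neurons whose gates are pushed toward mu_i ~ 0 by the bottleneck penalty
    carry (almost) no information downstream and can be pruned after training.
    """
    def __init__(self, d_in, d_out):
        super().__init__()
        self.lin = nn.Linear(d_in, d_out)
        self.mu = nn.Parameter(torch.ones(d_out))
        self.log_sigma = nn.Parameter(torch.full((d_out,), -1.0))

    def forward(self, x):
        h = self.lin(x)
        if self.training:  # reparameterized sample of the gates
            z = self.mu + torch.exp(self.log_sigma) * torch.randn_like(h)
        else:              # deterministic gates at test time
            z = self.mu
        return h * z

    def penalty(self):
        # Signal-to-noise-style compression term: only informative neurons
        # "pay" to keep a high-SNR gate; the rest are driven toward mu ~ 0.
        snr = self.mu ** 2 / torch.exp(2 * self.log_sigma)
        return torch.log1p(snr).sum()

layer = GatedLinear(32, 64)
h = layer(torch.randn(8, 32))
loss = h.pow(2).mean() + 1e-2 * layer.penalty()  # placeholder task loss + bottleneck penalty
loss.backward()
print(h.shape, layer.penalty().item())
```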

Materials: pdf, supp

  • Build a rigorous bridge between deep networks (DNs) and approximation theory via spline functions and operators.
  • A large class of DNs can be written as a composition of max-affine spline operators (MASOs).
  • Deep Nets are (learned) matched filter banks.
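
The central object, written out, together with the observation that a fully connected ReLU layer is the special case of R = 2 affine pieces per output (one of them identically zero):

```latex
% Max-affine spline operator (MASO): output coordinate k
\left[\mathrm{MASO}(x)\right]_k \;=\; \max_{r = 1,\dots,R}\; \langle A_{k,r},\, x \rangle + b_{k,r}

% A fully connected ReLU layer is the case R = 2 with one affine piece set to zero:
\mathrm{ReLU}\!\left( \langle w_k, x \rangle + c_k \right)
  \;=\; \max\!\big( \langle w_k, x \rangle + c_k,\; 0 \big)
```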

Materials: pdf, supp

  • The important role of depth in deep learning has been emphasized
  • Argue that sufficient width of a feedforward network is equally important, by answering the question of under which conditions the decision regions of a neural network are connected
  • A sufficiently wide hidden layer is necessary to guarantee that the network can produce disconnected decision regions

Materials: pdf, supp

  • Recent works use PAC-Bayes and Margin-based analyses, but do not as yet result in sample complexity bounds better than naive parameter counting
  • Noise stability -> compression -> better bounds
  • A data-dependent analysis yields stronger generalization bounds (an informal statement follows below)
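
Roughly, the resulting bound has the following flavor: if noise stability lets the network f be compressed to a network f̂ described by q parameters, each taking at most r discrete values, then the compressed net generalizes at a rate governed by q log r rather than by the raw parameter count.

```latex
% Informal statement: with high probability over the m training samples,
L_0(\hat{f}) \;\le\; \hat{L}_{\gamma}(f) \;+\; \tilde{O}\!\left( \sqrt{\frac{q \log r}{m}} \right)
% where \hat{L}_\gamma is the empirical margin loss of the original network f.
```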

Materials: pdf, supp

  • An interesting approach to analyzing NNets that has received renewed attention is to examine the equivalent kernel of the neural network.
  • A fully connected feedforward network with one hidden layer, a certain weight distribution, an activation function, and an infinite number of neurons can be viewed as a mapping into a Hilbert space.
  • Derive the equivalent kernels of MLPs with ReLU or Leaky ReLU activations for all rotationally-invariant weight distributions
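
A quick numerical check of one such equivalent kernel: for ReLU units with standard Gaussian weights, the limiting kernel has a closed form (the degree-one arc-cosine kernel, up to scaling), which a Monte Carlo average over random hidden units should reproduce.

```python
import numpy as np

def relu_equivalent_kernel(x, y):
    # E_w[ReLU(w.x) ReLU(w.y)] for w ~ N(0, I):
    # ||x|| ||y|| (sin t + (pi - t) cos t) / (2 pi), with t the angle between x and y
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    t = np.arccos(np.clip(x @ y / (nx * ny), -1.0, 1.0))
    return nx * ny * (np.sin(t) + (np.pi - t) * np.cos(t)) / (2 * np.pi)

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=3)
W = rng.normal(size=(500_000, 3))                           # many random hidden units
mc = np.mean(np.maximum(W @ x, 0) * np.maximum(W @ y, 0))   # Monte Carlo estimate
print(relu_equivalent_kernel(x, y), mc)                     # the two values should be close
```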

Materials: pdf, supp

  • CNNs: equivariance to translation
  • Can we generalize to other group actions, e.g. rotations?
  • A more general notion of convolution on groups (a toy example follows below)
  • Fourier analysis on groups: spherical CNNs, steerability, and CNNs for manifolds
  • NNets for graphs: message passing neural nets
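
A toy example of convolution over a group, assuming the cyclic rotation group C4 and a lifting correlation built from rotated copies of a single filter; the check at the end illustrates equivariance, i.e. rotating the input rotates each spatial map and cyclically shifts the orientation channel.

```python
import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))
filt = rng.normal(size=(3, 3))

def c4_lift(img, psi):
    """Lifting correlation over C4: one spatial map per 90-degree-rotated filter."""
    return [correlate2d(img, np.rot90(psi, g), mode="valid") for g in range(4)]

out = c4_lift(image, filt)
out_rot = c4_lift(np.rot90(image), filt)

for g in range(4):
    # Rotating the input rotates each map and shifts the group index by one.
    assert np.allclose(out_rot[g], np.rot90(out[(g - 1) % 4]))
print("C4-equivariant:", True)
```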

Materials: pdf, supp

  • Study the number of linear regions, i.e. pieces, that a PWL function represented by a DNN can attain, both theoretically and empirically.
  • Tighter upper and lower bounds for the maximum number of linear regions on rectifier networks.
  • Indicate that a deep rectifier network can only have more linear regions than every shallow counterpart with the same number of neurons if that number exceeds the dimension of the input (an empirical counting sketch follows below)
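
An empirical counting sketch under illustrative assumptions (a tiny random ReLU net and a 1-D slice of input space): each distinct ReLU on/off pattern encountered along the slice corresponds to one linear piece crossed by that slice.

```python
import numpy as np

rng = np.random.default_rng(0)
# Tiny ReLU net 1 -> 8 -> 8 (widths are illustrative assumptions)
W1, b1 = rng.normal(size=(8, 1)), rng.normal(size=8)
W2, b2 = rng.normal(size=(8, 8)), rng.normal(size=8)

xs = np.linspace(-10, 10, 20001).reshape(-1, 1)           # dense 1-D slice of input space
pre1 = xs @ W1.T + b1                                     # pre-activations, layer 1
pre2 = np.maximum(pre1, 0) @ W2.T + b2                    # pre-activations, layer 2
patterns = np.concatenate([pre1 > 0, pre2 > 0], axis=1)   # ReLU on/off pattern per point
n_regions = len(np.unique(patterns, axis=0))
print(n_regions)   # number of linear pieces the slice passes through
```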

Materials: pdf, supp

  • Backpropagation-based visualizations have been proposed to interpret convolutional neural networks (CNNs); however, a theory justifying their behavior has been missing.
  • Develop a theoretical explanation revealing that GBP and DeconvNet are essentially doing (partial) image recovery which is unrelated to the network decisions.

Materials: pdf, supp

  • Context: Everything in DL (CNNs, LSTMs, ReLU, ResNets, ...) depends on gradient descent to find (good) local minima. This fails in game settings, such as GANs, where there are multiple interacting losses.
  • Problem: This paper analyzes why GD fails in games with multiple interacting losses, like GANs, and proposes a way to adjust the gradient dynamics so that GD succeeds.
  • Approach:
    • Decompose the second-order dynamics into two components. The first is related to potential games, which reduces to gradient descent on an implicit function; the second relates to Hamiltonian games, a new class of games that obey a conservation law, akin to conservation laws in classical mechanical systems.
    • Propose Symplectic Gradient Adjustment (SGA), a new algorithm for finding stable fixed points in general games (a toy example follows this list).
  • Result: SGA is competitive for finding Local Nash equilibria in GANs.
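
A toy sketch of SGA on the simplest adversarial (purely Hamiltonian) game, f1(x, y) = xy and f2(x, y) = -xy, chosen here only for illustration; the sign of the adjustment coefficient is fixed by hand rather than by the paper's alignment criterion.

```python
import numpy as np

def xi(v):
    """Simultaneous gradient of the toy zero-sum game f1(x, y) = x*y, f2(x, y) = -x*y."""
    x, y = v
    return np.array([y, -x])                # (df1/dx, df2/dy)

def step(v, lr=0.1, lam=1.0, sga=True):
    g = xi(v)
    if sga:
        # The Jacobian of xi is [[0, 1], [-1, 0]]: purely antisymmetric (A = J, S = 0),
        # so this game is "Hamiltonian" in the paper's decomposition.
        A = np.array([[0.0, 1.0], [-1.0, 0.0]])
        g = g + lam * A.T @ g               # symplectic gradient adjustment
    return v - lr * g

v_gd = v_sga = np.array([1.0, 1.0])
for _ in range(200):
    v_gd, v_sga = step(v_gd, sga=False), step(v_sga, sga=True)
print(np.linalg.norm(v_gd), np.linalg.norm(v_sga))  # plain GD spirals outward, SGA converges to 0
```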

Important Researchers