CS231n learning notes
Website: Convolutional Neural Networks for Visual Recognition (Spring 2017)
Video: CS231n Spring 2017
Course Syllabus
slides [done!!!]
- Computer vision overview
- Historical context
- Course logistics
video [done!!!]
slides [done!!!]
- The data-driven approach
- K-nearest neighbor
- Linear classification I
video [done!!!]
python/numpy tutorial [done!!!]
image classification notes [done!!!]
- Intro to Image Classification, data-driven approach, pipeline
- Nearest Neighbor Classifier
- k-Nearest Neighbor
- Validation sets, Cross-validation, hyperparameter tuning
- Pros/Cons of Nearest Neighbor
- Summary
- Summary: Applying kNN in practice
- If your data is very high-dimensional, consider using a dimensionality reduction technique such as PCA (wiki ref, CS229 ref, blog ref) or even Random Projections (see the PCA + kNN sketch after this list).
- Further Reading
Here are some (optional) links you may find interesting for further reading:
- A Few Useful Things to Know about Machine Learning, where section 6 is especially relevant, though the whole paper is warmly recommended reading.
- Recognizing and Learning Object Categories, a short course of object categorization at ICCV 2005.
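The PCA suggestion above can be sketched in a few lines of numpy. This is only an illustrative sketch (random data, 100 components, plain 1-NN), not code from the assignment:

```python
import numpy as np

# Toy data: 500 "images" flattened to 3072-dim rows (e.g. 32x32x3), 10 queries.
X_train = np.random.randn(500, 3072)
y_train = np.random.randint(0, 10, size=500)
X_test = np.random.randn(10, 3072)

# PCA via SVD of the zero-centered training data.
mean = X_train.mean(axis=0)
Xc = X_train - mean
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
V_k = Vt[:100].T                      # keep the top 100 principal components

# Project both sets into the reduced space, then run plain 1-NN (L2 distance).
X_train_pca = Xc.dot(V_k)
X_test_pca = (X_test - mean).dot(V_k)
dists = np.linalg.norm(X_test_pca[:, None, :] - X_train_pca[None, :, :], axis=2)
predictions = y_train[np.argmin(dists, axis=1)]
print(predictions)
```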
linear classification notes [done!!!]
- Intro to Linear classification
- Linear score function
- Interpreting a linear classifier
- Loss function
- Multiclass SVM
- For example, it turns out that including the L2 penalty leads to the appealing max margin property in SVMs (see the CS229 lecture notes for full details if you are interested); a loss sketch follows this list.
- Softmax classifier
- SVM vs Softmax
- Interactive Web Demo of Linear Classification
- Summary
- Further Reading
These readings are optional and contain pointers of interest.
- Deep Learning using Linear Support Vector Machines from Charlie Tang 2013 presents some results claiming that the L2SVM outperforms Softmax.
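To go with the Multiclass SVM and L2-regularization bullets above, here is a small vectorized sketch of the SVM loss with the delta = 1 margin used in the notes; the shapes and the 0.01 initialization are just illustrative:

```python
import numpy as np

def svm_loss(W, X, y, reg, delta=1.0):
    """Multiclass SVM loss with L2 regularization.
    W: (D, C) weights, X: (N, D) rows of data, y: (N,) integer labels."""
    N = X.shape[0]
    scores = X.dot(W)                                  # (N, C) class scores
    correct = scores[np.arange(N), y][:, None]         # (N, 1) score of the true class
    margins = np.maximum(0, scores - correct + delta)  # hinge margins
    margins[np.arange(N), y] = 0                       # the true class contributes no loss
    data_loss = margins.sum() / N
    reg_loss = reg * np.sum(W * W)                     # L2 penalty (behind the max margin property)
    return data_loss + reg_loss

# Tiny usage example with random data.
W = 0.01 * np.random.randn(3072, 10)
X = np.random.randn(5, 3072)
y = np.array([0, 3, 9, 2, 1])
print(svm_loss(W, X, y, reg=1e-3))
```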
slides [done!!!]
- Linear classification II
- Higher-level representations, image features
- Optimization, stochastic gradient descent
video [done!!!]
linear classification notes [done!!!]
same as Lecture 2: linear classification notes
optimization notes [done!!!]
- Introduction
- Visualizing the loss function
- a Stanford class on the topic of convex optimization (other project)
- Subderivative
- Optimization
- Strategy #1: Random Search
- Strategy #2: Random Local Search
- Strategy #3: Following the gradient
- Computing the gradient
- Numerically with finite differences (a short gradient-check sketch follows this list)
- Analytically with calculus
- Gradient descent
- Summary
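The finite-difference bullet above boils down to the centered formula df/dx ≈ (f(x+h) − f(x−h)) / 2h. A minimal sketch, checked against the analytic gradient of f(x) = Σx²; this is my own toy check, not the course starter code:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Centered finite-difference gradient of a scalar function f at x."""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        idx = it.multi_index
        old = x[idx]
        x[idx] = old + h; fp = f(x)     # f(x + h)
        x[idx] = old - h; fm = f(x)     # f(x - h)
        x[idx] = old                    # restore the original value
        grad[idx] = (fp - fm) / (2 * h)
        it.iternext()
    return grad

# Check against the analytic gradient of f(x) = sum(x^2), which is 2x.
x = np.random.randn(3, 4)
num = numerical_gradient(lambda z: np.sum(z ** 2), x)
ana = 2 * x
rel_error = np.abs(num - ana) / np.maximum(1e-8, np.abs(num) + np.abs(ana))
print(rel_error.max())   # should be tiny (~1e-9 or so)
```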
slides [done!!!]
- Backpropagation
- Multi-layer Perceptrons
- The neural viewpoint
video [done!!!]
backprop notes [done!!!]
- Introduction
- Simple expressions, interpreting the gradient
- Compound expressions, chain rule, backpropagation
- Intuitive understanding of backpropagation
- Modularity: Sigmoid example (see the sketch after this list)
- Backprop in practice: Staged computation
- Patterns in backward flow
- Gradients for vectorized operations
- Summary
- References
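For the sigmoid example and staged computation, the whole forward/backward pass fits in a few lines. A sketch using the local gradient σ'(z) = σ(z)(1 − σ(z)); the particular numbers follow the notes' 2-D neuron example, but any values would do:

```python
import numpy as np

# Forward pass, staged: z = w0*x0 + w1*x1 + w2, then s = sigmoid(z)
w = np.array([2.0, -3.0, -3.0])        # w0, w1, bias
x = np.array([-1.0, -2.0])
z = w[0] * x[0] + w[1] * x[1] + w[2]
s = 1.0 / (1.0 + np.exp(-z))           # sigmoid output (~0.73 here)

# Backward pass: the sigmoid gate's local gradient is s * (1 - s),
# then the chain rule distributes it through the multiply/add gates.
dz = s * (1 - s)                                   # ds/dz
dw = np.array([x[0] * dz, x[1] * dz, 1.0 * dz])    # gradients on w0, w1, bias
dx = np.array([w[0] * dz, w[1] * dz])              # gradients on x0, x1
print(s, dw, dx)
```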
linear backprop example [done!!!]
derivatives notes (optional) [done!!!]
Efficient BackProp (optional) [done!!!]
Related (optional) [done!!!]
slides [done!!!]
- History
- Convolution and pooling
- ConvNets outside vision
video [done!!!]
ConvNet notes [done!!!]
- Architecture Overview
- ConvNet Layers
- Convolutional Layer
- The Krizhevsky et al. architecture that won the ImageNet challenge in 2012 accepted images of size [227x227x3] (the output-size arithmetic is sketched after this list).
- However, the benefit is that there are many very efficient implementations of Matrix Multiplication that we can take advantage of (for example, in the commonly used BLAS API).
- As an aside, several papers use 1x1 convolutions, as first investigated by Network in Network.
- A recent development (e.g. see paper by Fisher Yu and Vladlen Koltun) is to introduce one more hyperparameter to the CONV layer called the dilation.
- Pooling Layer
- Many people dislike the pooling operation and think that we can get away without it. For example, Striving for Simplicity: The All Convolutional Net proposes to discard the pooling layer in favor of an architecture that consists only of repeated CONV layers.
- Normalization Layer
- For various types of normalizations, see the discussion in Alex Krizhevsky’s cuda-convnet library API.
- Fully-Connected Layer
- Converting Fully-Connected Layers to Convolutional Layers
- An IPython Notebook on Net Surgery shows how to perform the conversion in practice, in code (using Caffe)
- ConvNet Architectures
- Layer Patterns
- You should rarely ever have to train a ConvNet from scratch or design one from scratch. I also made this point at the Deep Learning school.
- Layer Sizing Patterns
- Case Studies (LeNet / AlexNet / ZFNet / GoogLeNet / VGGNet)
- LeNet (LeNet)
- AlexNet (AlexNet, ImageNet ILSVRC challenge)
- ZF Net (ZF Net)
- GoogLeNet (Szegedy et al, Inception-v4)
- VGGNet (VGGNet, pretrained model)
- ResNet (Residual Network, batch normalization, some recent experiments, Kaiming’s presentation (video, slides), Kaiming He et al. Identity Mappings in Deep Residual Networks (published March 2016))
- Computational Considerations
- Additional References
- Soumith benchmarks for CONV performance
- ConvNetJS CIFAR-10 demo allows you to play with ConvNet architectures and see the results and computations in real time, in the browser.
- Caffe, one of the popular ConvNet libraries.
- State of the art ResNets in Torch7
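To make the output-size arithmetic referenced above concrete, the (W − F + 2P)/S + 1 formula from the notes can be wrapped in a tiny helper; the helper itself is mine, the example numbers are the standard AlexNet/VGG ones:

```python
def conv_output_size(W, F, S=1, P=0):
    """Spatial output size of a CONV/POOL layer:
    W = input width/height, F = filter size, S = stride, P = zero-padding."""
    out = (W - F + 2 * P) / S + 1
    assert out == int(out), "hyperparameters do not tile the input evenly"
    return int(out)

# AlexNet conv1: 227x227 input, 11x11 filters, stride 4, no padding -> 55x55
print(conv_output_size(227, 11, S=4, P=0))   # 55
# A 3x3 filter with stride 1 and padding 1 preserves spatial size, e.g. VGG:
print(conv_output_size(224, 3, S=1, P=1))    # 224
# Pooling with F=2, S=2 halves the spatial size:
print(conv_output_size(224, 2, S=2, P=0))    # 112
```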
slides [done!!!]
- Activation functions, initialization, dropout, batch normalization
video [done!!!]
Neural Nets notes 1 [done!!!]
- Quick intro without brain analogies
- Modeling one neuron
- Biological motivation and connections
- Single neuron as a linear classifier
- Commonly used activation functions (a numpy sketch follows this list)
- Tanh, Krizhevsky et al
- Leaky ReLU, Delving Deep into Rectifiers
- Maxout, One relatively popular choice is the Maxout neuron (introduced recently by Goodfellow et al.)
- Neural Network architectures
- Layer-wise organization
- Example feed-forward computation
- Representational power
- see Approximation by Superpositions of Sigmoidal Function from 1989 (pdf), or this intuitive explanation from Michael Nielsen
- The full picture is much more involved and a topic of much recent research. If you are interested in these topics we recommend for further reading:
  - [x] Deep Learning book in press by Bengio, Goodfellow, Courville, in particular Chapter 6.4.
  - [x] Do Deep Nets Really Need to be Deep?
  - [x] FitNets: Hints for Thin Deep Nets
- Setting number of layers and their sizes
- but some attempts to understand these objective functions have been made, e.g. in a recent paper The Loss Surfaces of Multilayer Networks.
- Summary
- Additional references
- deeplearning.net tutorial with Theano
- ConvNetJS demos for intuitions
- Michael Nielsen’s tutorials
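The commonly used activation functions above are one-liners in numpy. A sketch; the 0.01 leak and the two-piece maxout are typical choices, not the only ones:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))        # squashes to [0, 1]

def tanh(x):
    return np.tanh(x)                      # squashes to [-1, 1], zero-centered

def relu(x):
    return np.maximum(0, x)                # max(0, x); units can "die" if inputs stay negative

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)   # small negative slope instead of a hard zero

def maxout(w1x_b1, w2x_b2):
    # Maxout generalizes ReLU/leaky ReLU: max of two linear pieces (doubles the parameters).
    return np.maximum(w1x_b1, w2x_b2)

x = np.linspace(-3, 3, 7)
print(relu(x), leaky_relu(x))
```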
Neural Nets notes 2 [done!!!]
- Setting up the data and the model
- Data Preprocessing
- Weight Initialization
- Batch Normalization
- Regularization (L2/L1/Maxnorm/Dropout)
- Dropout (Dropout: A Simple Way to Prevent Neural Networks from Overfitting); an inverted-dropout sketch follows this list. Recommended further reading for an interested reader includes:
- Dropout paper by Srivastava et al. 2014.
- Dropout Training as Adaptive Regularization: “we show that the dropout regularizer is first-order equivalent to an L2 regularizer applied after scaling the features by an estimate of the inverse diagonal Fisher information matrix”.
- Loss functions
- Summary
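The dropout bullet above is usually implemented as "inverted dropout": drop and rescale at train time so the forward pass at test time is unchanged. A sketch with an illustrative keep probability p = 0.5 and a made-up one-layer network:

```python
import numpy as np

p = 0.5  # probability of keeping a unit active (illustrative choice)

def train_step(X, W, b):
    H = np.maximum(0, X.dot(W) + b)            # hidden layer (ReLU)
    mask = (np.random.rand(*H.shape) < p) / p  # inverted dropout: drop and rescale now
    return H * mask

def predict(X, W, b):
    return np.maximum(0, X.dot(W) + b)         # no dropout and no rescaling at test time

X = np.random.randn(4, 10)
W = 0.01 * np.random.randn(10, 20)
b = np.zeros(20)
print(train_step(X, W, b).shape, predict(X, W, b).shape)
```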
Neural Nets notes 3 [done!!!]
- Gradient checks
- Stick around active range of floating point. It’s a good idea to read through “What Every Computer Scientist Should Know About Floating-Point Arithmetic”.
- Sanity checks
- Babysitting the learning process
- Loss function
- Train/val accuracy
- Weights:Updates ratio
- Activation/Gradient distributions per layer
- Visualization
- Parameter updates
- First-order (SGD), momentum, Nesterov momentum (update-rule sketches follow this list)
- We recommend this further reading to understand the source of these equations and the mathematical formulation of Nesterov’s Accelerated Momentum (NAG):
  - [x] Advances in optimizing Recurrent Networks by Yoshua Bengio, Section 3.5.
  - [x] Ilya Sutskever’s thesis (pdf) contains a longer exposition of the topic in section 7.2.
- Annealing the learning rate
- Second-order methods
- Additional references:
  - [x] Large Scale Distributed Deep Networks is a paper from the Google Brain team, comparing L-BFGS and SGD variants in large-scale distributed optimization.
  - [x] SFO algorithm strives to combine the advantages of SGD with advantages of L-BFGS.
- Per-parameter adaptive learning rates (Adagrad, RMSProp)
- Adagrad is an adaptive learning rate method originally proposed by Duchi et al.
- RMSprop: everyone who uses this method in their work currently cites slide 29 of Lecture 6 of Geoff Hinton’s Coursera class.
- Adam is a recently proposed update that looks a bit like RMSProp with momentum.
- Unit Tests for Stochastic Optimization proposes a series of tests as a standardized benchmark for stochastic optimization.
- Hyperparameter Optimization
- Prefer random search to grid search. As argued by Bergstra and Bengio in Random Search for Hyper-Parameter Optimization, “randomly chosen trials are more efficient for hyper-parameter optimization than trials on a grid”
- Evaluation
- Model Ensembles
- Summary
- Additional References
- SGD tips and tricks from Leon Bottou
- Efficient BackProp (pdf) from Yann LeCun
- Practical Recommendations for Gradient-Based Training of Deep Architectures from Yoshua Bengio
- Stochastic Gradient Descent Tricks
- Efficient BackProp
- Practical Recommendations for Gradient-Based Training of Deep Architectures
- Deep learning
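The parameter-update rules referenced above are short enough to write out. A sketch of vanilla SGD, momentum, and Adam following the formulas in the notes; the hyperparameter values are just the usual defaults:

```python
import numpy as np

learning_rate, mu = 1e-3, 0.9                # common defaults
beta1, beta2, eps = 0.9, 0.999, 1e-8

x = np.random.randn(10)                      # the parameter vector being optimized
dx = np.random.randn(10)                     # its gradient (from backprop)
v = np.zeros_like(x)                         # momentum velocity
m, vt, t = np.zeros_like(x), np.zeros_like(x), 1   # Adam state

# Vanilla SGD
x_sgd = x - learning_rate * dx

# SGD with momentum: build up velocity along consistent gradient directions
v = mu * v - learning_rate * dx
x_momentum = x + v

# Adam: per-parameter adaptive step from first/second moment estimates (with bias correction)
m = beta1 * m + (1 - beta1) * dx
vt = beta2 * vt + (1 - beta2) * (dx ** 2)
mhat = m / (1 - beta1 ** t)
vhat = vt / (1 - beta2 ** t)
x_adam = x - learning_rate * mhat / (np.sqrt(vhat) + eps)
```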
Assignment #1 due [done!!!]
k-Nearest Neighbor classifier [done!!!]
Training a Support Vector Machine [done!!!]
Implement a Softmax classifier [done!!!]
Two-Layer Neural Network [done!!!]
slides [done!!!]
video [done!!!]
Neural Nets notes 3 (same as Lecture 6) [done!!!]
slides [done!!!]
- Programming GPUs
video [done!!!]
slides [done!!!]
- AlexNet: ImageNet Classification with Deep Convolutional Neural Networks
- VGGNet: Very Deep Convolutional Networks for Large-Scale Image Recognition
- GoogLeNet: Going Deeper with Convolutions
- Network in Network (NiN)
- Improving ResNets
- Beyond ResNets
video [done!!!]
slides [done!!!]
- Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
- GRU: Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation
video [done!!!]
- Code: min-char-rnn
- Code: char-rnn
- Code: neuraltalk2
Assignment #2 [done!!!]
Q1: Fully-connected Neural Network [done!!!]
Q2: Batch Normalization [done!!!]
Q3: Dropout [done!!!]
Q4: Convolutional Networks [done!!!]
Q5: PyTorch on CIFAR-10 / TensorFlow on CIFAR-10 [done!!!]
slides [done!!!]
- Farabet et al, “Learning Hierarchical Features for Scene Labeling,” TPAMI 2013
- Pinheiro and Collobert, “Recurrent Convolutional Neural Networks for Scene Labeling”, ICML 2014
!!! Problem: Very inefficient! Not reusing shared features between overlapping patches
Design network as a bunch of convolutional layers, with downsampling and upsampling inside the network!
- Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
- Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015
- Toshev and Szegedy, “DeepPose: Human Pose Estimation via Deep Neural Networks”, CVPR 2014
Treat localization as a regression problem!
Problem: Need to apply CNN to huge number of locations and scales, very computationally expensive!
- Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.
- Girshick, “Fast R-CNN”, ICCV 2015.
- Ren et al, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, NIPS 2015
- Redmon et al, “You Only Look Once: Unified, Real-Time Object Detection”, CVPR 2016
- Liu et al, “SSD: Single-Shot MultiBox Detector”, ECCV 2016
- Huang et al, “Speed/accuracy trade-offs for modern convolutional object detectors”, CVPR 2017
Aside: Object Detection + Captioning = Dense Captioning
- He et al, “Mask R-CNN”, arXiv 2017
video [done!!!]
slides [done!!!]
- First Layer: Visualize Filters
Krizhevsky, “One weird trick for parallelizing convolutional neural networks”, arXiv 2014
He et al, “Deep Residual Learning for Image Recognition”, CVPR 2016
Huang et al, “Densely Connected Convolutional Networks”, CVPR 2017
- Last Layer: Nearest Neighbors, Dimensionality Reduction
Krizhevsky et al, “ImageNet Classification with Deep Convolutional Neural Networks”, NIPS 2012.
Van der Maaten and Hinton, “Visualizing Data using t-SNE”, JMLR 2008
- Visualizing Activations
Yosinski et al, “Understanding Neural Networks Through Deep Visualization”, ICML DL Workshop 2014.
- Occlusion Experiments
Zeiler and Fergus, “Visualizing and Understanding Convolutional Networks”, ECCV 2014
- Saliency Maps
Simonyan, Vedaldi, and Zisserman, “Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps”, ICLR Workshop 2014.
- Visualizing CNN features: Gradient Ascent (a gradient-ascent sketch follows this list)
Simonyan, Vedaldi, and Zisserman, “Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps”, ICLR Workshop 2014.
Yosinski et al, “Understanding Neural Networks Through Deep Visualization”, ICML DL Workshop 2014.
Nguyen et al, “Multifaceted Feature Visualization: Uncovering the Different Types of Features Learned By Each Neuron in Deep Neural Networks”, ICML Visualization for Deep Learning Workshop 2016.
- Fooling Images / Adversarial Examples
- (1) Start from an arbitrary image
- (2) Pick an arbitrary class
- (3) Modify the image to maximize the class
- (4) Repeat until network is fooled
- DeepDream: Amplify existing features
Mordvintsev, Olah, and Tyka, “Inceptionism: Going Deeper into Neural Networks”, Google Research Blog.
- Feature Inversion
Mahendran and Vedaldi, “Understanding Deep Image Representations by Inverting Them”, CVPR 2015
Johnson, Alahi, and Fei-Fei, “Perceptual Losses for Real-Time Style Transfer and Super-Resolution”, ECCV 2016. Copyright Springer, 2016.
- Neural Texture Synthesis
Gatys, Ecker, and Bethge, “Texture Synthesis Using Convolutional Neural Networks”, NIPS 2015
Johnson, Alahi, and Fei-Fei, “Perceptual Losses for Real-Time Style Transfer and Super-Resolution”, ECCV 2016. Copyright Springer, 2016.
- Neural Style Transfer
Johnson, Alahi, and Fei-Fei, “Perceptual Losses for Real-Time Style Transfer and Super-Resolution”, ECCV 2016.
Gatys, Ecker, and Bethge, “Texture Synthesis Using Convolutional Neural Networks”, NIPS 2015
Gatys, Ecker, and Bethge, “Image style transfer using convolutional neural networks”, CVPR 2016
Figure adapted from Johnson, Alahi, and Fei-Fei, “Perceptual Losses for Real-Time Style Transfer and Super-Resolution”, ECCV 2016.
Ulyanov et al, “Texture Networks: Feed-forward Synthesis of Textures and Stylized Images”, ICML 2016
Dumoulin, Shlens, and Kudlur, “A Learned Representation for Artistic Style”, ICLR 2017
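The gradient-ascent recipe behind “Visualizing CNN features: Gradient Ascent” and the fooling-images steps is the same loop: freeze the weights and take gradient steps on the image itself to push up one class score (minus a small L2 penalty on the image). The sketch below swaps the trained ConvNet for a random linear scorer purely so it is self-contained and runnable; `score_and_grad` is a stand-in I made up, not an assignment function:

```python
import numpy as np

# Stand-in "network": a linear scorer s = W.dot(x). In the lecture this would be a
# trained ConvNet; a linear model keeps the sketch self-contained.
num_classes, dim = 10, 3072
W = np.random.randn(num_classes, dim)

def score_and_grad(x, target):
    scores = W.dot(x)
    return scores[target], W[target]          # d(score_target)/dx for a linear model

# Gradient ascent on the image itself (weights frozen), with a small L2 penalty on the image.
x = np.random.randn(dim)                      # (1) start from an arbitrary image
target = 3                                    # (2) pick an arbitrary class
step, l2 = 1e-2, 1e-3
for _ in range(100):                          # (3) modify the image to maximize the class
    s, dx = score_and_grad(x, target)
    x += step * (dx - l2 * x)                 # ascend the score, keep the image small
print(W.dot(x).argmax() == target)            # (4) check whether the target class now wins
```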
video [done!!!]
slides [done!!!]
- Unsupervised Learning
- Generative Models
○ PixelRNN and PixelCNN
○ Variational Autoencoders (VAE)
○ Generative Adversarial Networks (GAN)
- PixelRNN
- PixelCNN
Kingma and Welling, “Auto-Encoding Variational Bayes”, ICLR 2014
Ian Goodfellow et al., “Generative Adversarial Nets”, NIPS 2014
- Generative Adversarial Nets: Convolutional Architectures
Radford et al, “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks”, ICLR 2016
- See also: https://github.com/soumith/ganhacks for tips and tricks for training GANs