This implements training of residual networks from Deep Residual Learning for Image Recognition by Kaiming He et al.
We wrote a more verbose blog post discussing this code, and ResNets in general here.
See the installation instructions for a step-by-step guide.
- Install Torch on a machine with CUDA GPU
- Install cuDNN v4 and the Torch cuDNN bindings
- Download the ImageNet dataset and move validation images to labeled subfolders
If you already have Torch installed, update `nn`, `cunn`, and `cudnn` (typically with `luarocks install nn`, `luarocks install cunn`, and `luarocks install cudnn`).
See the training recipes for additional examples.
The training scripts come with several options, which can be listed with the `--help` flag.
th main.lua --help
To run the training, simply run `main.lua`. By default, the script runs ResNet-34 on ImageNet with 1 GPU and 2 data-loader threads.
th main.lua -data [imagenet-folder with train and val folders]
To train ResNet-50 on 4 GPUs:
th main.lua -depth 50 -batchSize 256 -nGPU 4 -nThreads 8 -shareGradInput true -data [imagenet-folder]
Trained ResNet-18, -34, -50, and -101 models are available for download. We include instructions for using a custom dataset, classifying an image and getting the model's top-5 predictions, and for extracting image features using a pre-trained model; a minimal loading sketch follows the table below.
The trained models achieve lower error rates than the original ResNet models.
Network | Top-1 error (%) | Top-5 error (%) |
---|---|---|
ResNet-18 | 30.41 | 10.76 |
ResNet-34 | 26.73 | 8.74 |
ResNet-50 | 24.01 | 7.02 |
ResNet-101 | 22.44 | 6.21 |
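As a quick illustration of using a downloaded model, here is a minimal Lua sketch that loads a model file and reads off its top-5 class indices; the file name `resnet-18.t7` and the random placeholder input are assumptions, and the repository's own instructions cover real image loading and mean/std normalization.

```lua
require 'torch'
require 'cudnn'  -- the released models contain cuDNN layers

local model = torch.load('resnet-18.t7')  -- placeholder file name
model:evaluate()  -- put BatchNorm (and any Dropout) into test mode

-- Placeholder input: a real pipeline would load an image, scale it,
-- center-crop to 224x224, and apply the training mean/std normalization
local input = torch.CudaTensor(1, 3, 224, 224):normal()

local output = model:forward(input):float()

-- Top-5 scores and ImageNet class indices for the single image in the batch
local scores, indices = output[1]:topk(5, true, true)
print(scores)
print(indices)
```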
This implementation differs from the ResNet paper in a few ways:
Scale augmentation: We use the scale and aspect-ratio augmentation from Going Deeper with Convolutions instead of the scale augmentation used in the ResNet paper, and find that it gives a lower validation error. A sketch of this crop appears after this list.
Color augmentation: We use the photometric distortions from Andrew Howard's Some Improvements on Deep Convolutional Neural Network Based Image Classification in addition to the AlexNet-style color augmentation used in the ResNet paper.
Weight decay: We apply weight decay to all weights and biases, instead of just the weights of the convolution layers; see the optimizer sketch after this list.
Strided convolution: When using the bottleneck architecture, we use stride 2 in the 3x3 convolution instead of in the first 1x1 convolution; a sketch of the resulting block closes this list.
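To make the scale-augmentation difference concrete, here is a minimal Lua sketch of an Inception-style random-sized crop, assuming the `torch` and `image` packages; the function name `randomSizedCrop`, the area range [0.08, 1], the aspect-ratio range [3/4, 4/3], and the fallback are illustrative assumptions, not a verbatim copy of this repository's transform.

```lua
require 'torch'
local image = require 'image'

-- Sample a crop whose area and aspect ratio are drawn at random, then
-- rescale it to a fixed size (Inception-style scale/aspect augmentation).
-- `input` is a CxHxW tensor; `size` is the output side length, e.g. 224.
local function randomSizedCrop(input, size)
   local H, W = input:size(2), input:size(3)
   for _ = 1, 10 do
      local targetArea = torch.uniform(0.08, 1.0) * H * W
      local aspectRatio = torch.uniform(3/4, 4/3)
      local w = math.floor(math.sqrt(targetArea * aspectRatio) + 0.5)
      local h = math.floor(math.sqrt(targetArea / aspectRatio) + 0.5)
      if torch.uniform() < 0.5 then w, h = h, w end
      if w >= 1 and h >= 1 and w <= W and h <= H then
         local x1 = torch.random(0, W - w)
         local y1 = torch.random(0, H - h)
         local crop = image.crop(input, x1, y1, x1 + w, y1 + h)
         return image.scale(crop, size, size, 'bicubic')
      end
   end
   -- Fallback after 10 failed attempts: plain rescale to size x size
   return image.scale(input, size, size, 'bicubic')
end
```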
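For the weight-decay difference, a minimal sketch of an SGD configuration with the `optim` package follows; the hyperparameter values are the standard ResNet choices, and the key point is that `optim.sgd` adds the `weightDecay` term to the gradient of the entire flattened parameter vector, so biases decay along with convolution weights.

```lua
local optim = require 'optim'

-- optim.sgd applies weightDecay to every entry of the flat parameter
-- vector, so all weights and biases are decayed, not just conv weights.
local optimState = {
   learningRate = 0.1,
   momentum = 0.9,
   dampening = 0.0,
   weightDecay = 1e-4,
   nesterov = true,
}
-- Usage inside the training loop (feval returns loss and gradParams):
-- optim.sgd(feval, params, optimState)
```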
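And to illustrate the strided-convolution difference, here is a hedged Lua sketch of the bottleneck's convolutional branch with `nn` modules; the shortcut branch and the residual addition are omitted for brevity, and the layout is an illustration of where the stride sits, not a verbatim copy of the model definition.

```lua
local nn = require 'nn'

-- Convolutional branch of a bottleneck block with the downsampling
-- stride placed on the 3x3 convolution rather than the first 1x1.
local function bottleneckBranch(nInputPlane, n, stride)
   local s = nn.Sequential()
   s:add(nn.SpatialConvolution(nInputPlane, n, 1, 1, 1, 1))        -- 1x1, always stride 1
   s:add(nn.SpatialBatchNormalization(n))
   s:add(nn.ReLU(true))
   s:add(nn.SpatialConvolution(n, n, 3, 3, stride, stride, 1, 1))  -- 3x3 carries the stride
   s:add(nn.SpatialBatchNormalization(n))
   s:add(nn.ReLU(true))
   s:add(nn.SpatialConvolution(n, n * 4, 1, 1, 1, 1))              -- 1x1 expansion
   s:add(nn.SpatialBatchNormalization(n * 4))
   return s
end
```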