/caffenet-benchmark

Evaluation of the CNN design choices performance on ImageNet-2012.


Welcome to the evaluation of CNN design choices and their performance on ImageNet-2012. Here you can find the prototxts of the tested nets and the full training logs.

upd.: Here is the technical report version of this benchmark

If you use results from this benchmark, please cite

@Article{CaffeNetBench2017,
  Title                    = {Systematic evaluation of convolution neural network advances on the Imagenet},
  Author                   = {Dmytro Mishkin and Nikolay Sergievskiy and Jiri Matas},
  Journal                  = {Computer Vision and Image Understanding},
  Year                     = {2017},
  Doi                      = {https://doi.org/10.1016/j.cviu.2017.05.007},
  ISSN                     = {1077-3142},
  Keywords                 = {CNN},
  Url                      = {http://www.sciencedirect.com/science/article/pii/S1077314217300814}
}

upd2.: Some of the pretrained models are in the Releases section. They are licensed for unrestricted use.

upd3.: Nice paper on noise sensitivity: Fine-grained Recognition in the Noisy Wild: Sensitivity Analysis of Convolutional Neural Networks Approaches

The basic architecture is similar to CaffeNet, but has several differences:

  1. Images are resized to small side = 128 for speed reasons. Therefore pool5 spatial size is 3x3 instead of 6x6.
  2. fc6 and fc7 layers have 2048 neurons instead of 4096.
  3. Networks are initialized with LSUV-init (code)
  4. Because LRN layers add nothing to accuracy (validated here), they were removed for speed reasons in most experiments.
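The pool5 sizes quoted in item 1 can be checked with a little arithmetic. The sketch below assumes the standard CaffeNet layer hyperparameters (kernel sizes, strides, pads), which are not listed in this README, and Caffe's rounding rules (convolution rounds down, pooling rounds up). The function names are made up for illustration.

```python
import math


def conv_out(size, kernel, stride=1, pad=0):
    # Caffe convolution output size: rounds down.
    return (size + 2 * pad - kernel) // stride + 1


def pool_out(size, kernel, stride, pad=0):
    # Caffe pooling output size: rounds up.
    out = math.ceil((size + 2 * pad - kernel) / stride) + 1
    # With padding, Caffe drops a last window that would start inside the padding.
    if pad > 0 and (out - 1) * stride >= size + pad:
        out -= 1
    return out


def pool5_size(input_size):
    # Assumed CaffeNet layout: conv1 11x11/4, pools 3x3/2,
    # conv2 5x5 pad 2, conv3..conv5 3x3 pad 1.
    s = conv_out(input_size, kernel=11, stride=4)  # conv1
    s = pool_out(s, kernel=3, stride=2)            # pool1
    s = conv_out(s, kernel=5, pad=2)               # conv2
    s = pool_out(s, kernel=3, stride=2)            # pool2
    for _ in range(3):                             # conv3, conv4, conv5
        s = conv_out(s, kernel=3, pad=1)
    return pool_out(s, kernel=3, stride=2)         # pool5


print(pool5_size(128), pool5_size(227))  # 3 6
```

Under these assumptions a 128x128 input gives pool5 = 3x3 and a 227x227 input gives 6x6, which matches the numbers above.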

According to "Neural Network Training Variations in Speech and Subsequent Performance Evaluation", results can vary from run to run (the data order is the same, but the random seeds are different). However, I haven't seen any difference in results across several CaffeNet-ReLU training runs.

On-going evaluations with graphs:

Activations

| Name | Accuracy | LogLoss | Comments |
|---|---|---|---|
| ReLU | 0.470 | 2.36 | With LRN layers |
| ReLU | 0.471 | 2.36 | No LRN, as in rest |
| TanH | 0.401 | 2.78 | |
| 1.73TanH(2x/3) | 0.423 | 2.66 | As recommended in Efficient BackProp, LeCun98 |
| ArcSinH | 0.417 | 2.71 | |
| VLReLU | 0.469 | 2.40 | y=max(x,x/3) |
| RReLU | 0.478 | 2.32 | |
| Maxout | 0.482 | 2.30 | sqrt(2) narrower layers, 2 pieces. Same complexity as for ReLU |
| Maxout | 0.517 | 2.12 | Same width layers, 2 pieces |
| PReLU | 0.485 | 2.29 | |
| ELU | 0.488 | 2.28 | alpha=1, as in paper |
| ELU | 0.485 | 2.29 | alpha=0.5 |
| (ELU+LReLU) / 2 | 0.486 | 2.28 | alpha=1, slope=0.05 |
| SELU = Scaled ELU | 0.470 | 2.38 | 1.05070 * ELU(x, alpha=1.6732) |
| FReLU = ReLU + (learned) bias | 0.488 | 2.27 | |
| FELU = ELU + (learned) bias | 0.489 | 2.28 | |
| Shifted Softplus | 0.486 | 2.29 | Shifted BNLL aka softplus, y = log(1 + exp(x)) - log(2). Same as ELU, as expected |
| No, with max pooling | 0.389 | 2.93 | No non-linearity |
| No, no max pooling | 0.035 | 6.28 | No non-linearity, strided convolution |
| APL2 | 0.471 | 2.38 | 2 linear pieces. Unlike other activations, the current author's implementation leads to different parameters for each x,y position of neuron |
| APL5 | 0.465 | 2.39 | 5 linear pieces. Unlike other activations, the current author's implementation leads to different parameters for each x,y position of neuron |
| ConvReLU,FCMaxout2 | 0.490 | 2.26 | ReLU in convolution, Maxout (sqrt(2) narrower) 2 pieces in FC. Inspired by kaggle and INVESTIGATION OF MAXOUT NETWORKS FOR SPEECH RECOGNITION* |
| ConvELU,FCMaxout2 | 0.499 | 2.22 | ELU in convolution, Maxout (sqrt(2) narrower) 2 pieces in FC. |

*The above analyses show that the bottom layers seem to waste a large portion of the additional parametrisation (figure 2 (a,e)) thus could be replaced, for example, by smaller ReLU layers. Similarly, maxout units in higher layers seem to use piecewise-linear components in a more active way suggesting the use of larger pools.

Prototxt, logs
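For reference, the activation formulas quoted in the Comments column can be written out as scalar functions. This is a minimal sketch in plain Python, not the Caffe layers used for the benchmark; the function names are chosen here for illustration.

```python
import math


def relu(x):
    return max(x, 0.0)


def vlrelu(x):
    # "Very leaky" ReLU from the table: y = max(x, x/3).
    return max(x, x / 3.0)


def scaled_tanh(x):
    # 1.73 * tanh(2x/3), the variant attributed to Efficient BackProp in the table.
    return 1.73 * math.tanh(2.0 * x / 3.0)


def elu(x, alpha=1.0):
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def selu(x):
    # Constants as given in the table: 1.05070 * ELU(x, alpha=1.6732).
    return 1.05070 * elu(x, alpha=1.6732)


def shifted_softplus(x):
    # y = log(1 + exp(x)) - log(2), so that y(0) = 0.
    return math.log(1.0 + math.exp(x)) - math.log(2.0)


for f in (relu, vlrelu, scaled_tanh, elu, selu, shifted_softplus):
    print(f.__name__, [round(f(x), 3) for x in (-2.0, 0.0, 2.0)])
```

Maxout, PReLU, RReLU and APL have learned or random parameters and are left out of this sketch.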

Pooling type

| Name | Accuracy | LogLoss | Comments |
|---|---|---|---|
| MaxPool | 0.471 | 2.36 | |
| Stochastic | 0.438 | 2.54 | Underfitting, maybe try without Dropout |
| Stochastic, no dropout | 0.429 | 2.96 | Stoch pool does not prevent overfitting without dropout :(. Good start, bad finish |
| AvgPool | 0.435 | 2.56 | |
| Max+AvgPool | 0.483 | 2.29 | Element-wise sum |
| NoPool | 0.472 | 2.35 | Strided conv2, conv3, conv4 |
| General | - | - | Depends on arch |
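The "Max+AvgPool" row sums the outputs of a max pooling and an average pooling over the same windows. A minimal sketch on a single 2D feature map (a list of lists) is shown below. It only uses windows that fit fully inside the map, which is a simplification compared to Caffe's round-up pooling, and the helper names are hypothetical.

```python
def pool2d(fmap, kernel, stride, reduce_fn):
    # Slide a kernel x kernel window over the map and reduce each window.
    height, width = len(fmap), len(fmap[0])
    out = []
    for i in range(0, height - kernel + 1, stride):
        row = []
        for j in range(0, width - kernel + 1, stride):
            window = [fmap[i + di][j + dj]
                      for di in range(kernel) for dj in range(kernel)]
            row.append(reduce_fn(window))
        out.append(row)
    return out


def max_plus_avg_pool(fmap, kernel=3, stride=2):
    # Element-wise sum of max pooling and average pooling.
    pooled_max = pool2d(fmap, kernel, stride, max)
    pooled_avg = pool2d(fmap, kernel, stride, lambda w: sum(w) / len(w))
    return [[m + a for m, a in zip(row_m, row_a)]
            for row_m, row_a in zip(pooled_max, pooled_avg)]


fmap = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
print(max_plus_avg_pool(fmap))  # [[14.0]]: max 9 + mean 5.0
```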

Pooling window/stride

| Name | Accuracy | LogLoss | Comments |
|---|---|---|---|
| MaxPool 3x3/2 | 0.471 | 2.36 | default alexnet |
| MaxPool 2x2/2 | 0.484 | 2.29 | Leads to larger feature map, Pool5=4x4 instead of 3x3 |
| MaxPool 3x3/2 pad1 | 0.488 | 2.25 | Leads to even larger feature map, Pool5=5x5 instead of 3x3 |

Prototxt, logs

CLF architecture

| Name | Accuracy | LogLoss | Comments |
|---|---|---|---|
| Default ReLU | 0.470 | 2.36 | fc6 = conv 3x3x2048 -> fc7 2048 -> 1000 fc8 |
| Conv5-fc6=2048C3_2048C1_clf_avg | 0.494 | 2.34 | no pool5 -> fc6 = conv 3x3x2048 -> fc7=conv 1x1x2048 -> fc8 as 1x1 conv -> ave_pool. |
| Pool5-fc6=2048C3_2048C1_avg_clf | 0.489 | 2.28 | no pool5 -> fc6 = conv 3x3x2048 -> fc7=conv 1x1x2048 -> ave_pool -> fc8 |
| SPP2-FC-FC | 0.471 | 2.36 | pool5 = SPP with 2 levels (2x2 and 1x1) -> FC6 -> FC7 |
| SPP3-FC-FC | 0.483 | 2.30 | pool5 = SPP with 3 levels (3x3 and 2x2 and 1x1) -> FC6 -> FC7 |
| fc6=512C3_1024C3_1536C1 | 0.482 | 2.52 | pool5 zero pad -> fc6 = conv 3x3x512 -> fc7=conv 3x3x1024 -> 1x1x1536 -> fc8 as 1x1 conv -> ave_pool. |
| fc6=512C3_1024C3_1536C1_drop | 0.491 | 2.29 | pool5 zero pad -> fc6 = conv 3x3x512 -> fc7=conv 3x3x1024 -> drop 0.3 -> 1x1x1536 -> drop 0.5 -> fc8 as 1x1 conv -> ave_pool. |
| Default ReLU, 4096 | 0.497 | 2.24 | fc6 = conv 3x3x4096 -> fc7 4096 -> 1000 fc8 == original caffenet |

pool5pad: the following nets were mistakenly trained with the ELU non-linearity instead of the default ReLU

| Name | Accuracy | LogLoss | Comments |
|---|---|---|---|
| Default ELU | 0.488 | 2.28 | fc6 = conv 3x3x2048 -> fc7 2048 -> 1000 fc8 |
| pool5pad_fc6ave | 0.481 | 2.32 | pool5 zero pad -> fc6 = conv 3x3x2048 -> AvePool -> as usual |
| pool5pad_fc6ave_fc7as1x1fc8ave | 0.511 | 2.21 | pool5 zero pad -> fc6 = conv 3x3x2048 -> fc7 as 1x1 conv -> fc8 as 1x1 conv -> ave_pool. |
| pool5pad_fc6ave_fc7as1x1avefc8 | 0.508 | 2.22 | pool5 zero pad -> fc6 = conv 3x3x2048 -> fc7 as 1x1 conv -> ave_pool -> fc8 |
| pool5pad_fc6ave_fc7as1x1_avemax_fc8 | 0.509 | 2.19 | pool5 zero pad -> fc6 = conv 3x3x2048 -> fc7 as 1x1 conv -> fc8 as 1x1 conv -> ave_pool + max_pool. |

Prototxt, logs

Conv1 parameters

| Name | Accuracy | LogLoss | Comments |
|---|---|---|---|
| Default, 128_K11_S4 | 0.471 | 2.36 | Input size = 128x128px, conv1 = 11x11x96, stride = 4 |
| 224_K11_S8 | 0.453 | 2.45 | Input size = 256x256px, conv1 = 11x11x96, stride = 8. Not finished yet |
| 160_K11_S5 | 0.470 | 2.35 | Input size = 160x160px, conv1 = 11x11x96, stride = 5 |
| 96_K7_S3 | 0.459 | 2.43 | Input size = 96x96px, conv1 = 7x7x96, stride = 3 |
| 64_K5_S2 | 0.445 | 2.50 | Input size = 64x64px, conv1 = 5x5x96, stride = 2 |
| 32_K3_S1 | 0.390 | 2.84 | Input size = 32x32px, conv1 = 3x3x96, stride = 1 |
| 4x slower, 227_K11_S4 | 0.565 | 1.87 | Input size = 227x227px, conv1 = 11x11x96, stride = 4. Not finished yet |

prototxt, logs

Squeezing representation

For example, for using activations in image retrieval.

| Name | Accuracy | LogLoss | Comments |
|---|---|---|---|
| pool5pad_fc6ave_fc7as1x1fc8ave | 0.508 | 2.22 | Baseline. pool5 zero pad -> fc6 = conv 3x3x2048 -> fc7 as 1x1 conv -> ave_pool -> fc8 as 1x1 conv. |
| pool5pad_fc6ave_fc7as1x1=512_fc8ave | 0.489 | 2.30 | fc7 as 1x1 conv = 512 |
| pool5pad_fc6ave_fc7as1x1_bottleneck=512_fc8ave | 0.490 | 2.28 | fc7 as 1x1 conv = 2048 then fc7a = 512 |

Prototxt, logs

Solvers

| Name | Accuracy | LogLoss | Comments |
|---|---|---|---|
| SGD with momentum | 0.471 | 2.36 | |
| Nesterov | 0.473 | 2.34 | |
| RMSProp | 0.327 | 3.20 | rms_decay=0.9, delta=1.0 |
| RMSProp | 0.453 | 2.45 | rms_decay=0.9, delta=1.0, base_lr: 0.045, stepsize=10K, gamma=0.94 (from here) |
| RMSProp | 0.451 | 2.43 | rms_decay=0.9, delta=1.0, base_lr: 0.1, stepsize=10K, gamma=0.94 |
| RMSProp | 0.472 | 2.36 | rms_decay=0.9, delta=1.0, base_lr: 0.1, stepsize=5K, gamma=0.94 |
| RMSProp | 0.486 | 2.28 | rms_decay=0.9, delta=1.0, lr=0.1, linear lr_policy |
| SGD with momentum, linear | 0.493 | 2.24 | linear lr_policy |

Do not converge at all:

ADAM: lr=0.001 m=0.9 m2=0.999 delta=1e-8; lr=0.001 m=0.95 m2=0.999 delta=1e-8; lr=0.001 m=0.95 m2=0.999 delta=1e-7; lr=0.01 m=0.9 m2=0.999 delta=1e-8; lr=0.01 m=0.9 m2=0.999 delta=1e-7; lr=0.01 m=0.9 m2=0.999 delta=1e-9; lr=0.01 m=0.9 m2=0.99 delta=1e-8; lr=0.01 m=0.9 m2=0.999 delta=1e-8; lr=0.01 m=0.95 m2=0.999 delta=1e-9

AdaDelta: delta: 1e-5

RMSProp: lr=0.01, rms_decay=0.5; lr=0.01, rms_decay=0.9; lr=0.01, rms_decay=0.95; lr=0.01, rms_decay=0.98; lr=0.001, rms_decay=0.9; lr=0.001, rms_decay=0.98

Converge, but much worse than SGD:

Adagrad: lr=0.01; lr=0.02

AdaDelta: delta: 1e-6; delta: 1e-7; delta: 1e-8

RMSProp: lr=0.01, rms_decay=0.99

Prototxt, logs

LR-policy

| Name | Accuracy | LogLoss | Comments |
|---|---|---|---|
| Step 100K | 0.471 | 2.36 | Default caffenet solver, max_iter=320K |
| Poly lr, p=0.5, sqrt | 0.483 | 2.29 | bvlc_quick_googlenet_solver. Worse than "step" throughout training, but ahead at the finish |
| Poly lr, p=2.0, sqr | 0.483 | 2.299 | |
| Poly lr, p=1.0, linear | 0.493 | 2.24 | |
| Poly lr, p=1.0, linear | 0.466 | 2.39 | max_iter=160K |
| Exp, 0.035 | 0.441 | 2.53 | max_iter=160K, stepsize=2K, gamma=0.915, same as in base_dereyly |

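The "step" and "poly" rows refer to Caffe's learning-rate policies. The sketch below writes out the two schedules as plain functions. The defaults base_lr=0.01, gamma=0.1 and stepsize=100K are assumptions taken from the stock caffenet solver, not values stated in this README; max_iter=320K is from the table.

```python
def step_lr(iteration, base_lr=0.01, gamma=0.1, stepsize=100_000):
    # Caffe "step" policy: base_lr * gamma ^ floor(iter / stepsize).
    return base_lr * gamma ** (iteration // stepsize)


def poly_lr(iteration, max_iter=320_000, base_lr=0.01, power=1.0):
    # Caffe "poly" policy: base_lr * (1 - iter / max_iter) ^ power.
    # power=1.0 is the "linear" policy from the table,
    # power=0.5 the "sqrt" one, power=2.0 the "sqr" one.
    return base_lr * (1.0 - iteration / max_iter) ** power


for it in (0, 100_000, 160_000, 300_000):
    print(it, step_lr(it), poly_lr(it), poly_lr(it, power=0.5), poly_lr(it, power=2.0))
```

All poly variants reach zero learning rate exactly at max_iter, while the step policy keeps a constant rate between drops.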
LR-policy-BatchNorm-Dropout = 0.2

| Name | Accuracy | LogLoss | Comments |
|---|---|---|---|
| Step 100K | 0.527 | 2.09 | Default caffenet solver, max_iter=320K |
| Poly lr, p=1.0, linear | 0.496 | 2.24 | max_iter=105K |
| Poly lr, p=1.0, start_lr=0.02 | 0.505 | 2.21 | max_iter=105K |
| Exp, 0.035 | 0.506 | 2.19 | max_iter=160K, stepsize=2K, gamma=0.915, same as in base_dereyly |

Prototxt, logs

Regularization

| Name | Accuracy | LogLoss | Comments |
|---|---|---|---|
| default | 0.471 | 2.36 | weight_decay=0.0005, L2, fc-dropout=0.5 |
| wd=0.0001 | 0.450 | 2.48 | weight_decay=0.0001, L2, fc-dropout=0.5 |
| wd=0.00001 | 0.450 | 2.48 | weight_decay=0.00001, L2, fc-dropout=0.5 |
| wd=0.00001_L1 | 0.453 | 2.45 | weight_decay=0.00001, L1, fc-dropout=0.5 |
| drop=0.3 | 0.497 | 2.25 | weight_decay=0.0005, L2, fc-dropout=0.3 |
| drop=0.2 | 0.494 | 2.28 | weight_decay=0.0005, L2, fc-dropout=0.2 |
| drop=0.1 | 0.473 | 2.45 | weight_decay=0.0005, L2, fc-dropout=0.1. Same accuracy as with 0.5, but higher logloss |

Prototxt, logs

Dropout and width

The hypothesis "same number of effective neurons = same performance" does not appear to hold

| Name | Accuracy | LogLoss | Comments |
|---|---|---|---|
| fc6,fc7=2048, dropout=0.5 | 0.471 | 2.36 | default |
| fc6,fc7=2048, dropout=0.3 | 0.497 | 2.25 | best for fc6,fc7=2048. (1-0.3)*2048=1433 neurons work each time |
| fc6,fc7=4096, dropout=0.65 | 0.465 | 2.38 | (1-0.65)*4096=1433 neurons work each time |
| fc6,fc7=6144, dropout=0.77 | 0.447 | 2.48 | (1-0.77)*6144=1433 neurons work each time |
| fc6,fc7=4096, dropout=0.5 | 0.497 | 2.24 | |
| fc6,fc7=1433, dropout=0 | 0.456 | 2.52 | |

Prototxt, logs
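The "effective neurons" arithmetic from the table can be reproduced as follows. This is only a sketch of the bookkeeping: it computes the dropout ratio that keeps a target number of units active on average for a given layer width. The function names are illustrative.

```python
def expected_active(width, dropout):
    # Average number of units that stay active under the given dropout ratio.
    return (1.0 - dropout) * width


def dropout_for_active(width, active=1433):
    # Dropout ratio needed so that `active` units remain on average.
    return 1.0 - active / width


for width in (2048, 4096, 6144):
    ratio = dropout_for_active(width)
    print(width, round(ratio, 2), round(expected_active(width, ratio)))
```

The exact ratios are about 0.300, 0.650 and 0.767, so the dropout values in the table appear to be rounded to two digits.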

Architectures

CaffeNet only

| Name | Accuracy | LogLoss | Comments |
|---|---|---|---|
| CaffeNet256 | 0.565 | 1.87 | Reference BVLC model, LSUV init |
| CaffeNet128 | 0.470 | 2.36 | Pool5 = 3x3 |
| CaffeNet128_4096 | 0.497 | 2.24 | Pool5 = 3x3, fc6-fc7=4096 |
| CaffeNet128All | 0.530 | 2.05 | All improvements without caffenet arch change: ELU + SPP + color_trans3-10-3 + Nesterov + (AVE+MAX) Pool + linear lr_policy. +0.06 gain over vanilla caffenet128. "Sum of gains" = 0.018 + 0.013 + 0.015 + 0.003 + 0.013 + 0.023 = 0.085 |
| SqueezeNet128 | 0.530 | 2.08 | Reference solver, but linear lr_policy and batch_size=256 (320K iters). WITHOUT tricks like ELU, SPP, AVE+MAX, etc. |
| SqueezeNet128 | 0.547 | 2.08 | New SqueezeNet solver. WITHOUT tricks like ELU, SPP, AVE+MAX, etc. |
| SqueezeNet224 | 0.592 | 1.80 | New SqueezeNet solver. WITHOUT tricks like ELU, SPP, AVE+MAX, etc., 2 GPU |
| CaffeNet256All | 0.613 | 1.64 | All improvements without caffenet arch change: ELU + SPP + color_trans3-10-3 + Nesterov + (AVE+MAX) Pool + linear lr_policy |
| CaffeNet128, no pad | 0.411 | 2.70 | No padding, but conv1 stride=2 instead of 4 to keep size of pool5 the same |
| CaffeNet128, dropout in conv | 0.426 | 2.60 | Dropout before pool2 = 0.1, after conv3 = 0.1, after conv4 = 0.2 |
| CaffeNet128SPP | 0.483 | 2.30 | SPP = 3x3 + 2x2 + 1x1 |
| DarkNet128BN | 0.502 | 2.25 | 16C3->MP2->32C3->MP2->64C3->MP2->128C3->MP2->256C3->MP2->512C3->MP2->1024C3->1000CLF.BN + PReLU + base_lr=0.035, exp lr_policy, 160K iters |
| NiN128 | 0.519 | 2.15 | Step lr_policy. Be careful not to use dropout on maxpool in-place |

Others

| Name | Accuracy | LogLoss | Comments |
|---|---|---|---|
| DarkNetBN | 0.502 | 2.25 | 16C3->MP2->32C3->MP2->64C3->MP2->128C3->MP2->256C3->MP2->512C3->MP2->1024C3->1000CLF.BN |
| HeNet2x2 | 0.561 | 1.88 | No SPP, Pool5 = 3x3, VLReLU, J' from paper |
| HeNet3x1 | 0.560 | 1.88 | No SPP, Pool5 = 3x3, VLReLU, J' from paper, 2x2->3x1 |
| GoogLeNet128 | 0.619 | 1.61 | linear lr_policy, batch_size=256. Obviously slower than caffenet |
| [GoogLeNet128_BN_lim0606](https://github.com/lim0606/caffe-googlenet-bn) | 0.645 | 1.54 | BN before ReLU + scale bias, linear LR, batch_size = 128, base_lr = 0.005, 640K iter, LSUV init. 5x5 replaced by two 3x3, no in-place |
| GoogLeNet128Res | 0.634 | 1.56 | linear lr_policy, batch_size=256. Residual connections between inception blocks. No BN |
| GoogLeNet128Res_color | 0.638 | 1.52 | linear lr_policy, batch_size=256. Residual connections between inception blocks. No BN. + color_trans3-10-3 |
| googlenet_loss2_clf | 0.571 | 1.80 | from net above, aux classifier after inception_4d |
| googlenet_loss1_clf | 0.520 | 2.06 | from net above, aux classifier after inception_4a |
| fitnet1_elu | 0.333 | 3.21 | |
| VGGNet16_128 | 0.651 | 1.46 | Surprisingly much better than GoogLeNet128, even with step-based solver. |
| VGGNet16_128_All | 0.682 | 1.47 | ELU (a=0.5; a=1 leads to divergence :( ), avg+max pool, color conversion, linear lr_policy |

ResNet attempts are moved to ResNets.md

ResNets, good attempts

| Name | Accuracy | LogLoss | Comments |
|---|---|---|---|
| ResNet-50ELU-2xThinner | 0.616 | 1.63 | Without BN, ELU, dropout=0.2 before classifier. 2x thinner than in paper. Quite fast. No large overfitting (unlike upper table) |
| GoogLeNet-128 | 0.619 | 1.61 | For reference. linear lr_policy, batch_size=256. |
| GoogLeNet128Res | 0.634 | 1.56 | linear lr_policy, batch_size=256. Residual connections between inception blocks. No BN |
| VggLikeResNet-50-ELU-RoR-var | 0.626 | 1.59 | Step LR policy, max_iter = 200K, no BN, 4x thinner than VGG, Residual on residual. |
| VggLikeResNet-50-ELU | 0.632 | 1.57 | Step LR policy, max_iter = 200K, no BN, 4x thinner than VGG. More RoR. |
| VggLikeResNet-50-ELU-RoR 1x5 | 0.628 | 1.58 | Step LR policy, max_iter = 200K, no BN, 4x thinner than VGG. 1x5 layers |
| VggLikeResNet-50-ELU-RoR 1x3 | 0.631 | 1.58 | Step LR policy, max_iter = 200K, no BN, 4x thinner than VGG. |

Train augmentation

| Name | Accuracy | LogLoss | Comments |
|---|---|---|---|
| Default | 0.471 | 2.36 | Random flip, random crop 128x128 from 144xN, N > 144 |
| Drop 0.1 | 0.306 | 3.56 | + Input dropout 10%. Not finished, 186K iters result |
| Multiscale | 0.462 | 2.40 | Random flip, random crop 128x128 from (144xN - 50%, 188xN - 20%, 256xN - 20%, 130xN - 10%) |
| 5 deg rot | 0.448 | 2.47 | Random rotation to [0..5] degrees. |

Prototxt, logs
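The default augmentation (random flip plus a random 128x128 crop from a 144xN image) can be sketched in a few lines. This is an illustration on a nested-list image, not the Caffe data layer used for training. The function name and the 0.5 flip probability are assumptions.

```python
import random


def random_crop_flip(img, crop=128, rng=random):
    # img is a list of rows, each row a list of pixels.
    height, width = len(img), len(img[0])
    top = rng.randrange(height - crop + 1)
    left = rng.randrange(width - crop + 1)
    patch = [row[left:left + crop] for row in img[top:top + crop]]
    if rng.random() < 0.5:  # assumed flip probability
        patch = [row[::-1] for row in patch]  # horizontal mirror
    return patch


# Dummy 144x200 "image" whose pixels store their own coordinates.
image = [[(y, x) for x in range(200)] for y in range(144)]
patch = random_crop_flip(image, rng=random.Random(0))
print(len(patch), len(patch[0]))  # 128 128
```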

Colorspace

| Name | Accuracy | LogLoss | Comments |
|---|---|---|---|
| RGB | 0.471 | 2.36 | default, no changes. Input = 0.04 * (Img - [104, 117, 124]) |
| RGB_by_BN | 0.469 | 2.38 | Input = BatchNorm(Img) |
| CLAHE | 0.467 | 2.38 | RGB -> LAB -> CLAHE(L) -> RGB -> BatchNorm(RGB) |
| HISTEQ | 0.448 | 2.48 | RGB -> HistEq |
| YCrCb | 0.458 | 2.42 | RGB -> YCrCb -> BatchNorm(YCrCb) |
| HSV | 0.451 | 2.46 | RGB -> HSV -> BatchNorm(HSV) |
| Lab | - | - | Loss does not leave 6.90 (chance level for 1000 classes, ln 1000) after 1.5K iters |
| RGB->10->3 TanH | 0.463 | 2.40 | RGB -> conv1x1x10 tanh -> conv1x1x3 tanh |
| RGB->10->3 VlReLU | 0.485 | 2.28 | RGB -> conv1x1x10 vlrelu -> conv1x1x3 vlrelu |
| RGB->10->3 Maxout | 0.488 | 2.26 | RGB -> conv1x1x10 maxout(2) -> conv1x1x3 maxout(2) |
| RGB->16->3 VlReLU | 0.483 | 2.30 | RGB -> conv1x1x16 vlrelu -> conv1x1x3 vlrelu |
| RGB->3->3 VlReLU | 0.480 | 2.32 | RGB -> conv1x1x3 vlrelu -> conv1x1x3 vlrelu |
| RGB->10->3 VlReLU->sum(RGB) | 0.482 | 2.30 | RGB -> conv1x1x10 vlrelu -> conv1x1x3 -> sum(RGB) -> vlrelu |
| RGB and log(RGB)->10->3 VlReLU | 0.482 | 2.29 | RGB and log(RGB) -> conv1x1x10 vlrelu -> conv1x1x3 vlrelu |
| RGB and log(RGB) and log(256-RGB)->10->3 VlReLU | 0.484 | 2.29 | RGB and log(RGB) and log(256 - RGB) -> conv1x1x10 vlrelu -> conv1x1x3 vlrelu |
| NN-Scale | 0.467 | 2.38 | Nearest neighbor instead of linear interpolation for rescale. Faster, but worse :( |
| concat_rgb_each_pool | 0.441 | 2.51 | Concat avepoolRGB with each pool |
| OpenCV RGB2Gray | 0.413 | 2.70 | RGB -> Grayscale. Gray = 0.299 R + 0.587 G + 0.114 B |
| Learned RGB2Gray | 0.419 | 2.66 | RGB -> conv1x1x1. Gray = -1.779 * R + 6.511 * G + 1.493 * B + 3.279 |

Prototxt, logs
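The per-pixel formulas from the Comments column (default input scaling and the two grayscale conversions) are simple enough to write out. In this sketch the channel order of the mean triple is not specified by the README, so the pixel is assumed to use the same order as the mean. The function names are illustrative.

```python
MEAN = (104.0, 117.0, 124.0)  # per-channel mean from the table
SCALE = 0.04                  # input scale from the table


def normalize_pixel(pixel, mean=MEAN, scale=SCALE):
    # Default input: 0.04 * (Img - mean), applied per channel.
    # Assumes `pixel` uses the same channel order as `mean`.
    return tuple(scale * (value - m) for value, m in zip(pixel, mean))


def opencv_gray(r, g, b):
    # Fixed weights quoted for the "OpenCV RGB2Gray" row.
    return 0.299 * r + 0.587 * g + 0.114 * b


def learned_gray(r, g, b):
    # Weights and bias quoted for the "Learned RGB2Gray" row (a 1x1x1 convolution).
    return -1.779 * r + 6.511 * g + 1.493 * b + 3.279


print(normalize_pixel((104, 117, 124)))
print(opencv_gray(200, 100, 50), learned_gray(200, 100, 50))
```

The learned conversion puts most of its weight on the green channel and gives red a negative weight.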

Batch normalization

BN-paper, caffe-PR. Note that the results are obtained without the additional y=kx+b layer mentioned in the paper.

BN -- before or after ReLU?

| Name | Accuracy | LogLoss | Comments |
|---|---|---|---|
| Before | 0.474 | 2.35 | As in paper |
| Before + scale&bias layer | 0.478 | 2.33 | As in paper |
| After | 0.499 | 2.21 | |
| After + scale&bias layer | 0.493 | 2.24 | |

So in all subsequent experiments, BN is placed after the non-linearity
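The two orderings compared above differ only in where the normalization sits relative to the activation. The sketch below treats one channel as a flat list of scalars and omits the learned scale and bias, so it is a toy illustration of the ordering rather than the Caffe BatchNorm layer.

```python
import math


def batch_norm(values, eps=1e-5):
    # Normalize a batch of scalars to zero mean and (roughly) unit variance.
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def relu(values):
    return [max(v, 0.0) for v in values]


def bn_before_relu(values):
    # "Before": conv -> BN -> ReLU, as in the BN paper.
    return relu(batch_norm(values))


def bn_after_relu(values):
    # "After": conv -> ReLU -> BN, the order used in the later experiments.
    return batch_norm(relu(values))


activations = [-2.0, -1.0, 0.0, 1.0, 2.0]
print(bn_before_relu(activations))
print(bn_after_relu(activations))
```

With BN after ReLU the next layer receives zero-mean inputs, while with BN before ReLU it receives non-negative ones.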

BN and activations

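| Name | Accuracy | LogLoss | Comments |
|---|---|---|---|
| ReLU | 0.499 | 2.21 | |
| RReLU | 0.500 | 2.20 | |
| PReLU | 0.503 | 2.19 | |
| ELU | 0.498 | 2.23 | |
| Maxout | 0.487 | 2.28 | |
| Sigmoid | 0.475 | 2.35 | |
| TanH | 0.448 | 2.50 | |
| No | 0.384 | 2.96 | |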

BN and dropout

ReLU non-linearity, fc6 and fc7 layer only

| Name | Accuracy | LogLoss | Comments |
|---|---|---|---|
| Dropout = 0.5 | 0.499 | 2.21 | |
| Dropout = 0.2 | 0.527 | 2.09 | |
| Dropout = 0 | 0.513 | 2.19 | |

Prototxt, logs

BN-arch-init

| Name | Accuracy | LogLoss | Comments |
|---|---|---|---|
| Caffenet | 0.471 | 2.36 | |
| Caffenet BN Before + scale&bias layer LSUV | 0.478 | 2.33 | |
| Caffenet BN Before + scale&bias layer Ortho | 0.482 | 2.31 | |
| Caffenet BN After LSUV | 0.499 | 2.21 | |
| Caffenet BN After Ortho | 0.500 | 2.20 | |

| Name | Accuracy | LogLoss | Comments |
|---|---|---|---|
| GoogLeNet128 | 0.619 | 1.61 | |
| GoogLeNet BN Before + scale&bias layer LSUV | 0.603 | 1.68 | |
| GoogLeNet BN Before + scale&bias layer Ortho | 0.607 | 1.67 | |
| GoogLeNet BN After LSUV | 0.596 | 1.70 | |
| GoogLeNet BN After Ortho | 0.584 | 1.77 | |
| [GoogLeNet128_BN_lim0606](https://github.com/lim0606/caffe-googlenet-bn) | 0.645 | 1.54 | BN before ReLU + scale bias, linear LR, batch_size = 128, base_lr = 0.005, 640K iter, LSUV init, 5x5 replaced with 3x3 + 3x3. 3x3 replaced with 3x1+1x3 |

Prototxt, logs

Batch size, ReLU

Tanh results are moved [here](https://github.com/ducha-aiki/caffenet-benchmark/blob/master/BatchSize.md)

| Name | Accuracy | LogLoss | Comments |
|---|---|---|---|
| BS=1024, 4xlr | 0.465 | 2.38 | lr=0.04, 80K iters |
| BS=1024 | 0.419 | 2.65 | lr=0.01, 80K iters |
| BS=512, 2xlr | 0.469 | 2.37 | lr=0.02, 160K iters |
| BS=512 | 0.455 | 2.46 | lr=0.01, 160K iters |
| BS=256, default | 0.471 | 2.36 | lr=0.01, 320K iters |
| BS=128 | 0.472 | 2.35 | lr=0.01, 640K iters |
| BS=128, 1/2 lr | 0.470 | 2.36 | lr=0.005, 640K iters |
| BS=64 | 0.471 | 2.34 | lr=0.01, 1280K iters |
| BS=64, 1/4 lr | 0.475 | 2.34 | lr=0.0025, 1280K iters |
| BS=32 | 0.463 | 2.40 | lr=0.01, 2560K iters |
| BS=32, 1/8 lr | 0.470 | 2.37 | lr=0.00125, 2560K iters |
| BS=1, 1/256 lr | 0.474 | 2.35 | lr=3.9063e-05, 81920K iters. Online training |

Prototxt, logs

So the general recommendation: too large a batch size leads to slightly inferior results, but in general batch_size should be selected based on computation speed. If the learning rate is adjusted, then there is no practical difference between different batch sizes.
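The adjusted rows of the batch size table follow a simple rule: the learning rate scales linearly with the batch size, and the number of iterations scales inversely, so every run sees the same number of images. A minimal sketch of that rule, using the default row (BS=256, lr=0.01, 320K iters) as the reference; the function name is made up for illustration:

```python
def scaled_schedule(batch_size, ref_batch=256, ref_lr=0.01, ref_iters=320_000):
    # Linear learning-rate scaling with a constant number of processed images.
    factor = batch_size / ref_batch
    return ref_lr * factor, round(ref_iters / factor)


for bs in (1024, 512, 256, 128, 64, 32, 1):
    lr, iters = scaled_schedule(bs)
    print(f"BS={bs}: lr={lr:.6g}, iters={iters // 1000}K")
```

For BS=1 this gives lr=3.90625e-05 and 81920K iterations, matching the last row of the table.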

From contributors

The base net is caffenet+BN+ReLU+drop=0.2. The differences are in the filters (mainly 5x5 -> 3x3 + 3x3 or 1x5+5x1) and in the solver.

| Name | Accuracy | LogLoss | Comments |
|---|---|---|---|
| Base | 0.527 | 2.09 | |
| Base_dereyly_lr, noBN, ReLU | 0.441 | 2.53 | max_iter=160K, stepsize=2K, gamma=0.915, but default caffenet |
| Base_dereyly 5x1, noBN, ReLU | 0.474 | 2.31 | 5x5->1x5+5x1 |
| Base_dereyly_PReLU | 0.550 | 1.93 | BN, PReLU + base_lr=0.035, exp lr_policy, 160K iters, 5x5->3x3+3x3 |
| Base_dereyly 3x1 | 0.553 | 1.92 | PReLU + base_lr=0.035, exp lr_policy, 160K iters, 5x5->1x3+1x3+3x1+1x3 |
| Base_dereyly 3x1 scale aug | 0.530 | 2.04 | Same as previous, img: 128 crop from (128...300)px image, test resize to 144, crop 128 |
| Base_dereyly 3x1 scale aug | 0.512 | 2.17 | Same as previous, img: 128 crop from (128...300)px image, test resize to (128+300)/2, crop 128 |
| Base_dereyly 3x1->5x1 | 0.546 | 1.97* | PReLU + base_lr=0.035, exp lr_policy, 160K iters, 5x5->1x5+1x5+5x1+1x5 |
| Base_dereyly 3x1,halfBN | 0.544 | 1.95 | PReLU + base_lr=0.035, exp lr_policy, 160K iters, 5x5->1x3+1x3+3x1+1x3, BN only for pool and fc6 |
| Base_dereyly 5x1 | 0.540 | 2.00 | PReLU + base_lr=0.035, exp lr_policy, 160K iters, 5x5->1x5+5x1 |
| DarkNetBN | 0.502 | 2.25 | 16C3->MP2->32C3->MP2->64C3->MP2->128C3->MP2->256C3->MP2->512C3->MP2->1024C3->1000CLF.BN + PReLU + base_lr=0.035, exp lr_policy, 160K iters |

Prototxt, logs

Residual experiments

| Name | Accuracy | LogLoss | Comments |
|---|---|---|---|
| VGG-Like | 0.521 | 2.14 | 1st layer = 7x7 stride 2, unlike VGG. All other layers = 1/2 VGG width |
| VGG-LikeRes | 0.576 | 1.83 | with residual connections, no BN |
| VGG-LikeResDrop | 0.568 | 1.91 | with residual connections, no BN, dropout in conv |

Prototxt, logs

Network width

| Name | Accuracy | LogLoss | Comments |
|---|---|---|---|
| 4sqrt(2)x wider | 0.565 | 1.96 | Starts overfitting |
| 4x wider | 0.563 | 1.92 | Still no overfitting %) |
| 2sqrt(2)x wider | 0.552 | 1.94 | |
| 2x wider | 0.533 | 2.04 | |
| sqrt(2)x wider | 0.506 | 2.17 | |
| Default | 0.471 | 2.36 | |
| sqrt(2)x narrower | 0.460 | 2.41 | |
| 2x narrower | 0.416 | 2.68 | |
| 2sqrt(2)x narrower | 0.340 | 3.11 | no group conv |
| 2sqrt(2)x narrower | 0.318 | 3.25 | |
| 4x narrower | 0.256 | 3.33 | |

logs

Dataset size

| Name | Accuracy | LogLoss | Comments |
|---|---|---|---|
| Default, 1.2M images | 0.471 | 2.36 | |
| 800K images | 0.438 | 2.54 | |
| 600K images | 0.425 | 2.63 | |
| 400K images | 0.393 | 2.92 | |
| 200K images | 0.305 | 4.04 | |

Dataset size, no RGB scaling

Or why input var=1 for LSUV is so important

| Name | Accuracy | LogLoss | Comments |
|---|---|---|---|
| 800K images | 0.438 | 2.54 | |
| 600K images | 0.425 | 2.63 | |
| 600K images, no scale | 0.379 | 2.92 | |
| 400K images | 0.393 | 2.92 | |
| 400K images, no scale | 0.357 | 3.10 | |
| 200K images | 0.305 | 4.04 | |
| 200K images, no scale | 0.277 | 4.06 | |

logs

Input image size

| Name | Accuracy | LogLoss | Comments |
|---|---|---|---|
| 64x64 | 0.309 | 3.34 | |
| 96x96 | 0.414 | 2.69 | |
| 128x128 | 0.471 | 2.36 | |
| 180x180 | 0.521 | 2.10 | |
| 224x224 | 0.565 | 1.87 | |
| 300x300 | 0.559 | 2.03 | In progress, results for 115K |

logs

Dataset quality

| Name | Accuracy | LogLoss | Comments |
|---|---|---|---|
| Default, clean labels | 0.471 | 2.36 | |
| 5% incorrect labels | 0.458 | 2.45 | |
| 10% incorrect labels | 0.447 | 2.58 | |
| 15% incorrect labels | 0.437 | 2.69 | |
| 50% incorrect labels | 0.347 | 3.44 | |

logs

Conv1 depth

| Name | Accuracy | LogLoss | Comments |
|---|---|---|---|
| Default, no 1x1 or 3x3 | 0.471 | 2.36 | conv1 -> pool1 |
| + 1x1x96 NiN | 0.490 | 2.24 | conv1 -> 96C1 -> pool1 |
| + 3x (1x1x96 NiN) | 0.509 | 2.10 | conv1 -> 3x(96C1) -> pool1 |
| + 5x (1x1x96 NiN) | 0.514 | 2.11 | conv1 -> 5x(96C1) -> pool1 |
| + 7x (1x1x96 NiN) | 0.514 | 2.11 | conv1 -> 7x(96C1) -> pool1 |
| + 9x (1x1x96 NiN) | 0.516 | 2.10 | conv1 -> 9x(96C1) -> pool1 |
| + 9x (1x1x96 NiN)R | 0.509 | 2.13 | conv1 -> Residual9x(96C1) -> pool1. 276K iters |
| + 1x (3x3x96 NiN) | 0.500 | 2.19 | conv1 -> 1x(96C3) -> pool1 |
| + 3x (3x3x96 NiN) | 0.538 | 1.99 | conv1 -> 3x(96C3) -> pool1 |
| + 5x (3x3x96 NiN) | 0.551 | 1.91 | conv1 -> 5x(96C3) -> pool1 |

logs

Other

ReLU non-linearity, fc6 and fc7 layer only

| Name | Accuracy | LogLoss | Comments |
|---|---|---|---|
| Default | 0.471 | 2.36 | bias lr_rate = 2x weights lr_rate |
| 1x | 0.470 | 2.37 | bias lr_rate = 1x weights lr_rate |
| 5x | 0.472 | 2.35 | bias lr_rate = 5x weights lr_rate |
| NoBias | 0.445 | 2.50 | Biases initialized with zeros, lr_rate = 0 |

Prototxt, logs

PRs with tests are welcome

P.S. Logs are merged from lots of "save-resume" runs, because the nets were trained at night, so a plot of "Anything vs. seconds" will give weird results.