kblomdahl/dream-go

MLP-Mixer: An all-MLP Architecture for Vision

kblomdahl opened this issue · 7 comments

https://arxiv.org/pdf/2105.01601.pdf

Convolutional Neural Networks (CNNs) are the go-to model for computer vision. Recently, attention-based networks, such as the Vision Transformer, have also become popular. In this paper we show that while convolutions and attention are both sufficient for good performance, neither of them are necessary. We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs). MLP-Mixer contains two types of layers: one with MLPs applied independently to image patches (i.e. "mixing" the per-location features), and one with MLPs applied across patches (i.e. "mixing" spatial information). When trained on large datasets, or with modern regularization schemes, MLP-Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models. We hope that these results spark further research beyond the realms of well established CNNs and Transformers.

https://arxiv.org/pdf/2106.01548.pdf

Vision Transformers (ViTs) and MLPs signal further efforts on replacing hand-wired features or inductive biases with general-purpose neural architectures. Existing works empower the models by massive data, such as large-scale pretraining and/or repeated strong data augmentations, and still report optimization-related problems (e.g., sensitivity to initialization and learning rate). Hence, this paper investigates ViTs and MLP-Mixers from the lens of loss geometry, intending to improve the models' data efficiency at training and generalization at inference. Visualization and Hessian reveal extremely sharp local minima of converged models. By promoting smoothness with a recently proposed sharpness-aware optimizer, we substantially improve the accuracy and robustness of ViTs and MLP-Mixers on various tasks spanning supervised, adversarial, contrastive, and transfer learning (e.g., +5.3% and +11.0% top-1 accuracy on ImageNet for ViT-B/16 and Mixer-B/16, respectively, with the simple Inception-style preprocessing). We show that the improved smoothness attributes to sparser active neurons in the first few layers. The resultant ViTs outperform ResNets of similar size and throughput when trained from scratch on ImageNet without large-scale pretraining or strong data augmentations. They also possess more perceptive attention maps.
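For reference, the sharpness-aware minimization (SAM) step that the second paper relies on boils down to a two-step gradient computation. Below is a minimal sketch of that update, assuming a generic Keras `model`, `loss_fn` and `optimizer`; none of the names come from dream-go, and details such as gradient scaling and m-sharpness are skipped:

```python
import tensorflow as tf

def sam_train_step(model, optimizer, loss_fn, x, y, rho=0.05):
    # 1. Gradient at the current weights w.
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)

    # 2. Perturb the weights towards the sharpest nearby point: w + rho * g / ||g||.
    norm = tf.linalg.global_norm(grads)
    eps = [rho * g / (norm + 1e-12) for g in grads]
    for v, e in zip(model.trainable_variables, eps):
        v.assign_add(e)

    # 3. Gradient at the perturbed weights, then undo the perturbation and
    #    apply that gradient to the original weights.
    with tf.GradientTape() as tape:
        loss_adv = loss_fn(y, model(x, training=True))
    grads_adv = tape.gradient(loss_adv, model.trainable_variables)
    for v, e in zip(model.trainable_variables, eps):
        v.assign_sub(e)
    optimizer.apply_gradients(zip(grads_adv, model.trainable_variables))
    return loss
```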

[image: example training graphs for the initial MLP-Mixer run and the current ResNet model]

Initial results with a pure MLP-Mixer (without SAM) and the following parameters are not very promising (a sketch of a single mixer block with these dimensions follows the list). We also had to decrease the batch size to 128 in order to fit into the memory of our GPUs:

  • embeddings_size is 722
  • tokens_mlp_dims is 256
  • channels_mlp_dims is 2048
  • num_blocks is 9
  • tile_size is 5 (with overlapping tiles, for a total of 49 tiles)
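For reference, a minimal NumPy sketch of a single mixer block with the dimensions above (49 tokens from the overlapping 5x5 tiles, embeddings_size 722); the layer norm, the initialisation and the ReLU standing in for GELU are simplifications for illustration, not the actual training code:

```python
import numpy as np

TOKENS, EMBED, TOKENS_MLP, CHANNELS_MLP = 49, 722, 256, 2048

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def mlp(x, w1, w2):
    return np.maximum(x @ w1, 0.0) @ w2  # ReLU stands in for GELU

def mixer_block(x, p):
    # Token mixing: the MLP acts across the 49 tokens, independently per channel.
    t = layer_norm(x).transpose(0, 2, 1)                         # (batch, channels, tokens)
    x = x + mlp(t, p['tok_w1'], p['tok_w2']).transpose(0, 2, 1)
    # Channel mixing: the MLP acts across the 722 channels, independently per token.
    return x + mlp(layer_norm(x), p['ch_w1'], p['ch_w2'])

rng = np.random.default_rng(0)
params = {
    'tok_w1': rng.normal(0, 0.02, (TOKENS, TOKENS_MLP)),
    'tok_w2': rng.normal(0, 0.02, (TOKENS_MLP, TOKENS)),
    'ch_w1':  rng.normal(0, 0.02, (EMBED, CHANNELS_MLP)),
    'ch_w2':  rng.normal(0, 0.02, (CHANNELS_MLP, EMBED)),
}
x = rng.normal(size=(1, TOKENS, EMBED))       # (batch, tokens, channels)
print(mixer_block(x, params).shape)           # (1, 49, 722); stack num_blocks = 9 of these
```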

Analysis

Based on the initial attempts we see several challenges we need to address in the architecture:

  • The embeddings_size needs to be very big due to the global average pooling used to get the intermediate representation (see the sketch after this list). This causes runtime performance and memory problems.
  • The value head does not seem to converge despite the ownership head converging relatively well.
  • It is not clear how tile_size should be tuned; we need to experiment.
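To illustrate the first point: the intermediate representation handed onwards is the single embeddings_size-wide vector left after averaging over the tokens, so that width is the only thing controlling how much of the board survives the pooling. A toy illustration (names and shapes are made up):

```python
import numpy as np

tokens = np.random.default_rng(0).normal(size=(1, 49, 722))  # mixer trunk output
pooled = tokens.mean(axis=1)                                  # global average pool -> (1, 722)
value_w = np.zeros((722, 1))                                  # a hypothetical value head
print(pooled.shape, (pooled @ value_w).shape)                 # (1, 722) (1, 1)
```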

Number of parameters

Another problem is that MLP-Mixer models are huge in comparison to the ResNet models currently deployed (and do not produce comparable results). The example graphs at the beginning of this post use an MLP-Mixer model that is approximately 11 times bigger than the ResNet model.

| # | ResNet | MLP-Mixer |
| --- | --- | --- |
| 9 | 9 · (2 · filter_size · filter_size · num_channels · num_channels) = 2,654,208 | 9 · (2 · embeddings_size · tokens_mlp_dims + 2 · embeddings_size · channels_mlp_dims) = 29,942,784 |
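The two counts can be reproduced directly from the formulas in the table, assuming filter_size 3 and num_channels 128 for the residual blocks (the values that yield the 2,654,208 figure); biases and normalisation parameters are ignored:

```python
def residual_block_params(filter_size=3, num_channels=128):
    # two filter_size x filter_size convolutions per residual block
    return 2 * filter_size * filter_size * num_channels * num_channels

def mixer_block_params(embeddings_size=722, tokens_mlp_dims=256, channels_mlp_dims=2048):
    # token-mixing and channel-mixing MLPs, using the formula from the table above
    return 2 * embeddings_size * tokens_mlp_dims + 2 * embeddings_size * channels_mlp_dims

print(9 * residual_block_params())  # 2,654,208
print(9 * mixer_block_params())     # 29,942,784
```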

See Figure 1 for the final measurements for different tile sizes, using a stride of tile_size // 2 + 1 (the token counts this stride implies are sketched after the table). The results are the same as before: pretty bad. Oddly enough, the ownership head has a much higher accuracy than the value head, but that does not transfer to the value head for some reason:

  • embeddings_size is 722
  • tokens_mlp_dims is 361
  • channels_mlp_dims is 1444
  • num_blocks is 6
| tile_size | 3 | 5 | 7 | 9 | 13 | 15 |
| --- | --- | --- | --- | --- | --- | --- |
| Policy (Top 1) | 16.5% | 18.5% | 27.5% | 27.1% | 16.6% | 21.3% |
| Policy (Top 3) | 30.1% | 33.7% | 46.0% | 46.7% | 30.5% | 38.2% |
| Policy (Top 5) | 40.0% | 41.8% | 55.3% | 55.7% | 39.4% | 46.9% |
| Value | 53.0% | 55.0% | 53.7% | 54.2% | 54.3% | 52.6% |
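For context, this is roughly how many tokens each tile_size produces on a 19x19 board with a stride of tile_size // 2 + 1, assuming "SAME"-style padding so the per-axis count is ceil(19 / stride); that assumption reproduces the 49 tiles quoted earlier for tile_size 5, but the other counts are extrapolations from it:

```python
import math

BOARD = 19
for tile_size in (3, 5, 7, 9, 13, 15):
    stride = tile_size // 2 + 1
    per_axis = math.ceil(BOARD / stride)
    print(f"tile_size={tile_size:2d} stride={stride} tokens={per_axis * per_axis}")
```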

I also did some experiments with setting the stride to 1, but that did not show any better performance.

An alternative approach, inspired by Axial-SWideRNet [1], would be to add just a single block at the end of the convolutional residual network. This would serve as a global attention block just before the value and policy heads.

The main drawback of MLP-Mixer blocks is that they require a lot of parameters. If we configure such a global attention block to keep the same dimensions as a comparable residual block:

  • embeddings_size is 128 (generally referred to as num_channels in our existing code base)
  • tokens_mlp_dims is 361
  • channels_mlp_dims is 128

This yields a parameter count of 125,184 (compared to 294,912 for a single residual block). This should be feasible to add at the end of the convolutional stem, or alternatively after every n residual blocks. A rough sketch of the stem-end variant follows.
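A rough Keras sketch of that variant, following the block layout from the MLP-Mixer paper; the one-layer stem, the number of input feature planes and the layer names are stand-ins rather than dream-go's actual graph, so the exact parameter split will not match the back-of-the-envelope count above:

```python
import tensorflow as tf
from tensorflow.keras import layers

def global_mixer_block(x, tokens_mlp_dims=361, channels_mlp_dims=128):
    """One mixer block over the 361 board vertices with 128 channels."""
    y = layers.Permute((2, 1))(layers.LayerNormalization()(x))       # (batch, channels, tokens)
    y = layers.Dense(tokens_mlp_dims, activation='relu')(y)
    y = layers.Permute((2, 1))(layers.Dense(x.shape[1])(y))
    x = layers.Add()([x, y])                                          # token mixing across the board
    y = layers.LayerNormalization()(x)
    y = layers.Dense(channels_mlp_dims, activation='relu')(y)
    y = layers.Dense(x.shape[-1])(y)
    return layers.Add()([x, y])                                       # channel mixing per vertex

planes = layers.Input(shape=(19, 19, 32))                             # input feature planes (count assumed)
x = layers.Conv2D(128, 3, padding='same', activation='relu')(planes)  # stand-in for the conv stem
x = layers.Reshape((361, 128))(x)                                     # 19x19 vertices become 361 tokens
x = global_mixer_block(x)                                             # the single global mixing block
model = tf.keras.Model(planes, x)
model.summary()
```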

[1] Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation, https://arxiv.org/abs/2003.07853

As expected, adding a mixing block improves the accuracy very slightly, but at the cost of parameters. A more "fair" comparison would be a stem with (294,912 · 9) / (294,912 + 125,184) ≈ 6 residual blocks:

| | ref | at end of stem | every 1 residual block | every 1 with 6 residual blocks |
| --- | --- | --- | --- | --- |
| Policy (Top 1) | 53.1% | 51.8% | 53.1% | 52.6% |
| Policy (Top 3) | 76.9% | 75.9% | 77.3% | 76.6% |
| Policy (Top 5) | 85.3% | 84.3% | 85.6% | 85.0% |
| Value | 67.4% | 66.8% | 64.2% | 65.3% |

The main anomaly is the every 1 residual block value accuracy, which for some reason does not show the same performance at evaluation as it does during training, where it reaches a huge 70.7% accuracy vs a reference training accuracy of 67.3%.

Overall, adding MLP-Mixer blocks to the stem shows good results, but there is a strong anomaly with the value head: every metric except it converges properly. Based on the training accuracy this is probably due to overfitting. Pink is the training accuracy, and green is the evaluation accuracy:

[image: value head accuracy during training (pink) vs evaluation (green)]

From a brief performance benchmark of a cuDNN implementation of a single MLP-Mixer block, as well as of bottleneck blocks [1], compared to a standard residual block, we see that the runtime performance is pretty similar. We might be able to squeeze out some additional performance for the mixer blocks by using a dedicated BLAS library for the dense layers instead of mimicking matrix multiplication with a 1x1 convolution (see the snippet after the benchmark numbers):

```
test layers::bottleneck_block::tests::bottleneck_block ... bench:     338,165 ns/iter (+/- 15,251)
test layers::mixer_block::tests::mixer_block           ... bench:     425,699 ns/iter (+/- 20,147)
test layers::residual_block::tests::residual_block     ... bench:     480,611 ns/iter (+/- 114,238)
```
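The BLAS remark is based on the fact that a 1x1 convolution over an NCHW tensor is just a matrix multiplication over the channel dimension; a small NumPy check of that equivalence (sizes picked arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 128, 19, 19))        # NCHW activation
w = rng.normal(size=(256, 128))              # 1x1 conv kernel == dense weight matrix

conv_1x1 = np.einsum('oc,nchw->nohw', w, x)  # what a 1x1 convolution computes
gemm = (w @ x.reshape(1, 128, 361)).reshape(1, 256, 19, 19)  # a single GEMM instead
print(np.allclose(conv_1x1, gemm))           # True
```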

These numbers give the following comparative performance when put together into a full stem. So far the bottleneck architecture seems to perform worse than the others (private run), but a 5x stem of residual and mixer blocks might be attractive:

| Configuration | Runtime (ns) |
| --- | --- |
| 9x residual | 4,325,499 |
| 5x residual + mixer | 4,531,550 |
| 6x residual + mixer | 5,437,860 |
| 6x bottleneck + mixer | 4,583,184 |
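As a sanity check, the stem runtimes above are just sums of the per-block benchmarks, assuming each "Nx ... + mixer" configuration repeats the block pair N times:

```python
residual, mixer, bottleneck = 480_611, 425_699, 338_165  # ns/iter from the benchmark above

print(9 * residual)              # 4,325,499  -> 9x residual
print(5 * (residual + mixer))    # 4,531,550  -> 5x residual + mixer
print(6 * (residual + mixer))    # 5,437,860  -> 6x residual + mixer
print(6 * (bottleneck + mixer))  # 4,583,184  -> 6x bottleneck + mixer
```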

Not sure what changed, but in my latest runs with combined residual and MLP-Mixer blocks I don't see any meaningful improvement anymore. I've tried a few different combinations to see if we get any different results, where R represents a residual block and M an MLP-Mixer block:

| Configuration | Policy (%) | Value (%) |
| --- | --- | --- |
| RMRMRMRMR | 52.8% | 65.5% |
| RRMRRMRRM | 52.8% | 66.1% |
| MMMMMMRRR | 51.3% | 63.8% |
| RRRRRRMMM | 53.2% | 67.4% |

These results seem to agree with the previously mentioned Axial-SWideRNet [1], in that it is mostly beneficial to put the transformer blocks at the end, but we need more than a single MLP-Mixer block to get the results we want. In fact, it might be better to keep the stem entirely convolutional and push the MLP-Mixers into the value and policy heads (see the sketch after the list below), since:

  1. Doing so allows for better runtime performance, since we can down-sample the embedding size beforehand.
  2. It allows the network to specialize each mixer for its respective head (which may have negative effects in terms of regularization).
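A hedged sketch of what that head layout could look like: the stem stays fully convolutional, the embedding is down-sampled with a cheap 1x1 convolution (point 1), and each head gets its own small mixer (point 2). The 32-channel down-sample and the head shapes are made-up numbers for illustration only, not dream-go's actual heads:

```python
import tensorflow as tf
from tensorflow.keras import layers

def head_mixer(x):
    # token mixing over the 361 vertices, then channel mixing over 32 channels
    y = layers.Permute((2, 1))(layers.LayerNormalization()(x))
    y = layers.Permute((2, 1))(layers.Dense(361)(layers.Dense(361, activation='relu')(y)))
    x = layers.Add()([x, y])
    y = layers.Dense(32)(layers.Dense(32, activation='relu')(layers.LayerNormalization()(x)))
    return layers.Add()([x, y])

stem_out = layers.Input(shape=(19, 19, 128))            # output of the convolutional stem
x = layers.Conv2D(32, 1, activation='relu')(stem_out)   # down-sample the embedding first (1.)
x = layers.Reshape((361, 32))(x)                        # 361 board vertices as tokens

value = layers.Dense(1, activation='tanh')(layers.GlobalAveragePooling1D()(head_mixer(x)))
policy = layers.Dense(362)(layers.Flatten()(head_mixer(x)))  # a separate mixer per head (2.)
heads = tf.keras.Model(stem_out, [value, policy])
heads.summary()
```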

[1] Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation, https://arxiv.org/abs/2003.07853