MLP-Mixer: An all-MLP Architecture for Vision
kblomdahl opened this issue · 7 comments
https://arxiv.org/pdf/2105.01601.pdf
Convolutional Neural Networks (CNNs) are the go-to model for computer vision. Recently, attention-based networks, such as the Vision Transformer, have also become popular. In this paper we show that while convolutions and attention are both sufficient for good performance, neither of them are necessary. We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs). MLP-Mixer contains two types of layers: one with MLPs applied independently to image patches (i.e. "mixing" the per-location features), and one with MLPs applied across patches (i.e. "mixing" spatial information). When trained on large datasets, or with modern regularization schemes, MLP-Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models. We hope that these results spark further research beyond the realms of well established CNNs and Transformers.
https://arxiv.org/pdf/2106.01548.pdf
Vision Transformers (ViTs) and MLPs signal further efforts on replacing hand-wired features or inductive biases with general-purpose neural architectures. Existing works empower the models by massive data, such as large-scale pretraining and/or repeated strong data augmentations, and still report optimization-related problems (e.g., sensitivity to initialization and learning rate). Hence, this paper investigates ViTs and MLP-Mixers from the lens of loss geometry, intending to improve the models' data efficiency at training and generalization at inference. Visualization and Hessian reveal extremely sharp local minima of converged models. By promoting smoothness with a recently proposed sharpness-aware optimizer, we substantially improve the accuracy and robustness of ViTs and MLP-Mixers on various tasks spanning supervised, adversarial, contrastive, and transfer learning (e.g., +5.3% and +11.0% top-1 accuracy on ImageNet for ViT-B/16 and Mixer-B/16, respectively, with the simple Inception-style preprocessing). We show that the improved smoothness attributes to sparser active neurons in the first few layers. The resultant ViTs outperform ResNets of similar size and throughput when trained from scratch on ImageNet without large-scale pretraining or strong data augmentations. They also possess more perceptive attention maps.
Initial results with a pure MLP-Mixer (without SAM) and the following parameters are not very promising. We also had to decrease the batch size to 128 in order to fit in the memory of our GPUs:

- `embeddings_size` is 722
- `tokens_mlp_dims` is 256
- `channels_mlp_dims` is 2048
- `num_blocks` is 9
- `tile_size` is 5 (with overlapping tiles, for a total of 49 tiles)
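For reference, a minimal Keras-style sketch of an MLP-Mixer block with these hyper-parameters is shown below; the layer choices and the exact normalization placement follow the paper and are assumptions, not necessarily identical to our actual implementation:

```python
import tensorflow as tf
from tensorflow.keras import layers

def mlp(x, hidden_dim, out_dim):
    # Simple two-layer MLP with GELU, as in the MLP-Mixer paper.
    x = layers.Dense(hidden_dim, activation="gelu")(x)
    return layers.Dense(out_dim)(x)

def mixer_block(x, tokens_mlp_dims, channels_mlp_dims):
    # x has shape [batch, num_tokens, embeddings_size]
    num_tokens = x.shape[1]
    embeddings_size = x.shape[2]

    # Token mixing: MLP applied across the token (spatial) dimension.
    y = layers.LayerNormalization()(x)
    y = layers.Permute((2, 1))(y)          # [batch, embeddings_size, num_tokens]
    y = mlp(y, tokens_mlp_dims, num_tokens)
    y = layers.Permute((2, 1))(y)          # back to [batch, num_tokens, embeddings_size]
    x = x + y

    # Channel mixing: MLP applied independently to each token.
    y = layers.LayerNormalization()(x)
    y = mlp(y, channels_mlp_dims, embeddings_size)
    return x + y

# Hypothetical usage with the parameters above: 49 tiles, embeddings_size 722.
inputs = tf.keras.Input(shape=(49, 722))
outputs = inputs
for _ in range(9):                          # num_blocks is 9
    outputs = mixer_block(outputs, tokens_mlp_dims=256, channels_mlp_dims=2048)
```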
Analysis
Based on the initial attempts we see several challenges we need to address in the architecture:

- The `embeddings_size` needs to be very large due to the global average pooling used to produce the intermediate representation (see the sketch after this list). This causes runtime performance and memory problems.
- The value head does not seem to converge, despite the ownership head converging relatively well.
- It is not clear how `tile_size` should be tuned; we need to experiment.
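To illustrate the first point, here is a rough sketch of the pooling bottleneck; the head shapes (e.g. the 362-way policy output for a 19×19 board plus pass) are assumptions for illustration only:

```python
import tensorflow as tf
from tensorflow.keras import layers

# After the mixer trunk the representation is [batch, num_tokens, embeddings_size],
# e.g. [batch, 49, 722] with the parameters above.
trunk = tf.keras.Input(shape=(49, 722))

# Global average pooling collapses the token (spatial) dimension, so the single
# remaining vector of size `embeddings_size` has to carry *all* spatial
# information needed by the heads.
pooled = layers.GlobalAveragePooling1D()(trunk)           # [batch, 722]

# The heads must then reconstruct per-intersection outputs from that one vector,
# which is why `embeddings_size` ends up so large.
policy = layers.Dense(362)(pooled)                        # 19 * 19 moves + pass (assumed)
value = layers.Dense(1, activation="tanh")(pooled)        # assumed value head shape
ownership = layers.Dense(361, activation="tanh")(pooled)  # per-intersection ownership (assumed)
```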
Number of parameters
Another problem is that MLP-Mixer models are huge in comparison to the ResNet models currently deployed (and do not produce comparable results). The example graphs at the beginning of this post use an MLP-Mixer model that is approximately 11 times bigger than the ResNet model.
# blocks | ResNet | MLP-Mixer |
---|---|---|
9 | 9 · (2 · filter_size · filter_size · num_channels · num_channels) = 2,654,208 | 9 · (2 · embeddings_size · tokens_mlp_dims + 2 · embeddings_size · channels_mlp_dims) = 29,942,784 |
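The two counts in the table can be reproduced with a quick back-of-the-envelope script (weights only, using the formulas above; `filter_size` 3 and `num_channels` 128 are the values that give the 294,912-per-block figure quoted further down):

```python
# Parameter counts per 9-block stem, weights only (biases and normalization omitted).
filter_size, num_channels = 3, 128
embeddings_size, tokens_mlp_dims, channels_mlp_dims = 722, 256, 2048

resnet = 9 * (2 * filter_size * filter_size * num_channels * num_channels)
mixer = 9 * (2 * embeddings_size * tokens_mlp_dims
             + 2 * embeddings_size * channels_mlp_dims)

print(resnet)          # 2,654,208
print(mixer)           # 29,942,784
print(mixer / resnet)  # ~11.3, the "11 times bigger" figure above
```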
See Figure 1 for the final measurements for different tile sizes, when using a stride of `tile_size // 2 + 1`. Results are the same as before, pretty damn bad. Oddly enough the ownership head has a much higher accuracy than the value head, but it does not transfer for some reason:
- `embeddings_size` is 722
- `tokens_mlp_dims` is 361
- `channels_mlp_dims` is 1444
- `num_blocks` is 6
tile_size | 3 | 5 | 7 | 9 | 13 | 15 |
---|---|---|---|---|---|---|
Policy (Top 1) | 16.5% | 18.5% | 27.5% | 27.1% | 16.6% | 21.3% |
Policy (Top 3) | 30.1% | 33.7% | 46.0% | 46.7% | 30.5% | 38.2% |
Policy (Top 5) | 40.0% | 41.8% | 55.3% | 55.7% | 39.4% | 46.9% |
Value | 53.0% | 55.0% | 53.7% | 54.2% | 54.3% | 52.6% |
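As a sanity check on the tiling itself, the number of tokens the mixer sees follows directly from `tile_size` and the stride. Assuming a 19×19 board and SAME-style padding (an assumption on my part), this reproduces the 49 overlapping tiles quoted above for `tile_size` 5:

```python
import math

BOARD_SIZE = 19  # assuming a 19x19 Go board

for tile_size in (3, 5, 7, 9, 13, 15):
    stride = tile_size // 2 + 1
    # With SAME-style padding the number of tile positions per axis is ceil(19 / stride).
    tiles_per_axis = math.ceil(BOARD_SIZE / stride)
    print(tile_size, stride, tiles_per_axis ** 2)
    # tile_size 5 -> stride 3 -> 7 x 7 = 49 overlapping tiles, as above.
```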
I also did some experiments with setting the stride to `1`, but that did not show any better performance.
An alternative approach, inspired by Axial-SWideRNet [1], would be to just add a single block at the end of the convolutional residual network. This would serve as a global attention block just before the value and policy head.
The main drawback of MLP-Mixer blocks is that they require a lot of parameters when such a global mixing block is sized to keep the same dimensions as a similar residual block:
- `embeddings_size` is 128 (generally referred to as `num_channels` in our existing code base)
- `tokens_mlp_dims` is 361
- `channels_mlp_dims` is 128
This yields a parameter count of 125,184 (compared to 294,912 for a single residual block). This should be feasible to add at the end of the convolutional stem, or alternatively after every `n` residual blocks.
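A rough sketch of what this would look like, reusing the `mixer_block` sketched earlier in this thread; treating every board intersection as a token is an assumption about how the stem output would be fed to the mixer:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Output of the convolutional stem: [batch, 19, 19, 128] with num_channels = 128.
stem_output = tf.keras.Input(shape=(19, 19, 128))

# Treat every intersection as a token: 19 * 19 = 361 tokens of size 128, run a
# single global mixing block over them, then restore the spatial layout so the
# existing value and policy heads can stay unchanged.
tokens = layers.Reshape((361, 128))(stem_output)
tokens = mixer_block(tokens, tokens_mlp_dims=361, channels_mlp_dims=128)
mixed = layers.Reshape((19, 19, 128))(tokens)
```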
[1] Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation, https://arxiv.org/abs/2003.07853
As expected, adding a mixing block improves the accuracy very slightly, but at the cost of parameters. A more "fair" comparison would be a stem with (294912 * 9) / (294912 + 125184) ≈ 6 residual blocks:
 | ref | at end of stem | every 1 residual block | every 1 with 6 residual blocks |
---|---|---|---|---|
Policy (Top 1) | 53.1% | 51.8% | 53.1% | 52.6% |
Policy (Top 3) | 76.9% | 75.9% | 77.3% | 76.6% |
Policy (Top 5) | 85.3% | 84.3% | 85.6% | 85.0% |
Value | 67.4% | 66.8% | 64.2% | 65.3% |
The main anomaly is the value head in the `every 1 residual block` configuration, which for some reason does not show the same performance during evaluation as it does during training, where it reaches a huge 70.7% accuracy vs a reference accuracy of 67.3%.
Overall adding MLP-Mixer blocks to the stem shows good results, but there is a strong anomaly with the value head: all other metrics converge properly, but it does not. Based on the training accuracy this is probably due to overfitting. Pink is the training accuracy, and green is the evaluation accuracy:
A brief performance benchmark of a cuDNN implementation of a single MLP-Mixer block, and of bottleneck blocks [1], compared to a standard residual block shows that their runtime performance is pretty similar. We might be able to squeeze out some additional performance for the mixer blocks by using a dedicated BLAS library for the dense layers instead of mimicking matrix multiplication with a 1x1 convolution:
test layers::bottleneck_block::tests::bottleneck_block ... bench: 338,165 ns/iter (+/- 15,251)
test layers::mixer_block::tests::mixer_block ... bench: 425,699 ns/iter (+/- 20,147)
test layers::residual_block::tests::residual_block ... bench: 480,611 ns/iter (+/- 114,238)
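The BLAS remark is just the observation that a dense layer applied per token via a 1×1 convolution computes the same thing as one big matrix multiplication, so it could be handed to a GEMM routine directly. A small numpy sketch of the equivalence (shapes are illustrative only, not taken from the benchmark):

```python
import numpy as np

batch, tokens, channels, out_channels = 8, 361, 128, 128
x = np.random.randn(batch, tokens, channels).astype(np.float32)
w = np.random.randn(channels, out_channels).astype(np.float32)

# Dense layer expressed as a 1x1 convolution: the same weights applied to each
# token position independently, emulated here token by token.
conv_1x1 = np.stack([x[:, t, :] @ w for t in range(tokens)], axis=1)

# The same computation as one GEMM over the flattened [batch * tokens, channels]
# matrix, which is what a dedicated BLAS call would do.
gemm = (x.reshape(-1, channels) @ w).reshape(batch, tokens, out_channels)

assert np.allclose(conv_1x1, gemm, atol=1e-4)
```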
Putting these per-block numbers together into a full stem gives the following comparative performance. So far the bottleneck architecture seems to perform worse than the others (private run), but a `5 × (residual + mixer)` stem might be attractive.
Configuration | Runtime (ns) |
---|---|
`9 × residual` | 4,325,499 |
`5 × (residual + mixer)` | 4,531,550 |
`6 × (residual + mixer)` | 5,437,860 |
`6 × (bottleneck + mixer)` | 4,583,184 |
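For reference, these totals are just the per-block benchmark numbers above composed per configuration:

```python
# Per-block runtimes from the benchmark above, in nanoseconds.
residual, mixer, bottleneck = 480_611, 425_699, 338_165

configs = {
    "9 x residual": 9 * residual,
    "5 x (residual + mixer)": 5 * (residual + mixer),
    "6 x (residual + mixer)": 6 * (residual + mixer),
    "6 x (bottleneck + mixer)": 6 * (bottleneck + mixer),
}

for name, runtime in configs.items():
    print(f"{name}: {runtime:,} ns")
```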
Not sure what changed, but in my latest runs with combined residual and MLP-Mixer blocks I don't see any meaningful improvement anymore. I've tried a few different combinations to see if we get any different results, where `R` represents a residual block and `M` an MLP-Mixer block:
Configuration | Policy (%) | Value (%) |
---|---|---|
`RMRMRMRMR` | 52.8% | 65.5% |
`RRMRRMRRM` | 52.8% | 66.1% |
`MMMMMMRRR` | 51.3% | 63.8% |
`RRRRRRMMM` | 53.2% | 67.4% |
These results seem to agree with the previously mentioned Axial-SWideRNet [1], in that it is mostly beneficial to put the transformer blocks at the end, but we need more than a single MLP-Mixer block to get the results we want. In fact it might be better to keep the stem entirely convolutional and push the MLP-Mixers into the value and policy heads (see the sketch below), since:

- Doing so allows for better runtime performance, as we can down-sample the embedding size beforehand.
- It allows the network to specialize each mixer for its respective head (which may have negative effects in terms of regularization).
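A minimal sketch of that direction, again reusing the `mixer_block` from earlier; the down-sampling to 32 channels and the head shapes are assumptions for illustration, not a tested configuration:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Output of a purely convolutional stem: [batch, 19, 19, 128].
stem_output = tf.keras.Input(shape=(19, 19, 128))

def head_with_mixer(x, downsampled_channels, mixer_hidden):
    # Down-sample the embedding size first so the mixer block stays cheap,
    # then mix globally over the 361 tokens inside the head itself.
    x = layers.Conv2D(downsampled_channels, 1)(x)
    x = layers.Reshape((361, downsampled_channels))(x)
    return mixer_block(x, tokens_mlp_dims=361, channels_mlp_dims=mixer_hidden)

# Each head gets its own mixer, so they can specialize independently.
policy_tokens = head_with_mixer(stem_output, downsampled_channels=32, mixer_hidden=32)
policy = layers.Dense(362)(layers.Flatten()(policy_tokens))  # 361 moves + pass (assumed)

value_tokens = head_with_mixer(stem_output, downsampled_channels=32, mixer_hidden=32)
value = layers.Dense(1, activation="tanh")(layers.GlobalAveragePooling1D()(value_tokens))
```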
[1] Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation, https://arxiv.org/abs/2003.07853