Mamba-360: Survey of State Space Models as Transformer Alternative for Long Sequence Modelling: Methods, Applications, and Challenges

Mamba-360 is a collection of State Space Models (SSMs) across various domains.


Mamba-360: Survey of State Space Models as Transformer Alternative for Long Sequences

  • Transformers have dominated sequence modeling tasks like Machine Translation, Named Entity Recognition (NER), etc., but they suffer from $O(N^2)$ attention complexity and challenges in handling inductive bias for long sequences.

  • State Space Models (SSMs) have emerged as promising alternatives for sequence modeling, especially with the advent of S4 and its variants, such as S4ND, HiPPO, Hyena, Diagonal State Spaces (DSS), Gated State Spaces (GSS), Linear Recurrent Unit (LRU), Liquid-S4, Long-Conv, Mega, HGRN, TNN, S5, Mamba, Mamba-former, Mamba-MoE, etc.

  • In this survey, we categorize foundational SSMs based on three paradigms: Structural architectures, Gating architectures, and Recurrent architectures.

Contents

  • The survey highlights crucial applications of SSMs for Long Sequence context across various domains, including:

    • Natural Language Processing (NLP)
    • Computer Vision
    • Medical Domain (including genomics, drug design)
    • Video Domain
    • Audio and Speech
    • Multi-Modal Domain
    • Tabular Domain
    • Recommendation System
    • Graph Domain
    • Time Series forecasting
  • The work consolidates the performance of SSMs on benchmark datasets in:

    • Text Domain: Long Range Arena (LRA), WikiText, GLUE, Pile
    • Vision Domain: ImageNet1K
    • Video Dataset: Kinetics-400, SSTv2, Breakfast, COIN, LVU
    • Time Series Datasets: Electricity, Weather, Traffic, ETTh1, ETTh2, ETTm1, and ETTm2.

A Survey of State Space Models as Transformer Alternative for Long Sequence Modelling: Methods, Applications, and Challenges

Badri N. Patro, Vijay S. Agneeswaran Microsoft

@article{patro2024mamba,
  title={Mamba-360: Survey of State Space Models as Transformer Alternative for Long Sequence Modelling: Methods, Applications, and Challenges},
  author={Patro, Badri Narayana and Agneeswaran, Vijay Srinivas},
  journal={arXiv preprint arXiv:2404.16112},
  year={2024}
}

Advanced State Space Models

[Figure: Mamba 360]

SSMs for Various Applications

[Figure: SSM applications]

Architectural Evolution

[Figure: Architectural evolution]

Basics of SSMs

[Figure: basic SSM model]
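
A minimal sketch of the standard linear SSM formulation used by S4-style models, assuming a scalar state for simplicity: the continuous system $x'(t) = a\,x(t) + b\,u(t)$, $y(t) = c\,x(t)$ is discretized with a zero-order hold and step size $\Delta$, then evaluated either as a recurrence or as a convolution. The function names and parameter values are illustrative assumptions, not taken from the survey.

```python
import math


def discretize(a, b, delta):
    """Zero-order-hold discretization of x'(t) = a*x(t) + b*u(t) (scalar state, a != 0)."""
    a_bar = math.exp(delta * a)
    b_bar = (a_bar - 1.0) / a * b
    return a_bar, b_bar


def ssm_recurrent(u, a_bar, b_bar, c):
    """Recurrent view: x_k = a_bar*x_{k-1} + b_bar*u_k, y_k = c*x_k (one pass over the input)."""
    x, ys = 0.0, []
    for u_k in u:
        x = a_bar * x + b_bar * u_k
        ys.append(c * x)
    return ys


def ssm_convolution(u, a_bar, b_bar, c):
    """Convolutional view: y = K * u with kernel K_j = c * a_bar**j * b_bar (naive evaluation)."""
    kernel = [c * (a_bar ** j) * b_bar for j in range(len(u))]
    return [sum(kernel[j] * u[k - j] for j in range(k + 1)) for k in range(len(u))]


# Hypothetical parameters and input, chosen only for illustration.
a_bar, b_bar = discretize(a=-0.5, b=1.0, delta=0.1)
u = [1.0, 0.0, 2.0, -1.0]
y_rec = ssm_recurrent(u, a_bar, b_bar, c=1.0)
y_conv = ssm_convolution(u, a_bar, b_bar, c=1.0)
```

Both views produce the same output sequence. The recurrence costs one state update per step, while practical implementations of the convolutional view rely on FFTs or scans rather than the naive double loop shown here.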

SSM SOTA on ImageNet-1K dataset with Image size 224 x 224.

Table: SSM SOTA on ImageNet-1K. This table shows the performance of various SSM models for image recognition tasks on the ImageNet-1K dataset (Deng et al., 2009). Models are grouped into three categories based on their GFLOPs. This table is adapted from the original source.

Method Image Size #Param. FLOPs Top-1 acc.
HyenaViT-B 224^2 88M - 78.5
S4ND-ViT-B 224^2 89M - 80.4
TNN-T - 6.4M - 72.29
TNN-S - 23.4M - 79.20
Vim-Ti 224^2 7M - 76.1
Vim-S 224^2 26M - 80.5
HGRN-T - 6.1M - 74.40
HGRN-S - 23.7M - 80.09
PlainMamba-L1 224^2 7M 3.0G 77.9
PlainMamba-L2 224^2 25M 8.1G 81.6
PlainMamba-L3 224^2 50M 14.4G 82.3
Mamba-2D-S 224^2 24M - 81.7
Mamba-2D-B 224^2 92M - 83.0
VMamba-T 224^2 22M 5.6G 82.2
VMamba-S 224^2 44M 11.2G 83.5
VMamba-B 224^2 75M 18.0G 83.2
LocalVMamba-T 224^2 26M 5.7G 82.7
LocalVMamba-S 224^2 50M 11.4G 83.7
SiMBA-S(Monarch) 224^2 18.5M 3.6G 81.1
SiMBA-B(Monarch) 224^2 26.9M 6.3G 82.6
SiMBA-L(Monarch) 224^2 42M 10.7G 83.8
ViM2-T 224^2 20M - 82.7
ViM2-S 224^2 43M - 83.7
ViM2-B 224^2 74M - 83.9
SiMBA-S(EinFFT) 224^2 15.3M 2.4G 81.7
SiMBA-B(EinFFT) 224^2 22.8M 5.2G 83.5
SiMBA-L(EinFFT) 224^2 36.6M 9.6G 84.4
SiMBA-S(MLP) 224^2 26.5M 5.0G 84.0
SiMBA-B(MLP) 224^2 40.0M 9.0G 84.7

State of the Art results of various vision models (Convnets, Transformers, SSMs) on ImageNet-1K dataset with Image size 224 x 224.

Table: SOTA on ImageNet-1K. The table shows the performance of various vision backbones on the ImageNet-1K dataset for image recognition tasks. $\star$ indicates models additionally trained with Token Labeling for patch encoding. We have grouped the vision models into three categories based on their GFLOPs (Small, Base, and Large). The GFLOP ranges are: Small (GFLOPs $< 5$), Base ($5 \leq$ GFLOPs $< 10$), and Large ($10 \leq$ GFLOPs $< 30$). This table is adapted from the SiMBA paper.

Method Image Size #Param. FLOPs Top-1 acc.
Convnets
ResNet-101 $224^2$ 45M - 77.4
RegNetY-8G $224^2$ 39M 8.0G 81.7
ResNet-152 $224^2$ 60M - 78.3
RegNetY-16G $224^2$ 84M 16.0G 82.9
Transformers
DeiT-S $224^2$ 22M 4.6G 79.8
Swin-T $224^2$ 29M 4.5G 81.3
EffNet-B4 $380^2$ 19M 4.2G 82.9
WaveViT-H-S$^\star$ $224^2$ 22.7M 4.1G 82.9
SpectFormer-H-S$^\star$ $224^2$ 22.2M 3.9G 84.3
SVT-H-S$^\star$ $224^2$ 22M 3.9G 84.2
SCT-H-S$^\star$ $224^2$ 21.7M 4.1G 84.5
EffNet-B5 $456^2$ 30M 9.9G 83.6
Swin-S $224^2$ 50M 8.7G 83.0
CMT-B $224^2$ 45M 9.3G 84.5
MaxViT-S $224^2$ 69M 11.7G 84.5
iFormer-B $224^2$ 48M 9.4G 84.6
Wave-ViT-B$^\star$ $224^2$ 33M 7.2G 84.8
SpectFormer-H-B$^\star$ $224^2$ 33.1M 6.3G 85.1
SVT-H-B$^\star$ $224^2$ 32.8M 6.3G 85.2
SCT-H-B$^\star$ $224^2$ 32.5M 6.5G 85.2
M2-ViT-b $224^2$ 45M - 79.5
DeiT-B $224^2$ 86M 17.5G 81.8
Swin-B $224^2$ 88M 15.4G 83.5
M2-Swin-B $224^2$ 50M - 83.5
EffNet-B6 $528^2$ 43M 19.0G 84.0
MaxViT-B $224^2$ 120M 23.4G 85.0
VOLO-D3$^\star$ $224^2$ 86M 20.6G 85.4
Wave-ViT-L$^\star$ $224^2$ 57M 14.8G 85.5
SpectFormer-H-L$^\star$ $224^2$ 54.7M 12.7G 85.7
SVT-H-L$^\star$ $224^2$ 54.0M 12.7G 85.7
SCT-H-L$^\star$ $224^2$ 54.1M 13.4G 85.9
SSMs
Vim-Ti $224^2$ 7M - 76.1
PlainMamba-L1 $224^2$ 7M 3.0G 77.9
VMamba-T $224^2$ 22M 5.6G 82.2
SiMBA-S(Monarch) $224^2$ 18.5M 3.6G 81.1
Mamba-2D-S $224^2$ 24M - 81.7
SiMBA-S(EinFFT) $224^2$ 15.3M 2.4G 81.7
LocalVMamba-T $224^2$ 26M 5.7G 82.7
ViM2-T $224^2$ 20M - 82.7
SiMBA-S(MLP) $224^2$ 26.5M 5.0G 84.0
Vim-S $224^2$ 26M - 80.5
PlainMamba-L2 $224^2$ 25M 8.1G 81.6
SiMBA-B(Monarch) $224^2$ 26.9M 6.3G 82.6
Mamba-2D-B $224^2$ 92M - 83.0
SiMBA-B(EinFFT) $224^2$ 22.8M 5.2G 83.5
VMamba-S $224^2$ 44M 11.2G 83.5
LocalVMamba-S $224^2$ 50M 11.4G 83.7
ViM2-S $224^2$ 43M - 83.7
SiMBA-B(MLP) $224^2$ 40.0M 9.0G 84.7
HyenaViT-B $224^2$ 88M - 78.5
S4ND-ViT-B $224^2$ 89M - 80.4
PlainMamba-L3 $224^2$ 50M 14.4G 82.3
VMamba-B $224^2$ 75M 18.0G 83.2
SiMBA-L(Monarch) $224^2$ 42M 10.7G 83.8
ViM2-B $224^2$ 74M - 83.9
SiMBA-L(EinFFT) $224^2$ 36.6M 9.6G 84.4
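
The GFLOP ranges given in the table caption can be expressed as a small helper. This is a sketch under the stated thresholds only; the function name `size_group` is hypothetical, and rows whose FLOPs are listed as "-" are represented as `None` and left ungrouped.

```python
def size_group(gflops):
    """Map a GFLOPs value to the Small/Base/Large group from the caption.

    Returns None when FLOPs are unreported or fall outside the stated ranges.
    """
    if gflops is None:
        return None
    if gflops < 5:
        return "Small"
    if gflops < 10:
        return "Base"
    if gflops < 30:
        return "Large"
    return None


# A few rows from the table above: (method, GFLOPs or None when unreported).
rows = [
    ("SiMBA-S(EinFFT)", 2.4),
    ("SiMBA-B(Monarch)", 6.3),
    ("VMamba-B", 18.0),
    ("Vim-Ti", None),
]
groups = {name: size_group(g) for name, g in rows}
```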

State of the Art results of LRA benchmark tasks (Tay et al., 2020).

Table: Test accuracy on the LRA benchmark tasks (Tay et al., 2020). "✗" indicates the model did not exceed random guessing. The results for models ranging from Transformer to Performer are sourced from Tay et al. (2020). We compiled this table using data from the HGRN paper by Qin et al. (2023) and the S5 paper by Smith et al. (2022), consolidating the results into a unified presentation below.

Model ListOps Text Retrieval Image Pathfinder Path-X Avg.
Transformer 36.37 64.27 57.46 42.44 71.40 ✗ 53.66
Local Attention 15.82 52.98 53.39 41.46 66.63 ✗ 46.71
Sparse Trans. 17.07 63.58 59.59 44.24 71.71 ✗ 51.03
Longformer 35.63 62.85 56.89 42.22 69.71 ✗ 52.88
Linformer 35.70 53.94 52.27 38.56 76.34 ✗ 51.14
Reformer 37.27 56.10 53.40 38.07 68.50 ✗ 50.56
Sinkhorn Trans. 33.67 61.20 53.83 41.23 67.45 ✗ 51.23
Synthesizer 36.99 61.68 54.67 41.61 69.45 ✗ 52.40
BigBird 36.05 64.02 59.29 40.83 74.87 ✗ 54.17
Linear Trans. 16.13 65.90 53.09 42.34 75.30 ✗ 50.46
Performer 18.01 65.40 53.82 42.77 77.05 ✗ 51.18
cosFormer 36.50 67.70 83.15 51.23 71.96 - 51.76
FLASH 38.70 64.10 86.10 47.40 70.25 - 51.09
FNet 35.33 65.11 59.61 38.67 77.80 ✗ 54.42
Nyströmformer 37.15 65.52 79.56 41.58 70.94 ✗ 57.46
Luna-256 37.25 64.57 79.29 47.38 77.72 ✗ 59.37
H-Transformer-1D 49.53 78.69 63.99 46.05 68.78 ✗ 61.41
CCNN 43.60 84.08 ✗ 88.90 91.51 ✗ 68.02
S4 58.35 76.02 87.09 87.26 86.05 88.10 80.48
DSS(EXP) 59.70 84.60 87.60 84.90 84.70 85.60 81.18
DSS(SOFTMAX) 60.60 84.80 87.80 85.70 84.60 87.80 81.88
S4D-LegS 60.47 86.18 89.46 88.19 93.06 91.95 84.89
Mega-chunk 58.76 90.19 90.97 85.80 94.41 93.81 85.66
S4-LegS 59.60 86.82 90.90 88.65 94.20 96.35 86.09
TNN 61.04 87.90 90.97 88.24 93.00 96.10 86.21
LRU 60.20 89.40 89.90 89.00 95.10 94.20 86.30
HGRN 59.95 88.14 94.23 88.69 92.92 97.50 86.91
SGConv 61.45 89.2 91.11 87.97 95.46 97.83 87.17
Liquid-S4 62.75 89.02 91.20 89.50 94.8 96.66 87.32
S5 62.15 89.31 91.40 88.00 95.33 98.58 87.46
Mega 63.14 90.43 91.25 90.44 96.01 97.98 88.21
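
The Avg. column can be checked with a few lines of arithmetic. The sketch below assumes that a task marked "✗" is counted at the 50% random-guessing level when averaging; that convention is not stated in the source, but it reproduces the reported average for the Transformer row. The helper name `lra_average` is hypothetical, and not every row in the table necessarily follows the same convention.

```python
def lra_average(scores, fail_value=50.0):
    """Average the six LRA task accuracies.

    A score of None stands for a failed task ("✗") and is replaced by
    fail_value, an assumed random-guessing accuracy.
    """
    filled = [fail_value if s is None else s for s in scores]
    return sum(filled) / len(filled)


# ListOps, Text, Retrieval, Image, Pathfinder, Path-X (values from the table above).
transformer = [36.37, 64.27, 57.46, 42.44, 71.40, None]
s5 = [62.15, 89.31, 91.40, 88.00, 95.33, 98.58]

transformer_avg = lra_average(transformer)  # about 53.66, matching the table
s5_avg = lra_average(s5)                    # about 87.46, matching the table
```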

Multivariate Time series benchmark Datasets

Table: Multivariate long-term forecasting results with prediction lengths $T \in \{96, 192, 336, 720\}$ for all datasets and a lookback window of 96. The best results are in bold and the second best are underlined. This table is adapted from the SiMBA paper [@patro2024simba].

Dataset T SiMBA TimesNet Crossformer PatchTST ETSFormer DLinear FEDFormer Autoformer Pyraformer MTGNN
Metric MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE MSE MAE
ETTm1 96 0.324 0.360 0.338 0.375 0.349 0.395 0.339 0.377 0.375 0.398 0.345 0.372 0.379 0.419 0.505 0.475 0.543 0.510 0.379 0.446
192 0.363 0.382 0.374 0.387 0.405 0.411 0.376 0.392 0.408 0.410 0.380 0.389 0.426 0.441 0.553 0.496 0.557 0.537 0.470 0.428
336 0.395 0.405 0.410 0.411 0.432 0.431 0.408 0.417 0.435 0.428 0.413 0.413 0.445 0.459 0.621 0.537 0.754 0.655 0.473 0.430
720 0.451 0.437 0.478 0.450 0.487 0.463 0.499 0.461 0.499 0.462 0.474 0.453 0.543 0.490 0.671 0.561 0.908 0.724 0.553 0.479
ETTm2 96 0.177 0.263 0.187 0.267 0.208 0.292 0.192 0.273 0.189 0.280 0.193 0.292 0.203 0.287 0.255 0.339 0.435 0.507 0.203 0.299
192 0.245 0.306 0.249 0.309 0.263 0.332 0.252 0.314 0.253 0.319 0.284 0.362 0.269 0.328 0.281 0.340 0.730 0.673 0.265 0.328
336 0.304 0.343 0.321 0.351 0.337 0.369 0.318 0.357 0.314 0.357 0.369 0.427 0.325 0.366 0.339 0.372 1.201 0.845 0.365 0.374
720 0.400 0.399 0.408 0.403 0.429 0.430 0.413 0.416 0.414 0.413 0.554 0.522 0.421 0.415 0.433 0.432 3.625 1.451 0.461 0.459
ETTh1 96 0.379 0.395 0.384 0.402 0.384 0.428 0.385 0.408 0.494 0.479 0.386 0.400 0.376 0.419 0.449 0.459 0.664 0.612 0.515 0.517
192 0.432 0.424 0.436 0.429 0.438 0.452 0.431 0.432 0.538 0.504 0.437 0.432 0.420 0.448 0.500 0.482 0.790 0.681 0.553 0.522
336 0.473 0.443 0.491 0.469 0.495 0.483 0.485 0.462 0.574 0.521 0.481 0.459 0.459 0.465 0.521 0.496 0.891 0.738 0.612 0.577
720 0.483 0.469 0.521 0.500 0.522 0.501 0.497 0.483 0.562 0.535 0.519 0.516 0.506 0.507 0.514 0.512 0.963 0.782 0.609 0.597
ETTh2 96 0.290 0.339 0.340 0.374 0.347 0.391 0.343 0.376 0.340 0.391 0.333 0.387 0.358 0.397 0.346 0.388 0.645 0.597 0.354 0.454
192 0.373 0.390 0.402 0.414 0.419 0.427 0.405 0.417 0.430 0.439 0.477 0.476 0.429 0.439 0.456 0.452 0.788 0.683 0.457 0.464
336 0.376 0.406 0.452 0.452 0.449 0.465 0.448 0.453 0.485 0.479 0.594 0.541 0.496 0.487 0.482 0.486 0.907 0.747 0.515 0.540
720 0.407 0.431 0.462 0.468 0.479 0.505 0.464 0.483 0.500 0.497 0.831 0.657 0.463 0.474 0.515 0.511 0.963 0.783 0.532 0.576
Electricity 96 0.165 0.253 0.168 0.272 0.185 0.288 0.159 0.268 0.187 0.304 0.197 0.282 0.193 0.308 0.201 0.317 0.386 0.449 0.217 0.318
192 0.173 0.262 0.198 0.300 0.211 0.312 0.195 0.296 0.212 0.329 0.209 0.301 0.214 0.329 0.231 0.338 0.376 0.443 0.260 0.348
336 0.188 0.277 0.198 0.300 0.211 0.312 0.195 0.296 0.212 0.329 0.209 0.301 0.214 0.329 0.231 0.338 0.376 0.443 0.260 0.348
720 0.214 0.305 0.220 0.320 0.223 0.335 0.215 0.317 0.233 0.345 0.245 0.333 0.246 0.355 0.254 0.361 0.376 0.445 0.290 0.369
Traffic 96 0.468 0.268 0.593 0.321 0.591 0.329 0.583 0.319 0.607 0.392 0.650 0.396 0.587 0.366 0.613 0.388 0.867 0.468 0.660 0.437
192 0.413 0.317 0.617 0.336 0.607 0.345 0.591 0.331 0.621 0.399 0.598 0.370 0.604 0.373 0.616 0.382 0.869 0.467 0.649 0.438
336 0.529 0.284 0.629 0.336 0.613 0.339 0.599 0.332 0.622 0.396 0.605 0.373 0.621 0.383 0.622 0.337 0.881 0.469 0.653 0.472
720 0.564 0.297 0.640 0.350 0.620 0.348 0.601 0.341 0.632 0.396 0.645 0.394 0.626 0.382 0.660 0.408 0.896 0.473 0.639 0.437
Weather 96 0.176 0.219 0.172 0.220 0.191 0.251 0.171 0.230 0.197 0.281 0.196 0.255 0.217 0.296 0.266 0.336 0.622 0.556 0.230 0.329
192 0.222 0.260 0.219 0.261 0.219 0.279 0.219 0.271 0.237 0.312 0.237 0.296 0.276 0.336 0.307 0.367 0.739 0.624 0.263 0.322
336 0.275 0.297 0.280 0.306 0.287 0.332 0.277 0.321 0.298 0.353 0.283 0.335 0.339 0.380 0.359 0.395 1.004 0.753 0.354 0.396
720 0.350 0.349 0.365 0.359 0.368 0.378 0.365 0.367 0.352 0.288 0.345 0.381 0.403 0.428 0.419 0.428 1.420 0.934 0.409 0.371
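
For reference, the two metrics reported in the table above, mean squared error (MSE) and mean absolute error (MAE), can be computed as follows. This is a minimal sketch on a toy series; the values are made up for illustration, and the benchmark results are assumed to be computed over normalized multivariate series and full prediction horizons, which this example does not reproduce.

```python
def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy ground truth and forecast (illustrative values only).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.0, 4.5]

toy_mse = mse(y_true, y_pred)  # (0.25 + 0 + 1 + 0.25) / 4 = 0.375
toy_mae = mae(y_true, y_pred)  # (0.5 + 0 + 1 + 0.5) / 4 = 0.5
```

Lower is better for both metrics; MSE penalizes large errors more heavily than MAE.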

Comparison with SoTA methods on 8 benchmark datasets for Multimodal Applications

Benchmark names are abbreviated due to space limits. VQA-v2 (Goyal et al., 2017); GQA (Hudson and Manning, 2019); SQA-I: ScienceQA-IMG (Anonymous, 2022); VQA-T: TextVQA (Singh et al., 2019); POPE (Li et al., 2023); MME (Yin et al., 2023); MMB: MMBench (Liu et al., 2023); MM-Vet (Yu et al., 2023). PT and IT indicate the number of samples in the pretraining and instruction tuning stages, respectively.

Table: Comparison with State-of-the-Art (SoTA) methods on 8 benchmarks. This table is adapted from the VL-Mamba paper (Qiao et al., 2024).

Method LLM PT IT VQA-v2 GQA SQA-I VQA-T POPE MME MMB MM-Vet
BLIP-2 Vicuna-13B 129M - 41.0 41.0 61.0 42.5 85.3 1293.8 - 22.4
MiniGPT-4 Vicuna-7B 5M 5K - 32.2 - - - 581.7 23.0 -
InstructBLIP Vicuna-7B 129M 1.2M - 49.2 60.5 50.1 - - 36 26.2
InstructBLIP Vicuna-13B 129M 1.2M - 49.5 63.1 50.7 78.9 1212.8 - 25.6
Shikra Vicuna-13B 600K 5.5M 77.4 - - - - - 58.8 -
Otter LLaMA-7B - - - - - - - 1292.3 48.3 24.6
mPLUG-Owl LLaMA-7B 2.1M 102K - - - - - 967.3 49.4 -
IDEFICS-9B LLaMA-7B 353M 1M 50.9 38.4 - 25.9 - - 48.2 -
IDEFICS-80B LLaMA-65B 353M 1M 60.0 45.2 - 30.9 - - 54.5 -
Qwen-VL Qwen-7B 1.4B 50M 78.8 59.3 67.1 63.8 - - 38.2 -
Qwen-VL-Chat Qwen-7B 1.4B 50M 78.2 57.5 68.2 61.5 - 1487.5 60.6 -
LLaVA-1.5 Vicuna-7B 558K 665K 78.5 62.0 66.8 58.2 85.9 1510.7 64.3 30.5
LLaVA-1.5 Vicuna-13B 558K 665K 80.0 63.3 71.6 61.3 85.9 1531.3 67.7 35.4
LLaVA-Phi Phi-2-2.7B 558K 665K 71.4 - 68.4 48.6 85.0 1335.1 59.8 28.9
MobileVLM-3B MobileLLaMA-2.7B 558K 665K - 59.0 61.2 47.5 84.9 1288.9 59.6 -
Cobra Mamba-2.8B - - 75.9 58.5 - 46.0 88.0 - - -
VL-Mamba Mamba LLM-2.8B 558K 665K 76.6 56.2 65.4 48.9 84.4 1369.6 57.0 32.6