[Project Page] | [Paper] | [Models] | [Codebase Demo Video] | [BibTeX]
PyTorch implementation and pre-trained models of Vision-LSTM (ViL), an adaptation of xLSTM to computer vision.
This project is licensed under the MIT License, except the following folders/files, which are licensed under the AGPL-3.0 license:
- src/vislstm/modules/xlstm
- vision_lstm.py
The package vision_lstm provides a standalone implementation in the style of timm.
Models can be loaded with a single line via torch.hub:
import torch

# load ViL-T
model = torch.hub.load("nx-ai/vision-lstm", "VisionLSTM")

# load your own model
model = torch.hub.load(
    "nx-ai/vision-lstm",
    "VisionLSTM",
    dim=192,  # latent dimension (192 for ViL-T)
    depth=24,  # number of ViL blocks
    patch_size=16,  # patch size (results in 196 patches for 224x224 images)
    input_shape=(3, 224, 224),  # RGB images with resolution 224x224
    output_shape=(1000,),  # classifier with 1000 classes
    drop_path_rate=0.05,  # stochastic depth parameter
    stride=None,  # set to 8 for long-sequence fine-tuning
)
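Once loaded, the model behaves like a regular torch.nn.Module. A minimal forward-pass sketch (the dummy batch is illustrative and assumes the classifier head returns logits of shape (batch_size, 1000), matching output_shape above):

import torch

# instantiate ViL-T as above and switch to inference mode
model = torch.hub.load("nx-ai/vision-lstm", "VisionLSTM")
model.eval()

# forward a dummy batch of two 224x224 RGB images
x = torch.randn(2, 3, 224, 224)
with torch.no_grad():
    logits = model(x)
print(logits.shape)  # expected: torch.Size([2, 1000])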
To set up the codebase, follow the instructions in SETUP.md. To start runs, follow the instructions in RUN.md.
Pre-trained ImageNet-1K models can be loaded via torch.hub or downloaded directly from here.
# pre-trained models (Table 1, left)
model = torch.hub.load("nx-ai/vision-lstm", "vil-tiny")
model = torch.hub.load("nx-ai/vision-lstm", "vil-tinyplus")
model = torch.hub.load("nx-ai/vision-lstm", "vil-small")
model = torch.hub.load("nx-ai/vision-lstm", "vil-smallplus")
model = torch.hub.load("nx-ai/vision-lstm", "vil-base")
# long-sequence fine-tuned models (Table 1, right)
model = torch.hub.load("nx-ai/vision-lstm", "vil-tinyplus-stride8")
model = torch.hub.load("nx-ai/vision-lstm", "vil-smallplus-stride8")
model = torch.hub.load("nx-ai/vision-lstm", "vil-base-stride8")
# tiny models trained for only 400 epochs (Appendix A.2)
model = torch.hub.load("nx-ai/vision-lstm", "vil-tiny-e400")
model = torch.hub.load("nx-ai/vision-lstm", "vil-tinyplus-e400")
An example of how to use these models can be found in eval.py, which evaluates them on the ImageNet-1K validation set.
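A condensed sketch of such an evaluation loop is shown below (the dataset path, batch size, and GPU usage are placeholders/assumptions; eval.py remains the authoritative reference):

import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# standard ImageNet-1K preprocessing (assumed; see eval.py for the exact pipeline)
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# ImageNet-1K validation set in torchvision ImageFolder layout (path is a placeholder)
dataset = datasets.ImageFolder("/path/to/imagenet/val", transform=preprocess)
loader = DataLoader(dataset, batch_size=256, num_workers=8)

# assumes a CUDA-capable GPU is available
model = torch.hub.load("nx-ai/vision-lstm", "vil-tinyplus").cuda().eval()

correct = total = 0
with torch.no_grad():
    for images, labels in loader:
        logits = model(images.cuda())
        correct += (logits.argmax(dim=-1).cpu() == labels).sum().item()
        total += labels.numel()
print(f"top-1 accuracy: {correct / total:.4f}")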
Checkpoints for our reimplementation of DeiT-III-T are provided as raw checkpoints here.
This codebase is an improved version of the one used for MIM-Refiner, for which a demo video walks through the codebase structure.
If you like our work, please consider giving it a star ⭐ and citing us:
@article{alkin2024visionlstm,
  title={Vision-LSTM: xLSTM as Generic Vision Backbone},
  author={Benedikt Alkin and Maximilian Beck and Korbinian P{\"o}ppel and Sepp Hochreiter and Johannes Brandstetter},
  journal={arXiv preprint arXiv:2406.04303},
  year={2024}
}