This repository explores finetuning the DINOv2 (Oquab et al., 2024) encoder weights using Low-Rank Adaptation (LoRA) (Hu et al., 2021) and a simple 1x1 convolution decoder. LoRA makes it easier to finetune on new tasks without adjusting the original encoder weights, by adding a small set of trainable weights between each encoder block. The DINOv2 encoder weights are learned by self-supervised learning and capture the natural image domain accurately. For example, by just applying PCA to the encoder outputs we already get a coarse segmentation of the objects in the image, with semantically similar objects colored the same.
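As a rough illustration of the idea (a minimal sketch, not this repository's exact implementation), a LoRA adapter keeps the pretrained linear layer frozen and adds a trainable low-rank update; the class name and initialization below are assumptions for the sketch:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA sketch: frozen base layer plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # the original (DINOv2) weights stay frozen
            p.requires_grad = False
        # Low-rank factors A (r x in) and B (out x r); only these are trained
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W x + (alpha / r) * B A x
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)
```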
Check out the `Explanation.ipynb` notebook for a more detailed walkthrough of the code and the ideas behind it.
Install the packages using the `requirements.txt` file.

```bash
# using conda
conda create --name dino python=3.11
conda activate dino

# install the dependencies
pip install -r requirements.txt

# install the package for dino_finetune imports
pip install -e .
```
In the section below, I explain all the flags used in `main.py` to finetune on the different datasets.
An example run finetuning on the VOC dataset with LoRA and an FPN decoder:

```bash
python main.py --exp_name base_voc --dataset voc --size base --use_lora --img_dim 308 308 --epochs 50 --use_fpn
```
Flags

Some explanation of the more useful flags to use when running experiments.
- --exp_name (str): The name of the experiment. This is used to identify the experiment and save results accordingly.
- --debug (flag): A boolean flag to indicate whether to debug the main.py training code.
- --dataset (str): The name of the dataset to use, either `voc` or `ade20k`.
- --size (str): The size configuration for the DINOv2 backbone: `small`, `base`, `large`, or `giant`.
- --r (int): The LoRA rank (r), which determines the number of trainable adapter parameters. Usually a small value, e.g. 3-9.
- --use_lora (flag): A boolean flag indicating whether to use Low-Rank Adaptation (LoRA). If this flag is present, LoRA is used.
- --use_fpn (flag): A boolean flag to indicate whether to use the FPN decoder.
- --lora_weights (str): Path to the file location to load the LoRA weights and decoder head from.
- --img_dim (tuple of int): The dimensions of the input images (height width), specified as two integers. Example: 308 308 (a multiple of the DINOv2 patch size of 14).
- --epochs (int): The number of training epochs. This determines how many times the model will pass through the entire training dataset. Example: 50.
There are a few more training parameters not listed here, such as the learning rate and batch size.
Pascal VOC
I achieve a validation mean IoU of approximately 85.2% using LoRA and a 1x1 convolution decoder. When applying ImageNet-C corruptions (Hendrycks & Dietterich, 2019) to test robustness on Pascal VOC, the validation mean IoU drops to 72.2% at corruption severity level 5 (the maximum). The qualitative performance of this network is illustrated in the figure below. Based on these qualitative and quantitative results, the pre-trained weights handle image corruptions effectively.
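As an illustration of how such a corruption could be applied (an assumption about tooling, not necessarily how the corrupted validation sets are generated in this repository), the `imagecorruptions` package exposes the ImageNet-C corruption functions:

```python
import numpy as np
from imagecorruptions import corrupt  # pip install imagecorruptions

def corrupt_image(image: np.ndarray, severity: int = 5) -> np.ndarray:
    """Apply one ImageNet-C style corruption to an HxWx3 uint8 image."""
    # 'gaussian_noise' is one of the 15 corruption types; severity ranges from 1 to 5.
    return corrupt(image, corruption_name="gaussian_noise", severity=severity)
```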
You can use the pre-trained weights with the --lora_weights flag or by calling the `load_parameters` function. "With registers" means that extra global context tokens are learned; see the second reference (Darcet et al., 2024). All models are finetuned for 100 epochs.
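A rough sketch of loading such a checkpoint from Python; the import path, constructor arguments, and the exact `load_parameters` signature are assumptions here, so check the `dino_finetune` package for the actual API:

```python
# Hypothetical usage sketch; names and signatures below are assumptions.
from dino_finetune import DINOV2EncoderLoRA  # assumed import path

model = DINOV2EncoderLoRA(r=3, use_lora=True)  # configure to match the checkpoint
model.load_parameters("output/base_voc.pt")    # assumed: path to LoRA + decoder weights
model.eval()
```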
| finetuned components | model | # of params | with registers | Pascal VOC validation mIoU | Pascal VOC-C level 5 validation mIoU | Directory |
|---|---|---|---|---|---|---|
| 1x1 Conv decoder | ViT-L/14 distilled | 300 M | ✅ | 70.9% | 66.6% | `output/base_voc_no_lora.pt` |
| LoRA + 1x1 Conv decoder | ViT-L/14 distilled | 300 M | ✅ | 85.2% | 72.2% | `output/base_voc.pt` |
| LoRA + FPN decoder | ViT-L/14 distilled | 300 M | ✅ | 74.1% | 65.6% | `output/base_voc_fpn.pt` |
ADE20k
I achieve a validation mean IoU of approximately 62.2% using LoRA and a 1x1 convolution decoder. On ADE20k-C at corruption severity level 5, the validation mean IoU drops to 55.8%. The qualitative performance of this network is illustrated in the figure below.
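For reference, mean IoU here is the standard per-class intersection-over-union averaged over classes; a minimal sketch (not this repository's exact evaluation code):

```python
import numpy as np

def mean_iou(pred: np.ndarray, target: np.ndarray, n_classes: int) -> float:
    """Mean IoU over classes present in prediction or ground truth (integer label maps)."""
    ious = []
    for c in range(n_classes):
        pred_c, target_c = pred == c, target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:  # class absent in both prediction and ground truth
            continue
        intersection = np.logical_and(pred_c, target_c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious))
```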
| finetuned components | model | # of params | with registers | ADE20k validation mIoU | ADE20k-C level 5 validation mIoU | Directory |
|---|---|---|---|---|---|---|
| 1x1 Conv decoder | ViT-L/14 distilled | 300 M | ✅ | 57.2% | 54.4% | `output/base_ade20k_no_lora.pt` |
| LoRA + 1x1 Conv decoder | ViT-L/14 distilled | 300 M | ✅ | 62.2% | 55.8% | `output/base_ade20k_lora.pt` |
| LoRA + FPN decoder | ViT-L/14 distilled | 300 M | ✅ | 62.0% | 54.7% | `output/base_ade20k_fpn.pt` |
If you reference or use the codebase in your research, please cite:
```bibtex
@article{2024dinov2_lora_seg,
  title={Finetuning DINOv2 with LoRA for Image Segmentation},
  author={Rob van Gastel},
  year={2024}
}
```
Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P.-Y., Li, S.-W., Misra, I., Rabbat, M., Sharma, V., … Bojanowski, P. (2024). DINOv2: Learning Robust Visual Features without Supervision (arXiv:2304.07193). arXiv. http://arxiv.org/abs/2304.07193
Darcet, T., Oquab, M., Mairal, J., & Bojanowski, P. (2024). Vision Transformers Need Registers (arXiv:2309.16588). arXiv. https://doi.org/10.48550/arXiv.2309.16588
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2021). LoRA: Low-Rank Adaptation of Large Language Models (arXiv:2106.09685). arXiv. http://arxiv.org/abs/2106.09685
Hendrycks, D., & Dietterich, T. G. (2019). Benchmarking Neural Network Robustness to Common Corruptions and Surface Variations (arXiv:1807.01697). arXiv. https://doi.org/10.48550/arXiv.1807.01697