This repository explores finetuning the DINOv2 (Oquab et al., 2024) encoder weights using Low-Rank Adaptation (LoRA) (Hu et al., 2021) and a simple 1x1 convolution decoder. LoRA makes it easier to finetune on new tasks without adjusting the original encoder weights, by adding a small set of trainable weights between each encoder block. The DINOv2 encoder weights are learned by self-supervised learning and capture the natural image domain accurately. For example, by just applying PCA to the encoder outputs we already get a coarse segmentation of the objects in the image, with semantically similar objects colored the same.
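As a rough illustration of the idea (a minimal sketch, not this repository's exact implementation), a LoRA adapter keeps the pretrained linear layer frozen and adds a trainable low-rank update; the class name and initialization below are assumptions for the sketch:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA sketch: frozen base layer plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # the original (DINOv2) weights stay frozen
            p.requires_grad = False
        # Low-rank factors A (r x in) and B (out x r); only these are trained
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W x + (alpha / r) * B A x
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)
```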
Check out the `Explanation.ipynb` notebook for a more detailed walkthrough of the code and the ideas behind it.
Install the packages using the `requirements.txt` file.

```bash
# using conda
conda create --name dino python=3.11
conda activate dino

# install the dependencies
pip install -r requirements.txt

# install the package for dino_finetune imports
pip install -e .
```
In the section below, I explain all the flags used in `main.py` to finetune on the different datasets.
An example run finetuning on the VOC dataset with LoRA and an FPN decoder:

```bash
python main.py --exp_name base_voc --dataset voc --size base --use_lora --img_dim 308 308 --epochs 50 --use_fpn
```
Flags

Some explanation of the more useful flags to use when running experiments.
- --exp_name (str): The name of the experiment. This is used to identify the experiment and save results accordingly.
- --debug (flag): A boolean flag to indicate whether to debug the main.py training code.
- --dataset (str): The name of the dataset to use, either `voc` or `ade20k`.
- --size (str): The size configuration for the DINOv2 backbone: `small`, `base`, `large`, or `giant`.
- --r (int): The LoRA rank (r), which determines the number of trainable adapter parameters. Usually a small value, e.g. 3-9.
- --use_lora (flag): A boolean flag indicating whether to use Low-Rank Adaptation (LoRA). If this flag is present, LoRA is used.
- --use_fpn (flag): A boolean flag to indicate whether to use the FPN decoder.
- --lora_weights (str): Path to the file location to load the LoRA weights and decoder head from.
- --img_dim (tuple of int): The dimensions of the input images (height width), specified as two integers. Example: 308 308 (a multiple of the DINOv2 patch size of 14).
- --epochs (int): The number of training epochs. This determines how many times the model will pass through the entire training dataset. Example: 50.
There are a few more training parameters not listed here, such as the learning rate and batch size.
Pascal VOC
I achieve a validation mean IoU of approximately 85.2% using LoRA and a 1x1 convolution decoder. When applying ImageNet-C corruptions (Hendrycks & Dietterich, 2019) to test robustness on Pascal VOC, the validation mean IoU drops to 72.2% at corruption severity level 5 (the maximum). The qualitative performance of this network is illustrated in the figure below. Based on these qualitative and quantitative results, the pre-trained weights handle image corruptions effectively.
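As an illustration of how such a corruption could be applied (an assumption about tooling, not necessarily how the corrupted validation sets are generated in this repository), the `imagecorruptions` package exposes the ImageNet-C corruption functions:

```python
import numpy as np
from imagecorruptions import corrupt  # pip install imagecorruptions

def corrupt_image(image: np.ndarray, severity: int = 5) -> np.ndarray:
    """Apply one ImageNet-C style corruption to an HxWx3 uint8 image."""
    # 'gaussian_noise' is one of the 15 corruption types; severity ranges from 1 to 5.
    return corrupt(image, corruption_name="gaussian_noise", severity=severity)
```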
You can use the pre-trained weights with the --lora_weights flag or by calling the `load_parameters` function. "With registers" means that extra global context tokens are learned; see the second reference (Darcet et al., 2024). All models are finetuned for 100 epochs.
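A rough sketch of loading such a checkpoint from Python; the import path, constructor arguments, and the exact `load_parameters` signature are assumptions here, so check the `dino_finetune` package for the actual API:

```python
# Hypothetical usage sketch; names and signatures below are assumptions.
from dino_finetune import DINOV2EncoderLoRA  # assumed import path

model = DINOV2EncoderLoRA(r=3, use_lora=True)  # configure to match the checkpoint
model.load_parameters("output/base_voc.pt")    # assumed: path to LoRA + decoder weights
model.eval()
```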
| finetuned components | model | # of params | with registers | Pascal VOC validation mIoU | Pascal VOC-C level 5 validation mIoU | Directory |
|---|---|---|---|---|---|---|
| 1x1 Conv decoder | ViT-L/14 distilled | 300 M | ✅ | 70.9% | 66.6% | `output/base_voc_no_lora.pt` |
| LoRA + 1x1 Conv decoder | ViT-L/14 distilled | 300 M | ✅ | 85.2% | 72.2% | `output/base_voc.pt` |
| LoRA + FPN decoder | ViT-L/14 distilled | 300 M | ✅ | 74.1% | 65.6% | `output/base_voc_fpn.pt` |
ADE20k
I achieve a validation mean IoU of approximately 62.2% using LoRA and a 1x1 convolution decoder. On ADE20k-C at corruption severity level 5, the validation mean IoU drops to 55.8%. The qualitative performance of this network is illustrated in the figure below.
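For reference, mean IoU here is the standard per-class intersection-over-union averaged over classes; a minimal sketch (not this repository's exact evaluation code):

```python
import numpy as np

def mean_iou(pred: np.ndarray, target: np.ndarray, n_classes: int) -> float:
    """Mean IoU over classes present in prediction or ground truth (integer label maps)."""
    ious = []
    for c in range(n_classes):
        pred_c, target_c = pred == c, target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:  # class absent in both prediction and ground truth
            continue
        intersection = np.logical_and(pred_c, target_c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious))
```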
| finetuned components | model | # of params | with registers | ADE20k validation mIoU | ADE20k-C level 5 validation mIoU | Directory |
|---|---|---|---|---|---|---|
| 1x1 Conv decoder | ViT-L/14 distilled | 300 M | ✅ | 57.2% | 54.4% | `output/base_ade20k_no_lora.pt` |
| LoRA + 1x1 Conv decoder | ViT-L/14 distilled | 300 M | ✅ | 62.2% | 55.8% | `output/base_ade20k_lora.pt` |
| LoRA + FPN decoder | ViT-L/14 distilled | 300 M | ✅ | 62.0% | 54.7% | `output/base_ade20k_fpn.pt` |
If you reference or use the codebase in your research, please cite:
```bibtex
@article{2024dinov2_lora_seg,
  title={Finetuning DINOv2 with LoRA for Image Segmentation},
  author={Rob van Gastel},
  year={2024}
}
```
Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P.-Y., Li, S.-W., Misra, I., Rabbat, M., Sharma, V., … Bojanowski, P. (2024). DINOv2: Learning Robust Visual Features without Supervision (arXiv:2304.07193). arXiv. http://arxiv.org/abs/2304.07193
Darcet, T., Oquab, M., Mairal, J., & Bojanowski, P. (2024). Vision Transformers Need Registers (arXiv:2309.16588). arXiv. https://doi.org/10.48550/arXiv.2309.16588
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2021). LoRA: Low-Rank Adaptation of Large Language Models (arXiv:2106.09685). arXiv. http://arxiv.org/abs/2106.09685
Hendrycks, D., & Dietterich, T. G. (2019). Benchmarking Neural Network Robustness to Common Corruptions and Surface Variations (arXiv:1807.01697). arXiv. https://doi.org/10.48550/arXiv.1807.01697