
DVT: Denoising Vision Transformers

2024-01-19: We will release our denoiser checkpoints within two weeks.

2024-02-05: Feel free to check out the preliminary checkpoints available here. These denoisers were trained on the 10k VOC samples denoised in stage 1 and are the ones evaluated in the paper. We plan to update the instructions for using these checkpoints soon; for now, please refer to the video_generation.py script for an example of their usage. We are still working on checkpoints trained on ImageNet and will share them later.
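Until the detailed instructions are posted, a minimal sketch for inspecting a downloaded checkpoint might look like the following. The file name and checkpoint layout here are assumptions on our part; video_generation.py shows the actual usage.

```python
import torch

# Hypothetical file name; substitute whichever checkpoint you downloaded.
ckpt_path = "checkpoints/dvt_denoiser_voc.pth"

# Load on CPU so this works without a GPU.
state = torch.load(ckpt_path, map_location="cpu")

# The layout is an assumption: it may be a raw state_dict or a wrapper dict.
# Print the top-level keys and tensor shapes to find out.
for key, value in list(state.items())[:10]:
    shape = tuple(value.shape) if torch.is_tensor(value) else type(value).__name__
    print(key, shape)
```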


This is the official code release for

Denoising Vision Transformers.

by Jiawei Yang†*, Katie Z Luo*, Jiefeng Li, Kilian Q. Weinberger, Yonglong Tian, and Yue Wang

Paper | arXiv | Project Page

* equal technical contribution † project lead


Abstract

We delve into a nuanced but significant challenge inherent to Vision Transformers (ViTs): feature maps of these models exhibit grid-like artifacts, which detrimentally hurt the performance of ViTs in downstream tasks. Our investigations trace this fundamental issue down to the positional embeddings at the input stage. To address this, we propose a novel noise model, which is universally applicable to all ViTs. Specifically, the noise model dissects ViT outputs into three components: a semantics term free from noise artifacts and two artifact-related terms that are conditioned on pixel locations. Such a decomposition is achieved by enforcing cross-view feature consistency with neural fields. This per-image optimization process extracts artifact-free features from raw ViT outputs, providing clean ViT features for offline applications. Furthermore, we introduce a learnable denoiser to predict artifact-free features directly from unprocessed ViT outputs, capable of generalizing to unseen data without the need for per-image optimization. Our two-stage approach, which we term Denoising Vision Transformers (DVT), does not require re-training existing pre-trained ViTs, and is immediately applicable to any Transformer-based architecture. We evaluate our method on a variety of representative ViTs (DINO, MAE, DeiT-III, EVA02, CLIP, DINOv2, DINOv2-reg). Extensive evaluations demonstrate that our DVT consistently and significantly improves existing state-of-the-art general-purpose models in semantic and geometric tasks across multiple datasets (e.g., +3.84 mIoU). We hope our study will encourage a re-evaluation of ViT design, especially regarding the naive use of positional embeddings.
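As a rough schematic of the decomposition described above (the notation below is ours for illustration, not the paper's exact formulation):

```latex
% Illustrative notation (ours): f(x) is the artifact-free semantics term,
% g is the shared artifact that depends only on patch position, and
% h is a residual artifact term depending on both content and position.
\[
\mathrm{ViT}(x)_{i} \;\approx\; f(x)_{i} \;+\; g(i) \;+\; h(x, i),
\qquad i = \text{patch location in the feature map.}
\]
```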

TL;DR:

We identify pervasive grid-like artifacts in ViT feature maps caused by positional embeddings and propose a two-stage approach to remove them, which significantly improves the feature quality of various pre-trained ViTs.

Citation

@article{yang2024denoising,
  author = {Jiawei Yang and Katie Z Luo and Jiefeng Li and Kilian Q Weinberger and Yonglong Tian and Yue Wang},
  title = {Denoising Vision Transformers},
  journal = {arXiv preprint arXiv:2401.02957},
  year = {2024},
}

Installation

  1. Create a conda environment.
conda create -n dvt python=3.9 -y
  2. Activate the environment.
conda activate dvt
  3. Install dependencies from requirements.txt.
pip install -r requirements.txt
  4. Install tiny-cuda-nn manually:
pip install git+https://github.com/NVlabs/tiny-cuda-nn/#subdirectory=bindings/torch

If you encounter the error nvcc fatal : Unsupported gpu architecture compute_89, try the following command:

TCNN_CUDA_ARCHITECTURES=86 pip install git+https://github.com/NVlabs/tiny-cuda-nn/#subdirectory=bindings/torch

If you encounter the error parameter packs not expanded with ‘...’, refer to this solution on GitHub.
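After installation, a quick sanity check along these lines can confirm that the tiny-cuda-nn CUDA kernels load. This is a minimal sketch: it assumes a CUDA-capable GPU, and the encoding config values are arbitrary.

```python
import torch
import tinycudann as tcnn  # installed in step 4

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())

# Build a small hash-grid encoding to verify that the CUDA extension works.
encoding = tcnn.Encoding(
    n_input_dims=2,
    encoding_config={
        "otype": "HashGrid",
        "n_levels": 4,
        "n_features_per_level": 2,
        "log2_hashmap_size": 15,
        "base_resolution": 16,
        "per_level_scale": 1.5,
    },
)
x = torch.rand(8, 2, device="cuda")
print("encoded shape:", tuple(encoding(x).shape))
```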

Data preparation

  1. PASCAL-VOC 2007 and 2012: Please download the PASCAL VOC07 and PASCAL VOC12 datasets (link) and put the data in the folder data, e.g.,
mkdir -p data
cd data
wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtrainval_06-Nov-2007.tar
tar -xf VOCtrainval_06-Nov-2007.tar
wget http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar
tar -xf VOCtrainval_11-May-2012.tar

In the experiments reported in the paper, we used the first 10,000 examples from data/voc_train.txt for stage-1 denoising. This text file was generated by gathering all JPG images from data/VOC2007/JPEGImages and data/VOC2012/JPEGImages, excluding the validation images, and then randomly shuffling them (a sketch of such a script is shown below).
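A minimal sketch of such a script follows. The directory layout and the use of the segmentation val splits to define "validation images" are assumptions, and the resulting shuffle will not exactly reproduce the released data/voc_train.txt.

```python
import random
from pathlib import Path

# Assumed layout; adjust if your extraction kept the VOCdevkit/ prefix.
image_dirs = [Path("data/VOC2007/JPEGImages"), Path("data/VOC2012/JPEGImages")]
val_split_files = [
    Path("data/VOC2007/ImageSets/Segmentation/val.txt"),
    Path("data/VOC2012/ImageSets/Segmentation/val.txt"),
]

# Collect image IDs to exclude (assumed here: the segmentation val splits).
val_ids = set()
for split_file in val_split_files:
    if split_file.exists():
        val_ids.update(split_file.read_text().split())

# Gather all remaining JPG images and shuffle them.
images = [
    str(p)
    for d in image_dirs
    for p in sorted(d.glob("*.jpg"))
    if p.stem not in val_ids
]
random.shuffle(images)

Path("data/voc_train.txt").write_text("\n".join(images) + "\n")
print(f"Wrote {len(images)} image paths to data/voc_train.txt")
```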

  2. ADE20K: [legacy, need to check] Please download the ADE20K dataset and put the data in data/ADEChallengeData2016.

  3. NYU-D: Please download the NYU-depth dataset and put the data in data/nyu. Results are reported using the 2014 annotations, following previous works.

  4. ImageNet (Optional):

Run the code

See sample_scripts for examples of running the code.

We provide some demo outputs in demo/demo_outputs. For example, the figure there shows our denoising results on a cat image. From left to right, we show: (1) the input crop, (2) the raw DINOv2 base output, (3) K-means clustering of the raw output, (4) the L2 feature norm of the raw output, (5) the similarity between the central patch and other patches in the raw output, (6) our denoised output, (7) K-means clustering of the denoised output, (8) the L2 feature norm of the denoised output, (9) the similarity between the central patch and other patches in the denoised output, (10) the decomposed shared artifacts, (11) the L2 norm of the shared artifacts, (12) the ground-truth residual error, (13) the predicted residual term, and (14) the composition of the shared artifacts and the predicted residual term.
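To reproduce this style of visualization from your own feature maps, the three per-panel computations (K-means clustering, per-patch L2 norm, and cosine similarity to the central patch) can be sketched as follows. The (H, W, C) feature layout is an assumption, and this is not the repo's actual plotting code.

```python
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def feature_panels(feats: torch.Tensor, n_clusters: int = 8):
    """feats: (H, W, C) patch features from a ViT (raw or denoised)."""
    H, W, C = feats.shape
    flat = feats.reshape(-1, C)

    # K-means clustering of the patch features (panels 3 and 7).
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(flat.cpu().numpy())
    cluster_map = labels.reshape(H, W)

    # Per-patch L2 feature norm (panels 4 and 8).
    norm_map = flat.norm(dim=-1).reshape(H, W)

    # Cosine similarity between the central patch and all patches (panels 5 and 9).
    center = flat[(H // 2) * W + (W // 2)]
    sim_map = F.cosine_similarity(flat, center.unsqueeze(0), dim=-1).reshape(H, W)

    return cluster_map, norm_map, sim_map
```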

Main Results and Checkpoints

VOC Evaluation Results

Model mIoU aAcc mAcc Logfile
MAE 50.24 88.02 63.15 log
MAE + DVT 50.53 88.06 63.29 log
DINO 63.00 91.38 76.35 log
DINO + DVT 66.22 92.41 78.14 log
Registers 83.64 96.31 90.67 log
Registers + DVT 84.50 96.56 91.45 log
DeiT3 70.62 92.69 81.23 log
DeiT3 + DVT 73.36 93.34 83.74 log
EVA 71.52 92.76 82.95 log
EVA + DVT 73.15 93.43 83.55 log
CLIP 77.78 94.74 86.57 log
CLIP + DVT 79.01 95.13 87.48 log
DINOv2 83.60 96.30 90.82 log
DINOv2 + DVT 84.84 96.67 91.70 log

ADE20K Evaluation Results

Model mIoU aAcc mAcc Logfile
MAE 23.60 68.54 31.49 log
MAE + DVT 23.62 68.58 31.25 log
DINO 31.03 73.56 40.33 log
DINO + DVT 32.40 74.53 42.01 log
Registers 48.22 81.11 60.52 log
Registers + DVT 49.34 81.94 61.70 log
DeiT3 32.73 72.61 42.81 log
DeiT3 + DVT 36.57 74.44 49.01 log
EVA 37.45 72.78 49.74 log
EVA + DVT 37.87 75.02 49.81 log
CLIP 40.51 76.44 52.47 log
CLIP + DVT 41.10 77.41 53.07 log
DINOv2 47.29 80.84 59.18 log
DINOv2 + DVT 48.66 81.89 60.24 log

NYU-D Evaluation Results

Model RMSE Rel Logfile
MAE 0.6695 0.2334 log
MAE + DVT 0.7080 0.2560 log
DINO 0.5832 0.1701 log
DINO + DVT 0.5780 0.1731 log
Registers 0.3969 0.1190 log
Registers + DVT 0.3880 0.1157 log
DeiT3 0.588 0.1788 log
DeiT3 + DVT 0.5891 0.1802 log
EVA 0.6446 0.1989 log
EVA + DVT 0.6243 0.1964 log
CLIP 0.5598 0.1679 log
CLIP + DVT 0.5591 0.1667 log
DINOv2 0.4034 0.1238 log
DINOv2 + DVT 0.3943 0.1200 log

Denoiser Checkpoints

[ ] To be released.