
DVT: Denoising Vision Transformers

2024-01-19: We will release our denoiser checkpoints within two weeks.

2024-02-05: Feel free to check out the preliminary checkpoints available here. These denoisers were trained on the 10k VOC samples denoised in stage 1 and are the ones evaluated in the paper. We plan to update the instructions for using these checkpoints soon; for now, please refer to the video_generation.py script for an example of their usage. We are still working on checkpoints trained on ImageNet and will share them later.
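Until the detailed instructions are posted, a minimal sketch for inspecting a downloaded checkpoint might look like the following. The file name and checkpoint layout here are assumptions on our part; video_generation.py shows the actual usage.

```python
import torch

# Hypothetical file name; substitute whichever checkpoint you downloaded.
ckpt_path = "checkpoints/dvt_denoiser_voc.pth"

# Load on CPU so this works without a GPU.
state = torch.load(ckpt_path, map_location="cpu")

# The layout is an assumption: it may be a raw state_dict or a wrapper dict.
# Print the top-level keys and tensor shapes to find out.
for key, value in list(state.items())[:10]:
    shape = tuple(value.shape) if torch.is_tensor(value) else type(value).__name__
    print(key, shape)
```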


This is the official code release for

Denoising Vision Transformers.

by Jiawei Yang†*, Katie Z Luo*, Jiefeng Li, Kilian Q. Weinberger, Yonglong Tian, and Yue Wang

Paper | arXiv | Project Page

* equal technical contribution † project lead


Abstract

We delve into a nuanced but significant challenge inherent to Vision Transformers (ViTs): feature maps of these models exhibit grid-like artifacts, which detrimentally hurt the performance of ViTs in downstream tasks. Our investigations trace this fundamental issue down to the positional embeddings at the input stage. To address this, we propose a novel noise model, which is universally applicable to all ViTs. Specifically, the noise model dissects ViT outputs into three components: a semantics term free from noise artifacts and two artifact-related terms that are conditioned on pixel locations. Such a decomposition is achieved by enforcing cross-view feature consistency with neural fields. This per-image optimization process extracts artifact-free features from raw ViT outputs, providing clean ViT features for offline applications. Furthermore, we introduce a learnable denoiser to predict artifact-free features directly from unprocessed ViT outputs, capable of generalizing to unseen data without the need for per-image optimization. Our two-stage approach, which we term Denoising Vision Transformers (DVT), does not require re-training existing pre-trained ViTs, and is immediately applicable to any Transformer-based architecture. We evaluate our method on a variety of representative ViTs (DINO, MAE, DeiT-III, EVA02, CLIP, DINOv2, DINOv2-reg). Extensive evaluations demonstrate that our DVT consistently and significantly improves existing state-of-the-art general-purpose models in semantic and geometric tasks across multiple datasets (e.g., +3.84 mIoU). We hope our study will encourage a re-evaluation of ViT design, especially regarding the naive use of positional embeddings.
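As a rough schematic of the decomposition described above (the notation below is ours for illustration, not the paper's exact formulation):

```latex
% Illustrative notation (ours): f(x) is the artifact-free semantics term,
% g is the shared artifact that depends only on patch position, and
% h is a residual artifact term depending on both content and position.
\[
\mathrm{ViT}(x)_{i} \;\approx\; f(x)_{i} \;+\; g(i) \;+\; h(x, i),
\qquad i = \text{patch location in the feature map.}
\]
```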

TL;DR:

We identify pervasive grid-like artifacts in ViT feature maps caused by positional embeddings and propose a two-stage approach to remove them, which significantly improves the feature quality of various pre-trained ViTs.

Citation

@article{yang2024denoising,
  author = {Jiawei Yang and Katie Z Luo and Jiefeng Li and Kilian Q Weinberger and Yonglong Tian and Yue Wang},
  title = {Denoising Vision Transformers},
  journal = {arXiv preprint arXiv:2401.02957},
  year = {2024},
}

Installation

  1. Create a conda environment.
conda create -n dvt python=3.9 -y
  2. Activate the environment.
conda activate dvt
  3. Install dependencies from requirements.txt.
pip install -r requirements.txt
  4. Install tiny-cuda-nn manually:
pip install git+https://github.com/NVlabs/tiny-cuda-nn/#subdirectory=bindings/torch

If you encounter the error nvcc fatal : Unsupported gpu architecture compute_89, try the following command:

TCNN_CUDA_ARCHITECTURES=86 pip install git+https://github.com/NVlabs/tiny-cuda-nn/#subdirectory=bindings/torch

If you encounter the error parameter packs not expanded with ‘...’, refer to this solution on GitHub.
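After installation, a quick sanity check along these lines can confirm that the tiny-cuda-nn CUDA kernels load. This is a minimal sketch: it assumes a CUDA-capable GPU, and the encoding config values are arbitrary.

```python
import torch
import tinycudann as tcnn  # installed in step 4

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())

# Build a small hash-grid encoding to verify that the CUDA extension works.
encoding = tcnn.Encoding(
    n_input_dims=2,
    encoding_config={
        "otype": "HashGrid",
        "n_levels": 4,
        "n_features_per_level": 2,
        "log2_hashmap_size": 15,
        "base_resolution": 16,
        "per_level_scale": 1.5,
    },
)
x = torch.rand(8, 2, device="cuda")
print("encoded shape:", tuple(encoding(x).shape))
```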

Data preparation

  1. PASCAL-VOC 2007 and 2012: Please download the PASCAL VOC07 and PASCAL VOC12 datasets (link) and put the data in the folder data, e.g.,
mkdir -p data
cd data
wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtrainval_06-Nov-2007.tar
tar -xf VOCtrainval_06-Nov-2007.tar
wget http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar
tar -xf VOCtrainval_11-May-2012.tar

In the experiments reported in the paper, we used the first 10,000 examples from data/voc_train.txt for stage-1 denoising. This text file was generated by gathering all JPG images from data/VOC2007/JPEGImages and data/VOC2012/JPEGImages, excluding the validation images, and then randomly shuffling them (a sketch of such a script is shown below).
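A minimal sketch of such a script follows. The directory layout and the use of the segmentation val splits to define "validation images" are assumptions, and the resulting shuffle will not exactly reproduce the released data/voc_train.txt.

```python
import random
from pathlib import Path

# Assumed layout; adjust if your extraction kept the VOCdevkit/ prefix.
image_dirs = [Path("data/VOC2007/JPEGImages"), Path("data/VOC2012/JPEGImages")]
val_split_files = [
    Path("data/VOC2007/ImageSets/Segmentation/val.txt"),
    Path("data/VOC2012/ImageSets/Segmentation/val.txt"),
]

# Collect image IDs to exclude (assumed here: the segmentation val splits).
val_ids = set()
for split_file in val_split_files:
    if split_file.exists():
        val_ids.update(split_file.read_text().split())

# Gather all remaining JPG images and shuffle them.
images = [
    str(p)
    for d in image_dirs
    for p in sorted(d.glob("*.jpg"))
    if p.stem not in val_ids
]
random.shuffle(images)

Path("data/voc_train.txt").write_text("\n".join(images) + "\n")
print(f"Wrote {len(images)} image paths to data/voc_train.txt")
```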

  2. ADE20K: [legacy, need to check] Please download the ADE20K dataset and put the data in data/ADEChallengeData2016.

  3. NYU-D: Please download the NYU-depth dataset and put the data in data/nyu. Results are reported using the 2014 annotations, following previous works.

  4. ImageNet (Optional):

Run the code

See sample_scripts for examples of running the code.

We provide some demo outputs in demo/demo_outputs. For example, the figure there shows our denoising results on a cat image. From left to right, we show: (1) the input crop, (2) the raw DINOv2 base output, (3) K-means clustering of the raw output, (4) the L2 feature norm of the raw output, (5) the similarity between the central patch and other patches in the raw output, (6) our denoised output, (7) K-means clustering of the denoised output, (8) the L2 feature norm of the denoised output, (9) the similarity between the central patch and other patches in the denoised output, (10) the decomposed shared artifacts, (11) the L2 norm of the shared artifacts, (12) the ground-truth residual error, (13) the predicted residual term, and (14) the composition of the shared artifacts and the predicted residual term.
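To reproduce this style of visualization from your own feature maps, the three per-panel computations (K-means clustering, per-patch L2 norm, and cosine similarity to the central patch) can be sketched as follows. The (H, W, C) feature layout is an assumption, and this is not the repo's actual plotting code.

```python
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def feature_panels(feats: torch.Tensor, n_clusters: int = 8):
    """feats: (H, W, C) patch features from a ViT (raw or denoised)."""
    H, W, C = feats.shape
    flat = feats.reshape(-1, C)

    # K-means clustering of the patch features (panels 3 and 7).
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(flat.cpu().numpy())
    cluster_map = labels.reshape(H, W)

    # Per-patch L2 feature norm (panels 4 and 8).
    norm_map = flat.norm(dim=-1).reshape(H, W)

    # Cosine similarity between the central patch and all patches (panels 5 and 9).
    center = flat[(H // 2) * W + (W // 2)]
    sim_map = F.cosine_similarity(flat, center.unsqueeze(0), dim=-1).reshape(H, W)

    return cluster_map, norm_map, sim_map
```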

Main Results and Checkpoints

VOC Evaluation Results

Model mIoU aAcc mAcc Logfile
MAE 50.24 88.02 63.15 log
MAE + DVT 50.53 88.06 63.29 log
DINO 63.00 91.38 76.35 log
DINO + DVT 66.22 92.41 78.14 log
Registers 83.64 96.31 90.67 log
Registers + DVT 84.50 96.56 91.45 log
DeiT3 70.62 92.69 81.23 log
DeiT3 + DVT 73.36 93.34 83.74 log
EVA 71.52 92.76 82.95 log
EVA + DVT 73.15 93.43 83.55 log
CLIP 77.78 94.74 86.57 log
CLIP + DVT 79.01 95.13 87.48 log
DINOv2 83.60 96.30 90.82 log
DINOv2 + DVT 84.84 96.67 91.70 log

ADE20K Evaluation Results

Model mIoU aAcc mAcc Logfile
MAE 23.60 68.54 31.49 log
MAE + DVT 23.62 68.58 31.25 log
DINO 31.03 73.56 40.33 log
DINO + DVT 32.40 74.53 42.01 log
Registers 48.22 81.11 60.52 log
Registers + DVT 49.34 81.94 61.70 log
DeiT3 32.73 72.61 42.81 log
DeiT3 + DVT 36.57 74.44 49.01 log
EVA 37.45 72.78 49.74 log
EVA + DVT 37.87 75.02 49.81 log
CLIP 40.51 76.44 52.47 log
CLIP + DVT 41.10 77.41 53.07 log
DINOv2 47.29 80.84 59.18 log
DINOv2 + DVT 48.66 81.89 60.24 log

NYU-D Evaluation Results

Model RMSE Rel Logfile
MAE 0.6695 0.2334 log
MAE + DVT 0.7080 0.2560 log
DINO 0.5832 0.1701 log
DINO + DVT 0.5780 0.1731 log
Registers 0.3969 0.1190 log
Registers + DVT 0.3880 0.1157 log
DeiT3 0.588 0.1788 log
DeiT3 + DVT 0.5891 0.1802 log
EVA 0.6446 0.1989 log
EVA + DVT 0.6243 0.1964 log
CLIP 0.5598 0.1679 log
CLIP + DVT 0.5591 0.1667 log
DINOv2 0.4034 0.1238 log
DINOv2 + DVT 0.3943 0.1200 log

Denoiser Checkpoints

[ ] To be released.