/from-EVP

Metric depth estimation from a single image and referring segmentation

Primary LanguagePythonMIT LicenseMIT

EVP

PWC
PWC
PWC

by Mykola Lavreniuk, Shariq Farooq Bhat, Matthias Müller, Peter Wonka

This repository contains PyTorch implementation for paper "EVP: Enhanced Visual Perception using Inverse Multi-Attentive Feature Refinement and Regularized Image-Text Alignment".

Currently it contains only inference code and trained models. The training code for reproduction will be added soon.

EVP (Enhanced Visual Perception) builds on the previous work VPD which paved the way to use the Stable Diffusion network for computer vision tasks.

intro

Installation

Clone this repo, and run

git submodule init
git submodule update

Download the checkpoint of stable-diffusion (we use v1-5 by default) and put it in the checkpoints folder. Please also follow the instructions in stable-diffusion to install the required packages.

Referring Image Segmentation with EVP

EVP achieves 76.35 overall IoU and 77.61 mean IoU on the validation set of RefCOCO.

Please check refer.md for detailed instructions on training and inference.

Depth Estimation with EVP

EVP obtains 0.224 RMSE on NYUv2 depth estimation benchmark, establishing the new state-of-the-art.

RMSE d1 d2 d3 REL log_10
EVP 0.224 0.976 0.997 0.999 0.061 0.027

EVP obtains 0.048 REL and 0.136 SqREL on KITTI depth estimation benchmark, establishing the new state-of-the-art.

REL SqREL RMSE RMSE log d1 d2 d3
EVP 0.048 0.136 2.015 0.073 0.980 0.998 1.000

Please check depth.md for detailed instructions on training and inference.

License

MIT License

Acknowledgements

This code is based on stable-diffusion, mmsegmentation, LAVT, MIM-Depth-Estimation and VPD

Citation

If you find our work useful in your research, please consider citing:

@misc{lavreniuk2023evp,
  url = {https://arxiv.org/abs/2312.08548},
  author = {Lavreniuk, Mykola and Bhat, Shariq Farooq and Müller, Matthias and Wonka, Peter},
  title = {EVP: Enhanced Visual Perception using Inverse Multi-Attentive Feature Refinement and Regularized Image-Text Alignment},
  publisher = {arXiv},
  year = {2023},
}