RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.LongTensor [2, 14]] is at version 3; expected version 2 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True)
Installing torch 1.10 resolved this. The full environment is recorded in the env.yml file.
RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc
Cause: mismatched PyTorch, CUDA and Python versions.
Fix: reinstall transformers.
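When chasing a version mismatch like the one above, it can help to print what is actually installed before reinstalling anything. The following is a minimal sketch using only the standard library; the helper name `installed_versions` and the default package list are illustrative assumptions, not part of the MDETR codebase.

```python
from importlib import metadata  # standard library in Python 3.8+
import platform


def installed_versions(packages=("torch", "torchvision", "transformers")):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    print("python", platform.python_version())
    for name, version in installed_versions().items():
        print(name, version or "not installed")
    try:
        import torch  # optional: only needed to read the CUDA build version
        print("cuda (torch build)", torch.version.cuda)
    except ImportError:
        pass
```

Comparing this output against the versions pinned in env.yml or requirements.txt is usually the quickest way to spot which package drifted.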
RuntimeError: Deterministic behavior was enabled with either torch.use_deterministic_algorithms(True) or at::Context::setDeterministicAlgorithms(true), but this operation is not deterministic because it uses CuBLAS and you have CUDA >= 10.2. To enable deterministic behavior in this case, you must set an environment variable before running your PyTorch application: CUBLAS_WORKSPACE_CONFIG=:4096:8 or CUBLAS_WORKSPACE_CONFIG=:16:8. For more information, go to https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility
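The error message asks for CUBLAS_WORKSPACE_CONFIG to be set before the application runs. One way to do that from Python, rather than from the shell, is sketched below. The helper name `set_cublas_workspace` is a hypothetical choice for illustration, and the sketch assumes it runs at the very top of the entry script, before any CUDA work starts.

```python
import os

# The two values named in the error message.
ALLOWED = (":4096:8", ":16:8")


def set_cublas_workspace(value=":4096:8"):
    """Set CUBLAS_WORKSPACE_CONFIG unless it is already set in the environment."""
    if value not in ALLOWED:
        raise ValueError(f"expected one of {ALLOWED}, got {value!r}")
    # setdefault keeps a value exported by the shell and returns whichever value is in effect.
    return os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", value)


# Run this first in the entry script.
set_cublas_workspace()

# Only afterwards (left commented so the sketch runs without PyTorch installed):
# import torch
# torch.use_deterministic_algorithms(True)
```

Exporting the variable in the shell before launching the script has the same effect and avoids any question of import order.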
This repository contains code and links to pre-trained models for MDETR (Modulated DETR) for pre-training on data having aligned text and images with box annotations, as well as fine-tuning on tasks requiring fine-grained understanding of image and text.
We show big gains on the phrase grounding task (Flickr30k), Referring Expression Comprehension (RefCOCO, RefCOCO+ and RefCOCOg) as well as Referring Expression Segmentation (PhraseCut, CLEVR Ref+). We also achieve competitive performance on visual question answering (GQA, CLEVR).
TL;DR. We depart from the fixed frozen object detector approach of several popular vision + language pre-trained models and achieve true end-to-end multi-modal understanding by training our detector in the loop. In addition, we only detect objects that are relevant to the given text query, where the class labels for the objects are just the relevant words in the text query. This allows us to expand our vocabulary to anything found in free-form text, making it possible to detect and reason over novel combinations of object classes and attributes.
For details, please see the paper: MDETR - Modulated Detection for End-to-End Multi-Modal Understanding by Aishwarya Kamath, Mannat Singh, Yann LeCun, Ishan Misra, Gabriel Synnaeve and Nicolas Carion.
Aishwarya Kamath and Nicolas Carion made equal contributions to this codebase.
The requirements file has all the dependencies that are needed by MDETR.
We provide instructions on how to install dependencies via conda. First, clone the repository locally:
git clone https://github.com/ashkamath/mdetr.git
Make a new conda env and activate it:
conda create -n mdetr_env python=3.8
conda activate mdetr_env
Install the packages in requirements.txt:
pip install -r requirements.txt
Multinode training
Distributed training is available via Slurm and submitit:
pip install submitit
The links to data, steps for data preparation and script for running finetuning can be found in Pretraining Instructions. We also provide the pre-trained model weights for MDETR trained on our combined aligned dataset of 1.3 million images paired with text.
The models are summarized in the following table. Note that the performance reported is "raw", without any fine-tuning. For each dataset, we report the class-agnostic box AP@50, which measures how well the model finds the boxes mentioned in the text. All performances are reported on the respective validation sets of each dataset.
No. | Backbone | GQA AP | Flickr AP | Flickr R@1 | Refcoco AP | Refcoco R@1 | Refcoco+ R@1 | Refcocog R@1 | Url | Size |
---|---|---|---|---|---|---|---|---|---|---|
1 | R101 | 58.9 | 75.6 | 82.5 | 60.3 | 72.1 | 58.0 | 55.7 | model | 3GB |
2 | ENB3 | 59.5 | 76.6 | 82.9 | 57.6 | 70.2 | 56.7 | 53.8 | model | 2.4GB |
3 | ENB5 | 59.9 | 76.4 | 83.7 | 61.8 | 73.4 | 58.8 | 57.1 | model | 2.7GB |
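The "@50" in AP@50 refers to an intersection-over-union (IoU) threshold of 0.5 between a predicted box and a ground-truth box, and "class-agnostic" means the label is ignored when matching. The sketch below shows only that matching test, assuming boxes in (x_min, y_min, x_max, y_max) format. The function names are illustrative, and the full AP computation (ranking by confidence and integrating the precision-recall curve) is not shown.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match_at_50(pred_box, gt_box):
    """Class-agnostic match: only the overlap matters, not the label."""
    return box_iou(pred_box, gt_box) >= 0.5


print(box_iou((0, 0, 10, 10), (0, 0, 10, 5)))           # 0.5
print(is_match_at_50((0, 0, 10, 10), (5, 5, 15, 15)))   # False
```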
Instructions for data preparation and script to run evaluation can be found at Flickr30k Instructions
AnyBox protocol
Backbone | Pre-training Image Data | Val R@1 | Val R@5 | Val R@10 | Test R@1 | Test R@5 | Test R@10 | url | size |
---|---|---|---|---|---|---|---|---|---|
Resnet-101 | COCO+VG+Flickr | 82.5 | 92.9 | 94.9 | 83.4 | 93.5 | 95.3 | model | 3GB |
EfficientNet-B3 | COCO+VG+Flickr | 82.9 | 93.2 | 95.2 | 84.0 | 93.8 | 95.6 | model | 2.4GB |
EfficientNet-B5 | COCO+VG+Flickr | 83.6 | 93.4 | 95.1 | 84.3 | 93.9 | 95.8 | model | 2.7GB |
MergedBox protocol
Backbone | Pre-training Image Data | Val R@1 | Val R@5 | Val R@10 | Test R@1 | Test R@5 | Test R@10 | url | size |
---|---|---|---|---|---|---|---|---|---|
Resnet-101 | COCO+VG+Flickr | 82.3 | 91.8 | 93.7 | 83.8 | 92.7 | 94.4 | model | 3GB |
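The R@1, R@5 and R@10 columns are Recall@k scores. As a rough illustration, the sketch below assumes the common convention that a phrase counts as correct if any of its top-k predicted boxes reaches IoU >= 0.5 with one of its ground-truth boxes. How multiple ground-truth boxes per phrase are handled is an assumption here, and the repository's own evaluation script is the authority on the exact rule. All names are hypothetical.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_k(predictions, ground_truths, k, iou_threshold=0.5):
    """Fraction of phrases with at least one matching box among the top k.

    predictions[i]:   boxes for phrase i, sorted by descending confidence.
    ground_truths[i]: list of ground-truth boxes for phrase i.
    """
    if not predictions:
        return 0.0
    hits = 0
    for ranked, gts in zip(predictions, ground_truths):
        if any(box_iou(p, g) >= iou_threshold for p in ranked[:k] for g in gts):
            hits += 1
    return hits / len(predictions)


# Two phrases: the first is found at rank 1, the second only at rank 2.
preds = [
    [(0, 0, 10, 10), (20, 20, 30, 30)],
    [(50, 50, 60, 60), (0, 0, 4, 4)],
]
gts = [[(0, 0, 10, 10)], [(0, 0, 4, 4)]]
print(recall_at_k(preds, gts, k=1))  # 0.5
print(recall_at_k(preds, gts, k=2))  # 1.0
```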
Instructions for data preparation and script to run finetuning and evaluation can be found at Referring Expression Instructions
RefCOCO
Backbone | Pre-training Image Data | Val | TestA | TestB | url | size |
---|---|---|---|---|---|---|
Resnet-101 | COCO+VG+Flickr | 86.75 | 89.58 | 81.41 | model | 3GB |
EfficientNet-B3 | COCO+VG+Flickr | 87.51 | 90.40 | 82.67 | model | 2.4GB |
RefCOCO+
Backbone | Pre-training Image Data | Val | TestA | TestB | url | size |
---|---|---|---|---|---|---|
Resnet-101 | COCO+VG+Flickr | 79.52 | 84.09 | 70.62 | model | 3GB |
EfficientNet-B3 | COCO+VG+Flickr | 81.13 | 85.52 | 72.96 | model | 2.4GB |
RefCOCOg
Backbone | Pre-training Image Data | Val | Test | url | size |
---|---|---|---|---|---|
Resnet-101 | COCO+VG+Flickr | 81.64 | 80.89 | model | 3GB |
EfficientNet-B3 | COCO+VG+Flickr | 83.35 | 83.31 | model | 2.4GB |
Instructions for data preparation and script to run finetuning and evaluation can be found at PhraseCut Instructions
Backbone | M-IoU | Precision @0.5 | Precision @0.7 | Precision @0.9 | url | size |
---|---|---|---|---|---|---|
Resnet-101 | 53.1 | 56.1 | 38.9 | 11.9 | model | 1.5GB |
EfficientNet-B3 | 53.7 | 57.5 | 39.9 | 11.9 | model | 1.2GB |
Instructions for data preparation and scripts to run finetuning and evaluation can be found at GQA Instructions
Backbone | Test-dev | Test-std | url | size |
---|---|---|---|---|
Resnet-101 | 62.48 | 61.99 | model | 3GB |
EfficientNet-B5 | 62.95 | 62.45 | model | 2.7GB |
Instructions for data preparation and scripts to run finetuning and evaluation can be found at LVIS Instructions
Data | AP | AP50 | APr | APc | APf | url | size |
---|---|---|---|---|---|---|---|
1% | 16.7 | 25.8 | 11.2 | 14.6 | 19.5 | model | 3GB |
10% | 24.2 | 38.0 | 20.9 | 24.9 | 24.3 | model | 3GB |
100% | 22.5 | 35.2 | 7.4 | 22.7 | 25.0 | model | 3GB |
Instructions to reproduce our results on CLEVR-based datasets are available at CLEVR instructions
Overall Accuracy | Count | Exist | Compare Number | Query Attribute | Compare Attribute | Url | Size |
---|---|---|---|---|---|---|---|
99.7 | 99.3 | 99.9 | 99.4 | 99.9 | 99.9 | model | 446MB |
MDETR is released under the Apache 2.0 license. Please see the LICENSE file for more information.
If you find this repository useful, please give it a star and cite as follows! :)
@article{kamath2021mdetr,
title={MDETR--Modulated Detection for End-to-End Multi-Modal Understanding},
author={Kamath, Aishwarya and Singh, Mannat and LeCun, Yann and Misra, Ishan and Synnaeve, Gabriel and Carion, Nicolas},
journal={arXiv preprint arXiv:2104.12763},
year={2021}
}