IDEA-Research/detrex

Frequently Asked Questions

rentainhe opened this issue · 12 comments

We keep this issue open to collect frequently asked questions and their solutions from the users.

Feel free to leave your comment here if you find any frequent issues and have ways to help others to solve them.

Notes

  • If you meet convergence problems when training with fewer GPUs, it is better to use a larger batch size (for example, 8 or 16) by setting dataloader.train.total_batch_size, as mentioned in this issue: #219
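
As a rough illustration of the note above: detectron2-style loaders split the total batch size evenly across GPUs, so with fewer GPUs each one has to hold more images per iteration. The helper below is hypothetical (it is not part of detrex) and only shows the arithmetic; the config key it refers to in the comment is the dataloader.train.total_batch_size mentioned above.

```python
def per_gpu_batch_size(total_batch_size: int, num_gpus: int) -> int:
    """Images each GPU processes per iteration (hypothetical helper)."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    if total_batch_size % num_gpus != 0:
        # The total is split evenly, so it must be divisible by the GPU count.
        raise ValueError("total_batch_size must be divisible by num_gpus")
    return total_batch_size // num_gpus


# In a detrex config the total would be set with something like:
#   dataloader.train.total_batch_size = 16
for gpus in (8, 4, 2):
    print(f"{gpus} GPUs -> {per_gpu_batch_size(16, gpus)} images per GPU")
```

Keeping the total at 8 or 16 on fewer GPUs therefore raises the per-GPU memory requirement, which is worth checking before launching a long run.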

FAQs

1. ImportError: Cannot import 'detrex._C', therefore 'MultiScaleDeformableAttention' is not available.

detrex needs the CUDA runtime to build the MultiScaleDeformableAttention operator. In most cases, you do not need to set the CUDA_HOME environment variable if CUDA is installed correctly; the default path of the CUDA runtime is /usr/local/cuda. If you find that your CUDA_HOME is None, you can solve it as follows:

  • If you have already installed the CUDA runtime in your environment, set the environment variable (here we take cuda-11.3 as an example):
export CUDA_HOME=/path/to/cuda-11.3/
  • If you cannot find the CUDA runtime in your environment, install it by following the CUDA Toolkit Installation guide, then set the environment variable CUDA_HOME.
  • After setting CUDA_HOME, rebuild detrex by running pip install -e .

You can also refer to these issues for more details: #98, #85
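
Before rebuilding, it can help to check where a CUDA install would be picked up from. The sketch below is a standard-library approximation of the usual lookup order (environment variable, then nvcc on PATH, then /usr/local/cuda). The function name guess_cuda_home is hypothetical, and the order is an assumption loosely modeled on how PyTorch's extension builder searches, not a copy of it.

```python
import os
import shutil
from typing import Mapping, Optional


def guess_cuda_home(
    env: Mapping[str, str] = os.environ,
    default: str = "/usr/local/cuda",
) -> Optional[str]:
    """Best-effort guess of the CUDA install directory (hypothetical helper)."""
    # 1. An explicitly set environment variable wins.
    home = env.get("CUDA_HOME") or env.get("CUDA_PATH")
    if home:
        return home
    # 2. Otherwise look for nvcc on PATH; it normally lives in <cuda_home>/bin/nvcc.
    nvcc = shutil.which("nvcc", path=env.get("PATH"))
    if nvcc:
        return os.path.dirname(os.path.dirname(os.path.realpath(nvcc)))
    # 3. Fall back to the default location if that directory exists.
    if os.path.isdir(default):
        return default
    return None


print("CUDA_HOME guess:", guess_cuda_home())
```

If this prints None, the build is unlikely to find CUDA either. With PyTorch installed, the value it actually resolved can be inspected via torch.utils.cpp_extension.CUDA_HOME.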

2. How to not filter empty annotations during training.

There are three ways to avoid filtering empty annotations during training.

  1. Modify the config in configs/common/data/coco_detr.py as follows:
dataloader.train = L(build_detection_train_loader)(
    dataset=L(get_detection_dataset_dicts)(names="coco_2017_train", filter_empty=False),
    ...,
)
  2. Modify the config in your project, such as dino_r50_4scale_24ep.py:
# your config.py
from detrex.config import get_config

dataloader = get_config("common/data/coco_detr.py").dataloader

# modify dataloader config
# do not filter empty annotations during training
dataloader.train.dataset.filter_empty = False
  3. Override the config from the command line when launching training:
cd detrex
python tools/train_net.py --config-file projects/dino/configs/path/to/config.py --num-gpus 8 dataloader.train.dataset.filter_empty=False

You can also refer to these issues for more details: #78 (comment)
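
To make the effect of the flag concrete, here is a simplified stand-in for what filter_empty=True does to a list of dataset records: images with no usable (non-crowd) annotations are dropped before training. The function name and the sample records are made up for illustration; this is a sketch of the behavior, not the detectron2 implementation.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def drop_images_without_objects(dataset_dicts: List[Record]) -> List[Record]:
    """Keep only records with at least one non-crowd annotation (sketch)."""

    def has_object(record: Record) -> bool:
        return any(
            ann.get("iscrowd", 0) == 0 for ann in record.get("annotations", [])
        )

    return [record for record in dataset_dicts if has_object(record)]


# Made-up records in a COCO-like dict format.
records = [
    {"file_name": "a.jpg", "annotations": [{"bbox": [0, 0, 10, 10], "iscrowd": 0}]},
    {"file_name": "b.jpg", "annotations": []},  # background-only image
    {"file_name": "c.jpg", "annotations": [{"bbox": [5, 5, 20, 20], "iscrowd": 1}]},
]

kept = drop_images_without_objects(records)
print([record["file_name"] for record in kept])  # ['a.jpg']
```

With filter_empty=False, records like b.jpg stay in the training set, which is what you want when background-only images should act as negatives.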

3. RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:54980 (errno: 98 - Address already in use).

This means that a process you started earlier did not exit correctly. There are two solutions:

  1. Kill the earlier process completely.
  2. Change the port by setting --dist-url:
python tools/train_net.py \
    --config-file path/to/config.py \
    --num-gpus 8 \
    --dist-url tcp://127.0.0.1:12345
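
If you would rather not guess a port number, one option is to ask the operating system for a free one and pass it to --dist-url. This is a minimal sketch using only the standard library; find_free_port is a hypothetical helper name, and there is a small race window because another process could grab the port between this check and the start of training.

```python
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a currently unused TCP port on the given host."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Binding to port 0 lets the OS choose any free port.
        sock.bind((host, 0))
        return sock.getsockname()[1]


port = find_free_port()
dist_url = f"tcp://127.0.0.1:{port}"
print(dist_url)  # pass this value to --dist-url
```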

4. DINO CPU inference

Please refer to this PR #157 for more details.

5. Training coco-like custom dataset

Please refer to this PR #186 for more details.

This should be added to the FAQ in the installation docs.

Thanks for your advice~ we will update the document later~

hg6185 commented

Hello,
I'm trying to install detrex on an hpc with Nvidia V100. I managed to set the path CUDA_HOME to path/CUDA/11.8.0

When I run pip install -e . again, I'm getting the following warning and error:

warning: nvcc warning : incompatible redefinition for option 'std', the last value of this option was used (I think this relates to one argument -std=c++17)

error:
/.../miniconda3/envs/fps-bm/lib/python3.10/site-packages/torch/include/c10/util/Half.h(73): error: identifier "_castu32_f32" is undefined

/.../miniconda3/envs/fps-bm/lib/python3.10/site-packages/torch/include/c10/util/Half.h(89): error: identifier "_castf32_u32" is undefined

2 errors detected in the compilation of "/.../detrex/detrex/layers/csrc/DCNv3/dcnv3_cuda.cu".
error: command '.../software/CUDA/11.8.0/bin/nvcc' failed with exit code 2

Did you ever encounter this and do you know a fix?
My gcc is 11.3 and supports c++17
Thanks in advance

Hello @hg6185

It seems the dcn_v3 operator is not suitable for this environment. You can try one of these two approaches:

  • search the InternImage repo for related issues to see whether the same problem has been reported
  • remove this operator if you do not need to benchmark your model with the InternImage backbone, then recompile detrex

This is InternImage's official repo: https://github.com/OpenGVLab/InternImage

It seems they already provide a Python package for this operator: https://github.com/OpenGVLab/InternImage/releases/tag/whl_files

We will update detrex soon to remove the compilation step for this operator.

hg6185 commented

Thanks for the quick reply @rentainhe!
Unfortunately, that's not the thing. I removed and reinstalled everything including detectron2 which now cannot be installed due to the same issue.
It seems to be a problem with c++ imports in PyTorch.

I'm sorry to hear that. I suggest you could try lowering the PyTorch version to see if it helps to bypass this issue. @hg6185

hg6185 commented

Hi again @rentainhe ,
I found the problem. The GCC version was incompatible with CUDA. Note that you should have a GCC version below 10.
In my case, everything works fine with CUDA 11.3.1 and GCC 9.4.0. Thanks again for the quick support!

Would you like to add this situation in our FAQs here: #109 (comment)

hg6185 commented

Hi @rentainhe ,

I can add this, but what do you mean? :D
Do you want me to write a comment that makes a little summary, so you can delete the rest?

Yes, I was wondering if it's better to add it to somewhere or just keep our conversation here to help others who have met the same problem

hg6185 commented

hi @rentainhe
a summary of what fixed issue 1 for me: The 'latest' Detectron2 release requires a GCC version lower than 10.0.0. I am working on an HPC where I can load different CUDA and GCC versions, which is practical in this case.

In order to build Detectron2 and detrex, I used a miniconda env with CUDA 11.3.1 and GCC 9.4.0. I use PyTorch 3.8, which can be installed with this command (I post it here because you will have to search for it since it's older):
conda install pytorch torchvision torchaudio pytorch-cuda=11.3 -c pytorch -c nvidia

Don't forget the NVIDIA toolkit matching your version.
Note that some libs like matplotlib needed to be downgraded to match an older GCC and Python version.
In general, you will probably encounter some issues on the way, but I managed to find a solution to all of them.

For instance, if you get an error with pycocotools, pip uninstall it and conda install it instead (from conda-forge).
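
A quick way to check the compiler before building is to read the GCC major version and compare it against the limit reported in this thread. The sketch below uses only the standard library. The helper names are hypothetical, and the threshold of 10 is taken from the experience described above (CUDA 11.3.1 with GCC 9.4.0), not from an official compatibility table.

```python
import re
import shutil
import subprocess


def gcc_major(version_text: str) -> int:
    """Extract the major version from text such as '9.4.0' (hypothetical helper)."""
    match = re.match(r"\s*(\d+)", version_text)
    if match is None:
        raise ValueError(f"cannot parse GCC version from {version_text!r}")
    return int(match.group(1))


def gcc_looks_compatible(version_text: str, max_major_exclusive: int = 10) -> bool:
    """True if the major version is below the assumed limit (10, per this thread)."""
    return gcc_major(version_text) < max_major_exclusive


gcc = shutil.which("gcc")
if gcc is None:
    print("gcc not found on PATH")
else:
    # 'gcc -dumpversion' prints the version number on its own line.
    result = subprocess.run(
        [gcc, "-dumpversion"], capture_output=True, text=True, check=False
    )
    try:
        ok = gcc_looks_compatible(result.stdout)
        print(f"gcc {result.stdout.strip()}: {'ok' if ok else 'too new for this setup'}")
    except ValueError as err:
        print(err)
```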

Thank you so much for summarizing this! It's really useful!