AIR-DISCOVER/DQS3D

For Virtual Environment Configuration

zschanghai opened this issue · 3 comments

Hello, author of DQS3D:

I changed the versions to PyTorch 1.10.2, torchvision 0.11.3, and CUDA 11.3 as you advised, but the project still fails to run.

Could the problem be caused by an incorrect version of MMCV or of another dependency? The Dockerfile I referred to for the configuration is as follows:

FROM pytorch/pytorch:1.8.1-cuda10.2-cudnn7-devel

ENV TORCH_CUDA_ARCH_LIST="6.0 6.1 7.0+PTX"
ENV TORCH_NVCC_FLAGS="-Xfatbin -compress-all"
ENV CMAKE_PREFIX_PATH="$(dirname $(which conda))/../"

RUN apt-key adv --keyserver keyserver.ubuntu.com --recv-keys A4B469963BF863CC && \
    apt-get update && \
    apt-get install -y ffmpeg libsm6 libxext6 git ninja-build libglib2.0-0 libsm6 libxrender-dev libxext6

# Install MMCV, MMDetection and MMSegmentation
RUN pip install mmcv-full==1.3.8 -f https://download.openmmlab.com/mmcv/dist/cu113/torch1.10.2/index.html
RUN pip install mmdet==2.14.0
RUN pip install mmsegmentation==0.14.1

# Install MMDetection3D
RUN conda clean --all
RUN git clone https://github.com/samsunglabs/fcaf3d.git /mmdetection3d
WORKDIR /mmdetection3d
ENV FORCE_CUDA="1"
RUN pip install -r requirements/build.txt
RUN pip install --no-cache-dir -e .

# Install Minkowski Engine
RUN apt-get install -y python3-dev libopenblas-dev
RUN pip install ninja==1.10.2.3
RUN pip install \
    -U git+https://github.com/NVIDIA/MinkowskiEngine@v0.5.4 \
    --install-option="--blas=openblas" \
    --install-option="--force_cuda" \
    -v \
    --no-deps

# Install differentiable IoU
RUN git clone https://github.com/lilanxiao/Rotated_IoU /rotated_iou
WORKDIR /rotated_iou
RUN git checkout 3bdca6b20d981dffd773507e97f1b53641e98d0a
RUN cp -r /rotated_iou/cuda_op /mmdetection3d/mmdet3d/ops/rotated_iou
WORKDIR /mmdetection3d/mmdet3d/ops/rotated_iou/cuda_op
RUN python setup.py install
WORKDIR /mmdetection3d

Yours sincerely!

We hope that updated installation instructions or a new Dockerfile can be provided.

The problem is as follows:

CUDA_VISIBLE_DEVICES=2 bash tools/dist_train.sh configs/fcaf3d/fcaf3d_sunrgbd-3d-10class-r0.05-aug.py 1

/data1/zsch/software/anaconda3/envs/dqs3d/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects --local_rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

warnings.warn(
Traceback (most recent call last):
  File "tools/train.py", line 16, in <module>
    from mmdet3d.apis import train_model
  File "/data1/zsch/project/DQS3D/mmdet3d/apis/__init__.py", line 1, in <module>
    from .inference import (convert_SyncBN, inference_detector,
  File "/data1/zsch/project/DQS3D/mmdet3d/apis/inference.py", line 10, in <module>
    from mmdet3d.core import (Box3DMode, DepthInstance3DBoxes,
  File "/data1/zsch/project/DQS3D/mmdet3d/core/__init__.py", line 1, in <module>
    from .anchor import *  # noqa: F401, F403
  File "/data1/zsch/project/DQS3D/mmdet3d/core/anchor/__init__.py", line 1, in <module>
    from mmdet.core.anchor import build_anchor_generator
  File "/data1/zsch/software/anaconda3/envs/dqs3d/lib/python3.8/site-packages/mmdet/core/__init__.py", line 2, in <module>
    from .bbox import *  # noqa: F401, F403
  File "/data1/zsch/software/anaconda3/envs/dqs3d/lib/python3.8/site-packages/mmdet/core/bbox/__init__.py", line 7, in <module>
    from .samplers import (BaseSampler, CombinedSampler,
  File "/data1/zsch/software/anaconda3/envs/dqs3d/lib/python3.8/site-packages/mmdet/core/bbox/samplers/__init__.py", line 9, in <module>
    from .score_hlr_sampler import ScoreHLRSampler
  File "/data1/zsch/software/anaconda3/envs/dqs3d/lib/python3.8/site-packages/mmdet/core/bbox/samplers/score_hlr_sampler.py", line 2, in <module>
    from mmcv.ops import nms_match
  File "/data1/zsch/software/anaconda3/envs/dqs3d/lib/python3.8/site-packages/mmcv/ops/__init__.py", line 1, in <module>
    from .bbox import bbox_overlaps
  File "/data1/zsch/software/anaconda3/envs/dqs3d/lib/python3.8/site-packages/mmcv/ops/bbox.py", line 3, in <module>
    ext_module = ext_loader.load_ext('_ext', ['bbox_overlaps'])
  File "/data1/zsch/software/anaconda3/envs/dqs3d/lib/python3.8/site-packages/mmcv/utils/ext_loader.py", line 12, in load_ext
    ext = importlib.import_module('mmcv.' + name)
  File "/data1/zsch/software/anaconda3/envs/dqs3d/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
ImportError: /data1/zsch/software/anaconda3/envs/dqs3d/lib/python3.8/site-packages/mmcv/_ext.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZNK2at6Tensor6deviceEv
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 944954) of binary: /data1/zsch/software/anaconda3/envs/dqs3d/bin/python3
Traceback (most recent call last):
  File "/data1/zsch/software/anaconda3/envs/dqs3d/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/data1/zsch/software/anaconda3/envs/dqs3d/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data1/zsch/software/anaconda3/envs/dqs3d/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/data1/zsch/software/anaconda3/envs/dqs3d/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/data1/zsch/software/anaconda3/envs/dqs3d/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/data1/zsch/software/anaconda3/envs/dqs3d/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/data1/zsch/software/anaconda3/envs/dqs3d/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data1/zsch/software/anaconda3/envs/dqs3d/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

tools/train.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2023-05-14_20:58:32
host : hmc37
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 944954)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

c7w commented

I've replied to your email. Is it resolved now?

Maybe you can also try this:

After installing pytorch+cudatoolkit, build mmcv==1.3.8 library from source and install it.
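
One thing that may be worth checking before rebuilding: the Dockerfile quoted in this thread starts from a PyTorch 1.8.1 / CUDA 10.2 base image, while the mmcv-full wheel index it points at is for torch1.10.2 / cu113. A prebuilt mmcv-full wheel generally has to match the installed PyTorch and CUDA versions. The sketch below only compares the two strings; the helper names are made up for illustration, and the parsing assumes the tag and URL layouts that appear in this thread.

```python
import re


def parse_base_image(tag):
    """Return (torch_version, cuda_version) from a pytorch/pytorch image tag."""
    match = re.search(r":(\d+(?:\.\d+)*)-cuda(\d+(?:\.\d+)*)", tag)
    if match is None:
        raise ValueError(f"unrecognised image tag: {tag}")
    return match.group(1), match.group(2)


def parse_mmcv_index(url):
    """Return (torch_version, cuda_version) from an mmcv wheel index URL."""
    match = re.search(r"/cu(\d+)/torch(\d+(?:\.\d+)*)/", url)
    if match is None:
        raise ValueError(f"unrecognised wheel index URL: {url}")
    digits = match.group(1)                  # e.g. "113"
    cuda = f"{digits[:-1]}.{digits[-1]}"     # assumed layout: "113" -> "11.3"
    return match.group(2), cuda


def find_mismatches(image_tag, index_url):
    """List human-readable differences between the image and the wheel index."""
    image_torch, image_cuda = parse_base_image(image_tag)
    wheel_torch, wheel_cuda = parse_mmcv_index(index_url)
    problems = []
    if image_torch != wheel_torch:
        problems.append(f"torch: image has {image_torch}, wheel index targets {wheel_torch}")
    if image_cuda != wheel_cuda:
        problems.append(f"cuda: image has {image_cuda}, wheel index targets {wheel_cuda}")
    return problems


# Values copied from the Dockerfile in this issue.
IMAGE = "pytorch/pytorch:1.8.1-cuda10.2-cudnn7-devel"
INDEX = "https://download.openmmlab.com/mmcv/dist/cu113/torch1.10.2/index.html"

for problem in find_mismatches(IMAGE, INDEX):
    print(problem)
```

If both lines are reported, as they are for the values above, either the base image or the wheel index needs to change so the two agree. Building mmcv from source against the installed PyTorch avoids relying on the index at all.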


I haven't run into this problem myself, but I think it occurs because your mmcv library was not installed correctly. See:

ImportError: /data1/zsch/software/anaconda3/envs/dqs3d/lib/python3.8/site-packages/mmcv/_ext.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZNK2at6Tensor6deviceEv
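
For context, the undefined symbol is a mangled C++ name (Itanium ABI). The sketch below decodes only this simple shape, a nested name taking no arguments, to show what it refers to. `demangle_simple` is a hypothetical helper written for illustration; a real tool such as `c++filt` handles the general case.

```python
import re


def demangle_simple(symbol):
    """Demangle names shaped like _ZN[K]<len><name>...Ev (no-argument functions).

    This is intentionally minimal: it does not handle templates,
    substitutions, or any parameter list other than 'v' (void).
    """
    if not symbol.startswith("_ZN"):
        raise ValueError("not a nested Itanium-mangled name")
    pos = 3
    is_const = symbol[pos:pos + 1] == "K"    # 'K' marks a const member function
    if is_const:
        pos += 1
    parts = []
    while symbol[pos:pos + 1] != "E":        # 'E' closes the nested name
        match = re.match(r"\d+", symbol[pos:])
        if match is None:
            raise ValueError("unsupported or truncated symbol")
        length = int(match.group())
        pos += len(match.group())
        parts.append(symbol[pos:pos + length])
        pos += length
    if symbol[pos + 1:] != "v":              # 'v' means an empty parameter list
        raise ValueError("only no-argument functions are supported")
    return "::".join(parts) + "()" + (" const" if is_const else "")


print(demangle_simple("_ZNK2at6Tensor6deviceEv"))  # at::Tensor::device() const
```

The decoded name, `at::Tensor::device() const`, belongs to PyTorch's ATen library. That is consistent with the mmcv extension having been compiled against a different PyTorch build than the one installed in the environment, although the symbol alone does not say which versions are involved.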

After some searching online, I found this link: open-mmlab/mmdetection#4291 (comment)

Hope it'd help.
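
To confirm what is actually installed in the environment before reinstalling mmcv, a small report like the one below may help. It is only a sketch using the standard library's `importlib.metadata` (Python 3.8+); the package list is taken from the Dockerfile in this thread, and `installed_versions` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as they appear in the pip commands quoted in this issue.
PACKAGES = ["torch", "torchvision", "mmcv-full", "mmdet", "mmsegmentation"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    report = {}
    for name in names:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = None
    return report


for name, found in installed_versions(PACKAGES).items():
    print(f"{name:16s} {found or 'not installed'}")
```

This reads package metadata without importing the packages, so it still works when `import mmcv.ops` itself fails as in the traceback above.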