open-mmlab/mmengine

[Bug] MMDistributedDataParallel has no effect

Opened this issue · 4 comments

Prerequisite

Environment

My environment is:
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
pip install -U openmim -i https://pypi.tuna.tsinghua.edu.cn/simple
mim install mmengine
mim install mmcv==2.1.0
mim install mmdet==3.2.0

I use mmdet3d==1.3.0 to train CenterPoint, and I found the problem described below.
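
A minimal check to confirm which versions are actually active in the environment (a sketch; it only assumes the packages installed above are importable):

# Print the versions that are actually importable in the active environment.
import torch
import mmengine
import mmcv
import mmdet
import mmdet3d

print("torch   :", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("mmengine:", mmengine.__version__)
print("mmcv    :", mmcv.__version__)
print("mmdet   :", mmdet.__version__)
print("mmdet3d :", mmdet3d.__version__)
print("GPUs    :", torch.cuda.device_count())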

When I train with batch=1, gpu=1, one iteration of the model forward takes 48 ms.
When I train with batch=6, gpu=1, one iteration of the model forward takes 630 ms (6 point-cloud samples are inferred in one forward pass). I checked the mmengine code: when only one GPU is used, the model stays the original model, with no MMDataParallel-like wrapper around it (a simplified sketch of this behaviour is included after the timing log below).
When I train with batch=6, gpu=4, the per-iteration forward time is as below; the max is about 800 ms and the min is 270 ms:
loss time is 0.7451076507568359s cuda is 0
loss time is 0.7903275489807129s cuda is 2
loss time is 0.7789270877838135s cuda is 3
loss time is 0.8002550601959229s cuda is 1
loss time is 0.7657649517059326s cuda is 3
loss time is 0.7790787220001221s cuda is 0
loss time is 0.7731163501739502s cuda is 2
loss time is 0.9755470752716064s cuda is 1
loss time is 0.7812612056732178s cuda is 3
loss time is 0.7439537048339844s cuda is 0
loss time is 0.7788212299346924s cuda is 2
loss time is 0.8221783638000488s cuda is 1
loss time is 0.2801499366760254s cuda is 3
loss time is 0.2814157009124756s cuda is 0
loss time is 0.2818455696105957s cuda is 1
loss time is 0.2732048034667969s cuda is 2
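
For context, a sketch of how such per-iteration loss timings can be collected (this is not the actual training code; torch.cuda.synchronize() matters because CUDA kernels run asynchronously and the wall-clock time is misleading without it):

import time
import torch

def timed_loss(model, data, rank):
    # Measure one loss forward; the model(**data, mode='loss') call signature
    # is an assumption based on the usual mmengine BaseModel convention.
    torch.cuda.synchronize()              # flush pending kernels before timing
    start = time.time()
    losses = model(**data, mode='loss')
    torch.cuda.synchronize()              # make sure the forward really finished
    print(f"loss time is {time.time() - start}s cuda is {rank}")
    return losses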
But when I run the same config with mmcv, the time cost is only 200 ms.
Maybe something in my config is not correct.
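
For reference, a simplified sketch of the wrapping behaviour described above (not the actual mmengine Runner code): in a single-process run the model is used as-is, and only a distributed launch gets the MMDistributedDataParallel wrapper.

import torch
from mmengine.dist import is_distributed
from mmengine.model import MMDistributedDataParallel

def wrap_model_sketch(model: torch.nn.Module) -> torch.nn.Module:
    # Simplified sketch of how the model is (or is not) wrapped.
    model = model.cuda()
    if not is_distributed():
        # Single-process run (python train.py): no DataParallel-style wrapper,
        # so a bigger batch is just a bigger forward pass on one GPU.
        return model
    # Distributed run (dist_train.sh): each rank wraps its own replica.
    return MMDistributedDataParallel(
        module=model, device_ids=[torch.cuda.current_device()])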

Reproduces the problem - code sample

My 4-GPU training command is
CUDA_VISIBLE_DEVICES=4,5,6,7 tools/dist_train.sh configs/zd_test_speed/net_202312.py 4
My single-GPU training command is (with CUDA_VISIBLE_DEVICES=4 set in the environment)
python train.py configs/zd_test_speed/net_202312.py
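
One way to confirm whether the single-GPU command really runs without the DDP wrapper is to print the distributed state early in the training script (a sketch; the helpers come from mmengine.dist):

from mmengine.dist import get_rank, get_world_size, is_distributed

# Place these prints after the launcher/distributed init step in train.py
# to see which mode the run is actually in.
print("distributed:", is_distributed())
print("world size :", get_world_size())  # expected 4 with dist_train.sh, 1 with plain python train.py
print("rank       :", get_rank())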

Reproduces the problem - command or script

The script is the original dist_train.sh from mmdet3d.

Reproduces the problem - error message

NA

Additional information

No response

Addendum:
When I train with batch=1, gpu=1, one iteration of the model forward takes 48 ms.
I tested the same config with mmcv, and the single-sample inference time there is also 48 ms.
Thank you.

Hi @doodoo0006 , did you try using nvidia-smi to check the GPU usage?

@zhouzaida
batch = 1, gpu = 1: full-train time 6 days, nvidia-smi memory usage 8.57 GB
batch = 1, gpu = 4: full-train time 1 day 18 h, nvidia-smi memory usage 8.57 GB × 4
batch = 4, gpu = 4: full-train time 2 days (not 1 day 18 h as with batch = 1, gpu = 4), nvidia-smi memory usage 22.9 GB × 4
batch = 4, gpu = 1: full-train time 6 days, nvidia-smi memory usage 22.5 GB × 1
batch = 6, gpu = 1: full-train time 6 days, nvidia-smi memory usage 32.2 GB × 1
batch = 6, gpu = 4: full-train time 1 day 23 h, nvidia-smi memory usage 32.2 GB × 4
(nvidia-smi screenshots attached for each case)

So for now the best config is batch = 1, gpu = 4; increasing the batch size on a single GPU has no effect (the full-train time stays at about 6 days whether batch is 1, 4, or 6). A rough per-sample calculation from the numbers above is given below.
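
A rough per-sample calculation from the reported forward times (a sketch using only the numbers above):

# Single-GPU per-sample forward time from the reported measurements.
batch1_iter_ms = 48    # batch=1, gpu=1: 48 ms per iteration -> 48 ms per sample
batch6_iter_ms = 630   # batch=6, gpu=1: 630 ms per iteration

print("batch=1 per-sample:", batch1_iter_ms / 1, "ms")   # 48.0 ms
print("batch=6 per-sample:", batch6_iter_ms / 6, "ms")   # 105.0 ms

# Batching on one GPU more than doubles the per-sample cost here, which matches
# the full-train times staying at ~6 days for batch 1, 4 and 6 on a single GPU,
# while adding GPUs (batch = 1, gpu = 4) brings it down to ~1 day 18 h.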