ERes2Net training issue
YiChen1997 opened this issue · 4 comments
YiChen1997 commented
Hi! When training ERes2Net, it takes me two and a half hours to get through one fifth of an epoch. Is this speed normal? The GPU is a 2080 Ti with 11 GB of VRAM, batch size = 16 (61% VRAM usage during training). Since the paper reports only 4.64M parameters, this training speed and memory footprint seem off to me.
In addition, training on 3 GPUs (3×2080 Ti, batch size = 48) fails with the error below, so I would also like to confirm whether the single-GPU training speed and memory usage described above are normal.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0 (pid: 7450) of binary: /data/miniconda3/envs/pytorch/bin/python
Traceback (most recent call last):
File "/data/miniconda3/envs/pytorch/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/data/miniconda3/envs/pytorch/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/data/miniconda3/envs/pytorch/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/data/miniconda3/envs/pytorch/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/data/miniconda3/envs/pytorch/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/data/miniconda3/envs/pytorch/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
====================================================
speakerlab/bin/train.py FAILED
----------------------------------------------------
Failures:
[1]:
time : 2024-03-25_05:30:12
host : pytorch
rank : 1 (local_rank: 1)
exitcode : -7 (pid: 7451)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 7451
[2]:
time : 2024-03-25_05:30:12
host : pytorch
rank : 2 (local_rank: 2)
exitcode : -7 (pid: 7452)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 7452
----------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-03-25_05:30:12
host : pytorch
rank : 0 (local_rank: 0)
exitcode : -7 (pid: 7450)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 7450
====================================================
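Exit code -7 corresponds to Signal 7 (SIGBUS). In distributed PyTorch training this is commonly caused by insufficient shared memory on the `/dev/shm` mount, for example inside a Docker container with the default 64 MB `--shm-size`, since DataLoader workers pass tensors through shared memory. As a hedged sketch (the helper below is hypothetical, not part of the 3D-Speaker repo), one can check the available shared memory before launching:

```python
import os

def shm_free_bytes(path="/dev/shm"):
    """Return free bytes on the shared-memory mount, or None if it is absent."""
    if not os.path.isdir(path):
        return None
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize

free = shm_free_bytes()
if free is None:
    print("no /dev/shm mount found")
elif free < 1 << 30:
    # Less than 1 GiB is often too small for multi-worker DataLoader setups;
    # in Docker, consider raising it with --shm-size (e.g. --shm-size=8g).
    print(f"/dev/shm free: {free / 1e6:.0f} MB, likely too small for DDP training")
else:
    print(f"/dev/shm free: {free / 1e9:.1f} GB")
```

If the free space is in the tens of megabytes, enlarging the container's shared memory or reducing `num_workers` are the usual first things to try.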
yfchenlucky commented
YiChen1997 commented
Thank you for the reply! The batch size in the CAM++ yaml is likewise set to 256. May I also ask whether its training time on four 32 GB V100s is around half an hour as well? I separately ported the CAM++ model code, and on a single 3090 Ti (24 GB) I can run a batch size of 256, with one epoch taking about 1 hour 24 minutes. The batch sizes of the two setups do not seem to scale proportionally with training time, though. Why is that?
yfchenlucky commented
- CAM++ on four 32 GB V100s takes about 25 min per epoch (on the 3D-Speaker dataset).
- It depends on your GPU utilization. If a single 3090 Ti finishes one epoch in 1 h 24 min, its utilization is already very high; there is no proportional relationship between the two setups.
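The point above can be made concrete with a small worked example: epoch time is the dataset size divided by sustained throughput (samples per second), and batch size only affects throughput indirectly through hardware utilization. The numbers below are purely illustrative, not measurements from either setup:

```python
def epoch_minutes(num_samples, samples_per_sec):
    """Epoch wall-clock time in minutes, given sustained throughput."""
    return num_samples / samples_per_sec / 60

# Illustrative only: the same dataset processed at different throughputs.
dataset_size = 600_000  # hypothetical number of training utterances

# A well-utilized single GPU with a large batch can sustain high throughput...
single_gpu = epoch_minutes(dataset_size, samples_per_sec=120)

# ...while a multi-GPU run bottlenecked by data loading or communication
# may not be 4x faster despite 4x the aggregate batch size.
four_gpu = epoch_minutes(dataset_size, samples_per_sec=400)

print(f"single GPU: {single_gpu:.0f} min/epoch, four GPUs: {four_gpu:.0f} min/epoch")
```

So doubling the batch size does not halve the epoch time; what matters is whether the GPUs stay busy at that batch size.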
YiChen1997 commented
Got it, thank you. Your answers have been very helpful, and I now have a rough idea of how to adjust things going forward. Thanks again for your reply!