modelscope/3D-Speaker

ERes2Net training issue

YiChen1997 opened this issue · 4 comments

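Hello! When training ERes2Net, it takes two and a half hours to get through one fifth of an epoch. Is this speed normal? The GPU is a 2080 Ti with 11 GB of memory, and batchsize=16 (about 61% of GPU memory is in use while running). Since the paper reports 4.64M parameters, this speed and memory usage do not seem right to me.
In addition, training on 3 GPUs (3×2080 Ti, batchsize=48) fails with the error below, so I would like to ask whether the training speed and memory usage described above are normal.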

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0 (pid: 7450) of binary: /data/miniconda3/envs/pytorch/bin/python
Traceback (most recent call last):
  File "/data/miniconda3/envs/pytorch/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/data/miniconda3/envs/pytorch/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/data/miniconda3/envs/pytorch/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/data/miniconda3/envs/pytorch/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/data/miniconda3/envs/pytorch/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data/miniconda3/envs/pytorch/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
====================================================
speakerlab/bin/train.py FAILED
----------------------------------------------------
Failures:
[1]:
  time      : 2024-03-25_05:30:12
  host      : pytorch
  rank      : 1 (local_rank: 1)
  exitcode  : -7 (pid: 7451)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 7451
[2]:
  time      : 2024-03-25_05:30:12
  host      : pytorch
  rank      : 2 (local_rank: 2)
  exitcode  : -7 (pid: 7452)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 7452
----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-25_05:30:12
  host      : pytorch
  rank      : 0 (local_rank: 0)
  exitcode  : -7 (pid: 7450)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 7450
====================================================
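
The log above only shows that all three ranks received SIGBUS; the thread does not identify the underlying cause. One frequent cause of SIGBUS in multi-process PyTorch jobs is running out of shared memory (`/dev/shm`), which DataLoader workers use to pass tensors, particularly inside containers. Whether that applies here is an assumption, not something confirmed in this issue. A minimal sketch for checking the free space (the helper name `shm_free_gib` is hypothetical):

```python
import shutil


def shm_free_gib(path="/dev/shm"):
    """Return the free space at `path` in GiB, or None if it cannot be read."""
    try:
        usage = shutil.disk_usage(path)
    except OSError:
        # Path is missing or not accessible on this system.
        return None
    return usage.free / 2**30


free = shm_free_gib()
if free is None:
    print("/dev/shm is not available on this system")
else:
    print(f"/dev/shm free: {free:.2f} GiB")
```

If the reported value is very small, enlarging the shared-memory mount or reducing the number of DataLoader workers is a reasonable thing to try before rerunning.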

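Taking the 3D-Speaker dataset as an example, training the ERes2Net-base model for one epoch takes half an hour on four V100 GPUs with 32 GB of memory each, with batchsize=256 in total and a per-GPU batch size of 64. The model parameters and performance are shown in the figure below.
[Screenshot 2024-03-25 4:13:27 PM]
Training on a single GPU is not recommended because it is too slow; DDP can shorten the training time. Judging from the error above, it is probably caused by torch's multi-process training. You can re-clone the code, assign the correct GPUs, and try again.
If you have any other questions, feel free to reach out. Looking forward to your successful attempt!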
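
To put the two setups side by side, the figures quoted in this thread can be worked through as below. This is a rough sketch only: it assumes both runs use the same dataset, which the thread does not confirm for the single-GPU run, and the function names are illustrative.

```python
def hours_per_epoch(elapsed_hours, parts_done, parts_total):
    """Extrapolate full-epoch time from the time spent on part of an epoch."""
    return elapsed_hours * parts_total / parts_done


def global_batch(per_gpu_batch, num_gpus):
    """Global batch size under DDP: per-GPU batch times number of GPUs."""
    return per_gpu_batch * num_gpus


# Single 2080 Ti: 2.5 hours for one fifth of an epoch, batch size 16.
single_gpu_hours = hours_per_epoch(2.5, 1, 5)

# Four V100s: per-GPU batch 64, half an hour per epoch.
v100_batch = global_batch(64, 4)
v100_hours = 0.5

# Steps per epoch scale inversely with the global batch size.
step_ratio = v100_batch / 16
time_ratio = single_gpu_hours / v100_hours

print(f"single-GPU epoch estimate: {single_gpu_hours} h")
print(f"V100 global batch: {v100_batch}")
print(f"single-GPU run needs {step_ratio:.0f}x the steps, {time_ratio:.0f}x the time")
```

Under these assumptions the single-GPU run takes 16 times as many optimizer steps per epoch, which accounts for a large share of the gap before any difference in per-step speed is considered.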

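Thank you for your reply! The yaml for CAM++ sets the same batchsize of 256. May I also ask whether its training time on four 32 GB V100s is likewise about half an hour? I ported the CAM++ model code on its own, and on a single 24 GB 3090 Ti it runs with a batchsize of 256, taking about 1 hour 24 minutes per epoch. However, the batch sizes in the two setups do not seem to scale proportionally. Why is that?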

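  1. For CAM++, one epoch takes about 25 min on four 32 GB V100s (on the 3D-Speaker dataset).
  2. It depends on your GPU utilization. If a single 3090 Ti can finish one epoch in 1 h 24 min, its GPU utilization is very high; there is no proportional relationship between the two.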
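
One way to compare the two CAM++ timings is in GPU-minutes per epoch. The sketch below uses only the figures quoted in this thread and assumes both runs cover the same data with a comparable data pipeline. That may not hold, since the 3090 Ti run uses separately ported model code. The helper name is illustrative.

```python
def gpu_minutes(num_gpus, minutes_per_epoch):
    """Total GPU time spent on one epoch, in GPU-minutes."""
    return num_gpus * minutes_per_epoch


v100_cost = gpu_minutes(4, 25)        # four V100s, about 25 min per epoch
rtx_cost = gpu_minutes(1, 60 + 24)    # one 3090 Ti, 1 h 24 min per epoch

print(f"4x V100:    {v100_cost} GPU-minutes per epoch")
print(f"1x 3090 Ti: {rtx_cost} GPU-minutes per epoch")
print(f"ratio (V100 / 3090 Ti): {v100_cost / rtx_cost:.2f}")
```

Measured this way, the single 3090 Ti spends fewer GPU-minutes per epoch than the four V100s combined, which is consistent with the remark that its utilization is very high and that epoch time does not scale in simple proportion to GPU count.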

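OK, thank you. Your answers have been very helpful, and I now have a rough idea of how to adjust things next. Thanks very much for your reply!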