The problem occurred when I used the provided script

Question

shanyang0509 opened this issue 2 years ago · 3 comments

shanyang0509 commented 2 years ago

I used run_distributed_on_disk_a6k5_AdamW_Curicullum_Tiny_assistant_teacher_num_1_aa.sh to train, and got the error below:

7 GPUs per Node

Multiprocessing_distributed Training

True
Use GPU: 0 for training

Finished group initialization

True
Use GPU: 2 for training

Finished group initialization

True
Use GPU: 3 for training

Finished group initialization

True
Use GPU: 5 for training

Finished group initialization

True
Use GPU: 4 for training

Finished group initialization

True
Use GPU: 1 for training

Finished group initialization

True
Use GPU: 6 for training

Finished group initialization

Traceback (most recent call last):
  File "train_assistant_group_amp_fune.py", line 919, in <module>
    main()
  File "train_assistant_group_amp_fune.py", line 912, in main
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/workspace/BNext/src/train_assistant_group_amp_fune.py", line 272, in main_worker
    assistant_teachers.append(models.efficientnet_b0(pretrained=True))
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 947, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'Sequential' object has no attribute 'append'
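
The last frame shows that `assistant_teachers` is an `nn.Sequential` and that the installed PyTorch does not give it an `append` method. As far as I know, `nn.Sequential.append` only exists in newer PyTorch releases (around 1.11 and later), so an older install is one plausible cause. Below is a minimal sketch of a version-tolerant workaround; it is not the repository's code. `append_module` and `LegacyContainer` are hypothetical names, and `LegacyContainer` is only a stand-in so the sketch runs without PyTorch installed. `add_module` and `len()` are long-standing parts of `nn.Sequential`.

```python
def append_module(container, module):
    """Append `module` to an nn.Sequential-like container.

    Uses container.append() when the container provides it; otherwise
    falls back to add_module() with the next integer index as the name,
    which is how nn.Sequential names its children.
    """
    if hasattr(container, "append"):
        container.append(module)
    else:
        container.add_module(str(len(container)), module)
    return container


class LegacyContainer:
    """Hypothetical stand-in for an old nn.Sequential without append()."""

    def __init__(self):
        self._modules = {}

    def __len__(self):
        return len(self._modules)

    def add_module(self, name, module):
        self._modules[name] = module


# Demo with placeholder strings instead of real models.
legacy = LegacyContainer()
append_module(legacy, "teacher_a")
append_module(legacy, "teacher_b")
print(list(legacy._modules.items()))
```

With a real model, the failing line 272 would then read something like `append_module(assistant_teachers, models.efficientnet_b0(pretrained=True))`, assuming `assistant_teachers` stays an `nn.Sequential`.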

Answer 1 · 2023-01-09T09:11:57.000Z

Hi, thanks for your interest. I have rechecked the code but unfortunately could not reproduce the problem. Maybe you can recheck your PyTorch environment instead.
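
One quick way to recheck the environment is to print the installed PyTorch version and test whether `nn.Sequential` exposes `append`, since that is the attribute the traceback reports as missing. This is only a sketch: the helper name `check_sequential_append` is hypothetical, and it returns `None` when PyTorch is not importable so it can run anywhere.

```python
import importlib.util


def check_sequential_append():
    """Report whether the installed torch.nn.Sequential has append().

    Returns (version, has_append), or None if torch is not installed.
    """
    if importlib.util.find_spec("torch") is None:
        return None
    import torch
    import torch.nn as nn

    return torch.__version__, hasattr(nn.Sequential, "append")


result = check_sequential_append()
if result is None:
    print("torch is not installed in this environment")
else:
    version, has_append = result
    print("torch version:", version)
    print("nn.Sequential has append():", has_append)
```

If the second line prints `False`, the training script's call to `assistant_teachers.append(...)` will fail the same way as in the report, and installing the PyTorch version the repository pins is the likely fix.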

Answer 2 · 2023-01-10T03:37:25.000Z

Thank you for your answer. Could you send me your training environment configuration? Thanks!

Answer 3 · 2023-01-16T21:24:22.000Z

Please refer to the requirementx.txt, as suggested in #3 (comment).