hpi-xnor/BNext

A problem occurred when I used the provided script

shanyang0509 opened this issue · 3 comments

I used run_distributed_on_disk_a6k5_AdamW_Curicullum_Tiny_assistant_teacher_num_1_aa.sh to train, and got the error below:

7 GPUs per Node

Multiprocessing_distributed Training

True
Use GPU: 0 for training

Finished group initialization

True
Use GPU: 2 for training

Finished group initialization

True
Use GPU: 3 for training

Finished group initialization

True
Use GPU: 5 for training

Finished group initialization

True
Use GPU: 4 for training

Finished group initialization

True
Use GPU: 1 for training

Finished group initialization

True
Use GPU: 6 for training

Finished group initialization

Traceback (most recent call last):
File "train_assistant_group_amp_fune.py", line 919, in <module>
main()
File "train_assistant_group_amp_fune.py", line 912, in main
mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 2 terminated with the following error:
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/workspace/BNext/src/train_assistant_group_amp_fune.py", line 272, in main_worker
assistant_teachers.append(models.efficientnet_b0(pretrained=True))
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 947, in __getattr__
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'Sequential' object has no attribute 'append'

Hi, thanks for your interest. I have rechecked the code but unfortunately could not reproduce the problem. Maybe you can recheck your PyTorch environment instead.
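For context, the last line of the traceback is consistent with a PyTorch build whose `nn.Sequential` has no `append` method (newer releases provide one). Below is a minimal sketch of a version-tolerant workaround, assuming `assistant_teachers` is an `nn.Sequential`. The helper name `append_module` is hypothetical and is not part of BNext or PyTorch. It is duck-typed, so it does not import torch itself.

```python
def append_module(seq, module):
    """Append `module` to an nn.Sequential-like container.

    Hypothetical helper for illustration only (not part of BNext or PyTorch).
    """
    if hasattr(seq, "append"):
        # Newer PyTorch: nn.Sequential provides append().
        seq.append(module)
    else:
        # Older PyTorch: register the module under the next integer name,
        # which matches how nn.Sequential names its children.
        seq.add_module(str(len(seq)), module)
    return seq
```

If that assumption holds, the failing call could be written as `append_module(assistant_teachers, models.efficientnet_b0(pretrained=True))`. Installing the PyTorch version the repository expects is probably the cleaner fix.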

Thank you for your answer. Could you send me your training environment configuration? Thanks!
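When comparing environments, it may help to post the locally installed versions as well. Here is a small sketch using only the standard library (Python 3.8+). The package list (`torch`, `torchvision`) is an assumption based on the traceback, not a list taken from the repository.

```python
import platform
from importlib import metadata


def installed_versions(packages):
    """Return {package name: installed version, or None if not installed}."""
    result = {}
    for name in packages:
        try:
            result[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            result[name] = None
    return result


if __name__ == "__main__":
    print("python", platform.python_version())
    # Assumed package list; extend it with whatever the project pins.
    for name, version in installed_versions(["torch", "torchvision"]).items():
        print(name, version or "not installed")
```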

Please refer to the requirementx.txt, as suggested in #3 (comment)