Training Error

Question

mohammadian1399 opened this issue 3 years ago · 0 comments

Hello, and thank you for the good idea. I'm new to deep learning and tried to train your model on my own data with the following command:
torchrun --standalone --nnodes=1 --nproc_per_node=1 train.py -C fullsubnet/my_train.toml
Apart from the data paths, the only change I made to "train.toml" was in the [train_dataset.dataloader] section, where I set the batch size to 8 and num_workers to 36, but I got an error. The relevant part of the error log is as follows:

Training: 100%|██████████| 7500/7500 [1:07:11<00:00, 1.86it/s]
Validation: 0it [00:00, ?it/s]
Validation: 0it [00:00, ?it/s]
Traceback (most recent call last):
File "/home/p.mohammadian.student.sharif/FullSubNet/recipes/dns_interspeech_2020/train.py", line 103, in <module>
entry(local_rank, configuration, args.resume, args.only_validation)
File "/home/p.mohammadian.student.sharif/FullSubNet/recipes/dns_interspeech_2020/train.py", line 72, in entry
trainer.train()
File "/home/p.mohammadian.student.sharif/FullSubNet/audio_zen/trainer/base_trainer.py", line 337, in train
metric_score = self._validation_epoch(epoch)
File "/home/p.mohammadian.student.sharif/.conda/envs/p.mohammadian.student.sharif/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/home/p.mohammadian.student.sharif/FullSubNet/recipes/dns_interspeech_2020/fullsubnet/trainer.py", line 111, in _validation_epoch
self.writer.add_scalar(f"Loss/Validation_Total", loss_total / len(self.valid_dataloader), epoch)
ZeroDivisionError: float division by zero
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 34536) of binary: /home/p.mohammadian.student.sharif/.conda/envs/p.mohammadian.student.sharif/bin/python
Traceback (most recent call last):
File "/home/p.mohammadian.student.sharif/.conda/envs/p.mohammadian.student.sharif/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==1.10.2', 'console_scripts', 'torchrun')())
File "/home/p.mohammadian.student.sharif/.conda/envs/p.mohammadian.student.sharif/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/p.mohammadian.student.sharif/.conda/envs/p.mohammadian.student.sharif/lib/python3.9/site-packages/torch/distributed/run.py", line 719, in main
run(args)
File "/home/p.mohammadian.student.sharif/.conda/envs/p.mohammadian.student.sharif/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/home/p.mohammadian.student.sharif/.conda/envs/p.mohammadian.student.sharif/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/p.mohammadian.student.sharif/.conda/envs/p.mohammadian.student.sharif/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2022-03-01_00:13:28
host : GPU-4-0-3.local
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 34536)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Thanks for your guidance!
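
A note on the traceback: the log shows "Validation: 0it", and the failing line divides loss_total by len(self.valid_dataloader). This suggests the validation dataloader produced zero batches, so the divisor is 0. A likely cause (an assumption, not confirmed by the log) is that the validation dataset paths in the .toml file point to an empty or missing location. The minimal sketch below reproduces that division with a plain empty list standing in for the dataloader and shows one possible guard. The helper name mean_validation_loss is hypothetical and is not part of FullSubNet.

```python
def mean_validation_loss(loss_total, num_batches):
    """Hypothetical helper: average the accumulated loss over the batches seen.

    Raises a descriptive error instead of ZeroDivisionError when the
    validation dataloader yielded no batches.
    """
    if num_batches == 0:
        raise ValueError(
            "Validation dataloader is empty; check the validation dataset "
            "paths in the .toml configuration."
        )
    return loss_total / num_batches


# Stand-ins for the trainer state (assumptions for illustration only).
empty_loader = []   # plays the role of self.valid_dataloader with zero batches
loss_total = 0.0    # nothing was accumulated because no batch was processed

# Unguarded division, as in the failing line of the traceback.
try:
    loss_total / len(empty_loader)
except ZeroDivisionError as err:
    print(f"unguarded: {err}")

# Guarded division reports the probable root cause instead.
try:
    mean_validation_loss(loss_total, len(empty_loader))
except ValueError as err:
    print(f"guarded: {err}")

# With at least one batch, the helper behaves like the original expression.
print(mean_validation_loss(6.0, 3))
```

Printing len(self.valid_dataloader) before training starts is a quick way to check whether the validation set was actually found.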
