Segmentation Fault when Training
Closed this issue · 2 comments
samly97 commented
Hello,
I set up this repository along with dependencies (paderbox, padertorch, etc.) using Python 3.10 and upon training I got an segmentation fault during the test run. I am using a RTX 4090, and this is the output in the terminal:
Start test run
Move trainer, model and optimizer to cuda:0.
0%| | 1/100000 [00:00<?, ?it/s]Starting Validation
Finished Validation. Mean loss: 9192.9033203125
2023-10-06 10:56:55.385097: Saved model and optimizer state at iteration 0 to /tmp/tmp_vmbns4t/checkpoints/ckpt_0.pth
Segmentation fault (core dumped)
Any help would be appreciated. Thanks!
samly97 commented
I traced the issue to be in padertorch>train>trainer.py>Trainer>step . In the if block with len(device) == 1
, the segmentation fault occurs at the self.train_step line.
Will open an issue in Padertorch. Link: fgnt/padertorch#150
samly97 commented
Model trained again using PyTorch 1.13.0 instead of 2.1.0. Closing this for now.