fgnt/ham_radio

Segmentation Fault when Training

Closed this issue · 2 comments

Hello,

I set up this repository along with dependencies (paderbox, padertorch, etc.) using Python 3.10 and upon training I got an segmentation fault during the test run. I am using a RTX 4090, and this is the output in the terminal:

Start test run
Move trainer, model and optimizer to cuda:0.
  0%|                                                                    | 1/100000 [00:00<?, ?it/s]Starting Validation
Finished Validation. Mean loss: 9192.9033203125
2023-10-06 10:56:55.385097: Saved model and optimizer state at iteration 0 to /tmp/tmp_vmbns4t/checkpoints/ckpt_0.pth
Segmentation fault (core dumped)

Any help would be appreciated. Thanks!

I traced the issue to be in padertorch>train>trainer.py>Trainer>step . In the if block with len(device) == 1, the segmentation fault occurs at the self.train_step line.

Will open an issue in Padertorch. Link: fgnt/padertorch#150

Model trained again using PyTorch 1.13.0 instead of 2.1.0. Closing this for now.