fgnt/padertorch

Segmentation Fault when Training Model

samly97 opened this issue · 2 comments

I ran into an issue with training a model using Padertorch ([Ham radio](https://github.com/fgnt/ham_radio/issues/1)). I get a segmentation fault in the training loop function. When I place a pdb debug statement in `train_step` I can see the loss, summary, etc., but the segmentation fault occurs upon return. Location:

  • `padertorch/train/trainer.py` > `Trainer.step`, in the `if` block where `len(device) == 1`

I am using a single RTX 4090 on Python 3.10 with PyTorch 2.1.0.
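Since pdb only catches Python-level exceptions, a native segfault gives no traceback on return from `train_step`. One way to get more than a silent crash would be to enable Python's built-in `faulthandler` before training starts (sketch only; the `trainer.train(...)` call is a placeholder for however the ham_radio recipe actually launches training):

```python
import faulthandler
import sys

# Print a traceback for every thread when a fatal signal such as
# SIGSEGV is raised, instead of the process dying silently.
faulthandler.enable(file=sys.stderr, all_threads=True)

# ... build the model/trainer as usual, then launch training.
# trainer.train(train_iterator)  # placeholder for the actual call
```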

Thank you.

Hmm. If I remember correctly, there is only Python code from us involved there. Could you try to find which operation causes this segmentation fault? The train step moves the data to the GPU, executes the model's forward and review, and runs the optimiser step. I guess one torch operation causes the segmentation fault.
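Something like the following might narrow it down. It is a simplified, hypothetical stand-in for the step, not the actual trainer code (the batch-dict layout and names are assumptions); the `torch.cuda.synchronize()` after each stage forces queued CUDA kernels to finish, so the last marker printed points at the stage that actually triggered the fault:

```python
import torch

def instrumented_step(model, optimizer, batch, device="cuda:0"):
    """Hypothetical, instrumented version of a single train step."""
    def mark(stage):
        torch.cuda.synchronize()          # flush pending CUDA work
        print(f"completed: {stage}", flush=True)

    # Move the example to the GPU.
    batch = {k: v.to(device) if torch.is_tensor(v) else v
             for k, v in batch.items()}
    mark("to_device")

    output = model(batch)                 # model forward
    mark("forward")

    summary = model.review(batch, output) # review computes the loss
    mark("review")

    summary["loss"].backward()
    mark("backward")

    optimizer.step()
    optimizer.zero_grad()
    mark("optimizer_step")
    return summary
```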

Thanks for the reply. I uninstalled PyTorch 2.1.0 and installed PyTorch 1.13.0 instead, and the model started training. I'll close this issue.