Addressing Bottlenecks in Training
I am using two RTX 3090 GPUs to run train.py following the provided guide.
I used only one FullNarrowBlock, but training took 5 hours to complete just one of the 15 epochs, which is far longer than I expected.
When I checked the GPU utilization, it seemed there was a bottleneck somewhere in the code. I suspect the bottleneck is in the data loading and processing part.
I am wondering if this is a normal occurrence.
If something is wrong, could you give me some advice on how to deal with it?
Sorry for the late response. I recommend using the lightning version of the code for training. It is faster than the torch version. Using mixed precision in the lightning version will further speed up the training. Regarding the bottleneck issue you mentioned, one possible reason is that we are saving too many attributes when storing the simulation data (Dataset.AcousticScene), which might cause an I/O bottleneck during training. You can try removing unnecessary attributes when saving the simulation data.
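As a rough illustration of the "remove unnecessary attributes" suggestion, here is a minimal sketch. It assumes the simulation objects are plain Python objects serialized with pickle, which may not match how the repo actually stores them. The Scene class and all its attribute names are hypothetical stand-ins, not the real Dataset.AcousticScene fields, and slim_copy is an illustrative helper, not something the repo provides.

```python
import copy
import pickle


class Scene:
    """Hypothetical stand-in for a simulated scene; attribute names are illustrative only."""

    def __init__(self):
        self.mic_signals = [0.0] * 16000               # assumed to be needed for training
        self.source_doa = [0.5, 1.2]                   # assumed to be needed for training
        self.room_impulse_responses = [0.0] * 200000   # large, assumed unused at training time
        self.debug_trajectory = [0.0] * 50000          # large, assumed unused at training time


def slim_copy(obj, keep):
    """Return a shallow copy of obj that carries only the attributes named in keep."""
    slim = copy.copy(obj)
    # Iterate over a snapshot of the names, since attributes are deleted in the loop.
    for name in list(vars(slim)):
        if name not in keep:
            delattr(slim, name)
    return slim


scene = Scene()
slim = slim_copy(scene, keep={"mic_signals", "source_doa"})

# Compare the serialized sizes of the full and the slimmed object.
full_bytes = len(pickle.dumps(scene))
slim_bytes = len(pickle.dumps(slim))
print(f"full: {full_bytes} bytes, slim: {slim_bytes} bytes")
```

Which attributes are safe to drop depends on what the training code reads, so it is worth checking the dataset's loading code before regenerating the data.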
In the lightning version, you can add '--trainer.precision=16-mixed' to the train command to enable mixed precision training.
python main.py fit --data.batch_size=[*,*] --trainer.devices=*,* (--trainer.precision=16-mixed)
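To confirm whether the slowdown really comes from data loading, as suspected in the question, one option is to time how long the loop waits for each batch versus how long the training step takes. The sketch below is generic and not part of the repo: profile_loop and slow_loader are hypothetical names, and the sleeping generator only simulates a slow loader. With a real GPU step, step_fn should end with torch.cuda.synchronize(), otherwise asynchronous kernel launches make the compute time look too small.

```python
import time


def profile_loop(batches, step_fn, max_batches=50):
    """Split wall time into time spent waiting for batches and time spent in step_fn."""
    wait = 0.0
    compute = 0.0
    it = iter(batches)
    for _ in range(max_batches):
        t0 = time.perf_counter()
        try:
            batch = next(it)          # time spent here is data loading / preprocessing
        except StopIteration:
            break
        t1 = time.perf_counter()
        step_fn(batch)                # time spent here is the training step
        t2 = time.perf_counter()
        wait += t1 - t0
        compute += t2 - t1
    return wait, compute


def slow_loader(n, delay=0.01):
    """Simulated loader that takes `delay` seconds to produce each batch."""
    for i in range(n):
        time.sleep(delay)
        yield i


wait, compute = profile_loop(slow_loader(5), step_fn=lambda batch: None)
print(f"waiting for data: {wait:.3f}s, training step: {compute:.3f}s")
```

If the waiting time dominates, the loader (disk I/O, deserialization, or preprocessing) is the likely bottleneck. If the step time dominates, mixed precision is the more promising change.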