Slurm interactive mode, transcribe_speech_parallel.py gets stuck on consecutive runs
itzsimpl commented
With the container nvcr.io/nvidia/nemo:24.07
(Pyxis/Enroot), running in Slurm interactive mode with 1 GPU (a sketch of the allocation is shown below), if I execute the command
python3 /opt/NeMo/examples/asr/transcribe_speech_parallel.py \
...
the script gets stuck on the second of multiple consecutive runs. The point where it hangs is here:
Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
...
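For reference, a minimal sketch of the kind of Slurm interactive allocation mentioned above; the exact flags are assumptions, not my actual allocation line:

srun --gpus=1 --container-image=nvcr.io/nvidia/nemo:24.07 --pty bash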
However, if I execute the command
torchrun --standalone --nnodes=1 --nproc-per-node=1 /opt/NeMo/examples/asr/transcribe_speech_parallel.py \
...
I can run it multiple consecutive times without any issues.
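For what it's worth, torchrun exports the distributed rendezvous variables (RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT) to the process it spawns, while a bare python3 launch does not. A rough single-process equivalent of the torchrun launch would look like the sketch below; the address/port values are placeholders, and whether any of this is related to the hang is unverified:

export MASTER_ADDR=127.0.0.1  # placeholder; torchrun --standalone picks the rendezvous endpoint itself
export MASTER_PORT=29500      # placeholder port
export RANK=0
export LOCAL_RANK=0
export WORLD_SIZE=1
python3 /opt/NeMo/examples/asr/transcribe_speech_parallel.py \
...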
I have tried the standard debugging environment variables, such as
TORCH_CPP_LOG_LEVEL=INFO
TORCH_DISTRIBUTED_DEBUG=INFO
NCCL_DEBUG=INFO
NCCL_DEBUG_SUBSYS=ALL
but nothing peculiar stands out in the output.
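For completeness, a sketch of how those debug variables would be prefixed to the plain python3 launch (same truncated invocation as above):

TORCH_CPP_LOG_LEVEL=INFO TORCH_DISTRIBUTED_DEBUG=INFO \
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL \
python3 /opt/NeMo/examples/asr/transcribe_speech_parallel.py \
...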