NVIDIA/NeMo

Slurm interactive mode, transcribe_speech_parallel.py gets stuck on consecutive runs


With the container nvcr.io/nvidia/nemo:24.07 (Pyxis/Enroot), running in Slurm interactive mode with 1 GPU, if I execute the command

python3 /opt/NeMo/examples/asr/transcribe_speech_parallel.py \
...

the script gets stuck on the second of multiple consecutive runs. The point where it hangs is

> HERE
Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
...

However, if I execute the command

torchrun --standalone --nnodes=1 --nproc-per-node=1 /opt/NeMo/examples/asr/transcribe_speech_parallel.py \
...

I can run it multiple consecutive times without any issues.
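
For clarity, a minimal sketch of how the consecutive runs look (the Hydra overrides in ARGS are placeholders standing in for the arguments elided above, not the actual values):

# Inside the Slurm interactive allocation with 1 GPU.
SCRIPT=/opt/NeMo/examples/asr/transcribe_speech_parallel.py
# Placeholder overrides; substitute the real model/manifest/output paths.
ARGS="model=stt_en_conformer_ctc_small predict_ds.manifest_filepath=/data/manifest.json output_path=/results"

# Variant 1: plain python3 -- the second iteration gets stuck.
for i in 1 2; do
    python3 "$SCRIPT" $ARGS
done

# Variant 2: torchrun -- both iterations complete.
for i in 1 2; do
    torchrun --standalone --nnodes=1 --nproc-per-node=1 "$SCRIPT" $ARGS
done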

I have tried running with the standard debug environment variables

TORCH_CPP_LOG_LEVEL=INFO 
TORCH_DISTRIBUTED_DEBUG=INFO 
NCCL_DEBUG=INFO 
NCCL_DEBUG_SUBSYS=ALL

but nothing peculiar stands out in the logs.
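
For reference, one way to set them is inline in front of the hanging command (reusing the placeholder SCRIPT/ARGS from the sketch above):

TORCH_CPP_LOG_LEVEL=INFO \
TORCH_DISTRIBUTED_DEBUG=INFO \
NCCL_DEBUG=INFO \
NCCL_DEBUG_SUBSYS=ALL \
python3 "$SCRIPT" $ARGS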