NVIDIA/Megatron-LM

[BUG]: If I run test_serialization.py repeatedly, there is a small chance that it gets stuck


Describe the bug
If I run tests/unit_tests/dist_checkpointing/test_serialization.py repeatedly, there is a small chance that it gets stuck.

To Reproduce
Step 1: cd tests/unit_tests
Step 2: bash ut.sh

The contents of ut.sh are:

#!/bin/bash
set -e

# Run the dist_checkpointing serialization tests 500 times on 8 GPUs;
# the hang shows up somewhere within these repetitions.
for ((i = 1; i <= 500; i++)); do
        torchrun --nproc_per_node 8 -m pytest dist_checkpointing/test_serialization.py
done
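To surface the hang without waiting for the 10-minute NCCL watchdog on each stuck run, the loop can also be wrapped with a shorter per-iteration deadline. This is only a sketch, assuming GNU coreutils `timeout` is available in the container; the 300-second limit is an arbitrary choice, not part of the original reproduction.

#!/bin/bash
# Sketch: abort any iteration that exceeds 5 minutes so the loop reports
# the hang instead of blocking until the NCCL watchdog fires.
for ((i = 1; i <= 500; i++)); do
        echo "iteration $i"
        if ! timeout 300 torchrun --nproc_per_node 8 -m pytest dist_checkpointing/test_serialization.py; then
                echo "iteration $i failed or hung" >&2
                exit 1
        fi
done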

Expected behavior
No errors; every iteration of the loop should complete.

Stack trace/logs

dist_checkpointing/test_serialization.py rootdir: /workspace/volume/huyongan/LLM/github/Megatron-LM
plugins: flakefinder-1.1.0, shard-0.1.2, hypothesis-5.35.1, xdoctest-1.0.2, rerunfailures-13.0, xdist-3.5.0
collected 6 items
Running 6 items in this shard

dist_checkpointing/test_serialization.py ................FFFFFFFFF[rank4]:[E ProcessGroupNCCL.cpp:564] [Rank 4] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank4]:[E ProcessGroupNCCL.cpp:570] [Rank 4] To avoid data inconsistency, we are taking the entire process down.
[rank4]:[E ProcessGroupNCCL.cpp:1335] [PG 0 Rank 4] NCCL watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLGATHER, NumelIn=384, NumelOut=3072, Timeout(ms)=600000) ran for 600488 milliseconds before timing out.
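The trace only shows the watchdog timing out on an ALLGATHER. To see which collective each rank is blocked in, the failing run can be repeated with verbose NCCL and torch.distributed logging enabled. This is a suggested debugging sketch, not the configuration used for the logs above; the variables are standard NCCL / PyTorch debug switches.

# Sketch: enable verbose collective logging before re-running the test.
export NCCL_DEBUG=INFO                 # per-rank NCCL transport and collective logs
export NCCL_DEBUG_SUBSYS=COLL          # restrict NCCL logs to collective calls
export TORCH_DISTRIBUTED_DEBUG=DETAIL  # extra consistency checks in torch.distributed
torchrun --nproc_per_node 8 -m pytest dist_checkpointing/test_serialization.py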

Environment (please complete the following information):

  • Megatron-LM core_r0.5.0
  • PyTorch 2.3
  • CUDA 12.3
  • NCCL 2.30.3