speechbrain/speechbrain

Torch 2.3 breaks DDP?

TParcollet opened this issue · 7 comments

Describe the bug

Starting any recipe with two or more processes fails under DDP, as the run_on_main function does not hold back (no barrier?) the other processes.

From the PyTorch 2.3 docs, we can read: "ProcessGroupNCCL now relies on stream synchronization instead of device synchronization to block the CPU. Thus, please do not assume that barrier() would perform a device synchronization." This is scary.

Expected behaviour

Well, the barrier should stop the other processes until the main process is done.
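
For context, this is roughly the intended semantics (a simplified sketch, not the actual SpeechBrain implementation; `run_on_main_sketch` is just an illustrative name):

    import os
    import torch.distributed as dist

    def run_on_main_sketch(func, kwargs=None):
        # Run `func` on rank 0 only; every process, including rank 0
        # once it is done, should meet at the barrier before moving on.
        if int(os.environ.get("RANK", "0")) == 0:
            func(**(kwargs or {}))
        dist.barrier()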

To Reproduce

No response

Environment Details

No response

Relevant Log Output

No response

Additional Context

No response

I'm unable to reproduce your issue.

I tried on a two-node setup with the following example:

    import os
    import time

    import speechbrain as sb

    # assumes the DDP process group has already been initialized
    # (e.g. via sb.utils.distributed.ddp_init_group)
    if os.environ['RANK'] == '0':
        print("Rank 0 is sleeping")
        time.sleep(5)  # wait for 5 s before reaching the barrier
        print("Rank 0 finished sleeping")
        print("Rank 0 is ready for barrier")
        sb.utils.distributed.ddp_barrier()
        print("Rank 0 is done")
    else:
        print(f"Rank {os.environ['RANK']} is ready for barrier")
        sb.utils.distributed.ddp_barrier()
        print(f"Rank {os.environ['RANK']} is done")

And got the following output:

Rank 1 is ready for barrier
Rank 0 is sleeping
Rank 0 finished sleeping
Rank 0 is ready for barrier
Rank 0 is done
Rank 1 is done

I also tried with a recipe:

    import os

    import speechbrain as sb
    from speechbrain.utils.distributed import run_on_main

    from librispeech_prepare import prepare_librispeech  # noqa

    # `hparams` is loaded earlier in the recipe, as usual.
    print("BEFORE")
    print("RANK", os.environ["RANK"])

    # multi-gpu (ddp) save data preparation
    run_on_main(
        prepare_librispeech,
        kwargs={
            "data_folder": hparams["data_folder"],
            "tr_splits": hparams["train_splits"],
            "dev_splits": hparams["dev_splits"],
            "te_splits": hparams["test_splits"],
            "save_folder": hparams["output_folder"],
            "merge_lst": hparams["train_splits"],
            "merge_name": "train.csv",
            "skip_prep": hparams["skip_prep"],
        },
    )
    print("AFTER")
    print("RANK", os.environ["RANK"])

    sb.utils.distributed.ddp_barrier() 

    if os.environ['RANK'] == '0':
        print("*" * 10)
    print("BEFORE")
    print("RANK", os.environ["RANK"])

    # multi-gpu (ddp) save data preparation
    run_on_main(
        prepare_librispeech,
        kwargs={
            "data_folder": hparams["data_folder"],
            "tr_splits": hparams["train_splits"],
            "dev_splits": hparams["dev_splits"],
            "te_splits": hparams["test_splits"],
            "save_folder": hparams["output_folder"],
            "merge_lst": hparams["train_splits"],
            "merge_name": "train.csv",
            "skip_prep": hparams["skip_prep"],
        },
    )
    print("AFTER")
    print("RANK", os.environ["RANK"])

And got:

BEFORE
RANK 0
BEFORE
RANK 1
librispeech_prepare - Skipping preparation, completed in previous run.
AFTER
RANK 0
AFTER
RANK 1
**********
BEFORE
RANK 0
BEFORE
RANK 1
librispeech_prepare - Skipping preparation, completed in previous run.
AFTER
RANK 0
AFTER
RANK 1

I also tried putting multiple barriers and got the expected results.

Total mystery; I'll investigate once I'm done with my current duties at SAIC.

Can't repro on Jean Zay with 4x A100 under DDP either, with PyTorch 2.3.0.

Can't repro on Adastra with 8x MI250X under DDP either, also with PyTorch 2.3.0.

Also:

From the PyTorch 2.3 docs, we can read: "ProcessGroupNCCL now relies on stream synchronization instead of device synchronization to block the CPU. Thus, please do not assume that barrier() would perform a device synchronization." This is scary.

It did not occur to me at the time, but I don't think this is actually a problem: I suspect "device synchronization" does not mean synchronization "between devices"; the change is just that the barrier now blocks only the current CUDA stream on a device rather than all of its streams.

This would only be relevant if we made explicit use of CUDA streams, which we don't.
As phrased, the documentation still says barrier() is to be used as a barrier between processes, which is all we use it for.
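
To illustrate the distinction (a hedged sketch, not something SpeechBrain needs): barrier() still makes the processes meet, and anyone who truly relied on all device work being flushed could add an explicit device-wide synchronization themselves:

    import torch
    import torch.distributed as dist

    # barrier() makes all ranks meet; with NCCL it now synchronizes the
    # current CUDA stream rather than the whole device.
    dist.barrier()

    # Only needed if you rely on work queued on *other* CUDA streams being
    # finished at this point, which we do not.
    torch.cuda.synchronize()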

I suggest closing this issue.

Right, it's most likely a hardware issue on my side. It still happens, dunno why.

Maybe it's something similar to what @Adel-Moumen saw at some point, where all processes thought they were the main one, IIRC?
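
If that's the suspicion, a quick check (hypothetical debugging snippet, assuming the DDP group is already initialized) would be to have every process print whether it believes it is the main one:

    import os

    import speechbrain as sb

    # Each rank reports its identity; only one of them should say main=True.
    print(
        f"RANK={os.environ.get('RANK')} "
        f"LOCAL_RANK={os.environ.get('LOCAL_RANK')} "
        f"main={sb.utils.distributed.if_main_process()}"
    )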