[BUG]
Issue Title
Megatron-LM: Zero-1 with Distributed Optimizer Showing No Overlap in Communication and Computation
Issue Description
We are experiencing an issue with Megatron-LM where enabling the zero-1-style overlap flags (--overlap-grad-reduce --overlap-param-gather) together with the distributed optimizer (--use-distributed-optimizer) does not produce the expected overlap of communication and computation during training. Even after increasing CUDA_DEVICE_MAX_CONNECTIONS, we still observe serial execution of the communication and computation steps.
Steps to Reproduce
Set up Megatron-LM training with zero-1 enabled (--overlap-grad-reduce --overlap-param-gather).
Enable the distributed optimizer (--use-distributed-optimizer).
Increase CUDA_DEVICE_MAX_CONNECTIONS to a higher value.
Start the training process and observe the execution flow.
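For reference, a minimal sketch of the launch configuration described by the steps above. Paths, the GPU count, and the CUDA_DEVICE_MAX_CONNECTIONS value are placeholders; the three overlap-related flags are the ones in question:

```shell
# Sketch only: concrete model/data arguments are omitted.
# We tried raising CUDA_DEVICE_MAX_CONNECTIONS; the value here is illustrative.
export CUDA_DEVICE_MAX_CONNECTIONS=8

torchrun --nproc_per_node=8 pretrain_gpt.py \
    --use-distributed-optimizer \
    --overlap-grad-reduce \
    --overlap-param-gather \
    # ... remaining model, data, and parallelism arguments ...
```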
Expected Behavior
We expect communication (gradient reduce-scatter and parameter all-gather) to overlap with computation during training, as these flags are intended to enable.
Actual Behavior
Communication and computation steps execute serially, with no overlap visible in the timeline.
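The serialization is visible in a profiler timeline. A sketch of how we inspect it (assuming Nsight Systems is available; the exact launch arguments are placeholders):

```shell
# Sketch: capture a CUDA timeline and check whether NCCL communication
# kernels run concurrently with compute kernels, or strictly after them.
nsys profile -t cuda,nvtx -o megatron_overlap_trace \
    torchrun --nproc_per_node=8 pretrain_gpt.py \
    # ... same training arguments as above ...
```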
Environment
Megatron-LM version: cafda95
PyTorch version: 2.1.0a0+32f93b1
CUDA version: 12.2
GPU models and configuration: H800