Denys88/rl_games

Potential Issues with Multi-GPU/Node Training with Central Network Weights Initialization


Hi, thank you for the great work on this project! I have a couple of questions about the multi-GPU/multi-node training implementation, specifically in the context of the central network.

From my reading of the source code, it appears that the initial parameters of the actor_critic model on GPU rank_0 are broadcast to the other GPU replicas, so that all replicas start training from the same weights. My questions are as follows:
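For concreteness, here is a minimal sketch of what I understand this broadcast step to do, written with torch.distributed; broadcast_initial_params is my own illustrative name, not the repository's API:

```python
import torch
import torch.distributed as dist

def broadcast_initial_params(module: torch.nn.Module, src_rank: int = 0) -> None:
    # Send every parameter and buffer tensor from src_rank to all other
    # ranks, in place, so each replica starts from identical weights.
    for tensor in list(module.parameters()) + list(module.buffers()):
        dist.broadcast(tensor.data, src=src_rank)
```

If the per-step gradients are then all-reduced (averaged) across ranks, identical starting weights plus identical updates should keep the replicas in sync, which is the premise behind question 1 below.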

1. Is broadcasting the initial parameters of the actor_critic model sufficient to ensure that all GPU replicas maintain the same parameters throughout training?
2. Given that a different seed might be used on each GPU, the central_network could be initialized with different weights on each rank. Could this be an issue for multi-GPU/multi-node training when the central_network is used? (See the sketch below this list.)
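Here is a hypothetical sketch of the scenario I have in mind for question 2. It assumes an initialized process group; base_seed and the small network are placeholders for illustration, not the repository's code:

```python
import torch
import torch.nn as nn
import torch.distributed as dist

base_seed = 42                       # placeholder for the configured seed
rank = dist.get_rank()

torch.manual_seed(base_seed + rank)  # per-rank seeding, as described above
central_net = nn.Sequential(         # stand-in for the central network
    nn.Linear(8, 64), nn.ELU(), nn.Linear(64, 1))

# At this point the central network's weights differ across ranks.
# Broadcasting them from rank 0, as is done for the actor_critic model,
# would re-align the replicas (reusing the sketch above):
broadcast_initial_params(central_net)
```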
I appreciate your time and assistance in addressing these questions.