optuna/optuna-examples

Optuna with DDP training using multiple GPUs

milliema opened this issue · 6 comments

The DDP example shows the case of using CPU devices and dist.init_process_group("gloo"). When I switch to a multi-GPU environment, with dist.init_process_group('nccl', xxx), Optuna does not seem to work. It reports an error when trying to suggest values, e.g.
n_layers = trial.suggest_int("n_layers", 1, 3)
RuntimeError: Tensors must be CUDA and dense

Hi, could you share a minimal reproducible example with us?

I'm using the given example "optuna-examples/pytorch/pytorch_distributed_simple.py".
Modify line 41 to switch to the GPU device:
DEVICE = torch.device("cuda", int(rank))
Then modify lines 167-170 as shown below:
method = 'file://' + 'change this to the output_path' + '/shared'
dist.init_process_group('nccl', init_method=method, world_size=int(world_size), rank=int(rank))
Then it should be able to reproduce the issue.
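Putting those two changes together, here is a condensed sketch, assuming RANK and WORLD_SIZE are exported by the launcher (the shared-file path stays a placeholder, as above):

import os

import torch
import torch.distributed as dist

rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])

# Line 41: bind each process to its own GPU instead of the CPU.
DEVICE = torch.device("cuda", rank)

# Lines 167-170: NCCL backend with a shared-file rendezvous.
method = "file://" + "change this to the output_path" + "/shared"
dist.init_process_group("nccl", init_method=method, world_size=world_size, rank=rank)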

Just figured out: the device needs to be passed into the trial definition.
optuna.integration.TorchDistributedTrial(trial, device=DEVICE)
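For anyone hitting the same error, a minimal sketch of where this goes, assuming the example's objective(single_trial) structure (DEVICE is the per-rank CUDA device from above):

import optuna


def objective(single_trial):
    # Passing the per-rank CUDA device makes TorchDistributedTrial broadcast
    # suggested values as CUDA tensors, which the NCCL backend requires.
    trial = optuna.integration.TorchDistributedTrial(single_trial, device=DEVICE)
    n_layers = trial.suggest_int("n_layers", 1, 3)
    accuracy = 0.0  # placeholder; the example builds, trains, and evaluates a model on DEVICE here
    return accuracy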
Closed with thanks.

I still have issues with Optuna in DDP mode with multiple GPUs.
Whenever pruning happens, the process throws an error and terminates. I wonder whether the exception is not broadcast to the other nodes, which would leave them out of sync. Do you have any suggestions for solving this issue?
BTW, I would really appreciate some examples on this, as people usually use DDP with GPUs (NCCL backend) rather than CPUs (gloo backend), especially on big datasets. It would be helpful to update "pytorch_distributed_simple.py" with different backend settings. Thanks.
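For reference, here is a sketch of the rank split from the example as I understand it, assuming TorchDistributedTrial synchronizes report() and should_prune() across ranks so that a pruned trial raises TrialPruned on every process, and that the non-zero ranks are expected to catch it (rank and objective are the variables defined earlier):

import optuna

N_TRIALS = 20  # hypothetical trial count

if rank == 0:
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=N_TRIALS)
else:
    for _ in range(N_TRIALS):
        try:
            # Pruned trials raise optuna.TrialPruned on every rank; catching it
            # here keeps the worker processes alive for the next trial.
            objective(None)
        except optuna.TrialPruned:
            pass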

I have another question about the rank: why do we set it twice? Is "rank = dist.get_rank()" necessary?

rank = os.environ.get("OMPI_COMM_WORLD_RANK")  # rank assigned by Open MPI
if rank is None:
    rank = os.environ.get("PMI_RANK")  # fall back to the PMI launcher's rank
os.environ["RANK"] = str(rank)  # env:// initialization reads RANK, MASTER_ADDR, MASTER_PORT
os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "20000"
dist.init_process_group("gloo")
rank = dist.get_rank()  # rank as an int, taken from the initialized process group

How did you solve it?
