optuna/optuna-examples

Does integration.TorchDistributedTrial support multinode optimization?

siemdejong opened this issue · 0 comments


I'm using Optuna on a SLURM cluster. Suppose I want to run a distributed hyperparameter optimization on two nodes with two GPUs each. Would submitting a script like pytorch_distributed_simple.py to multiple nodes yield the expected results?
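Concretely, this is the process layout I'm expecting (a minimal sketch, assuming a torchrun/env:// style launcher that sets `WORLD_SIZE`, `RANK`, and `LOCAL_RANK`; four processes in total, two ranks per node):

```python
import os
import socket

import torch.distributed as dist

# Sketch only: with two nodes x two GPUs I expect the launcher to start four
# processes (world size 4, ranks 0-3), two ranks per node, one per GPU.
dist.init_process_group(backend="nccl")

rank = dist.get_rank()
local_rank = int(os.environ.get("LOCAL_RANK", 0))
print(
    f"host={socket.gethostname()} rank={rank} "
    f"local_rank={local_rank} world_size={dist.get_world_size()}"
)

dist.destroy_process_group()
```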

I assume every node would be responsible for executing its own trials (i.e., no two nodes share a trial), and that every GPU on a node handles its own portion of the data, as determined by the sampler passed to torch.utils.data.DataLoader. Is this assumption correct, or are edits needed beyond TorchDistributedTrial's requirement to pass None to the objective calls on ranks other than 0? The objective I have in mind looks roughly like the sketch below.
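This is a simplified version of what I'm doing, in the spirit of pytorch_distributed_simple.py (the dataset, model, and hyperparameter are placeholders; the default process group is assumed to be initialized before the objective runs):

```python
import os

import torch
from optuna.integration import TorchDistributedTrial
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def objective(trial):
    # trial is the real Optuna trial on rank 0 and None on the other ranks;
    # TorchDistributedTrial broadcasts the suggested values from rank 0.
    trial = TorchDistributedTrial(trial)
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)

    # Dummy data just to illustrate the sharding: DistributedSampler gives
    # each process (one per GPU) a distinct portion of the dataset.
    dataset = TensorDataset(torch.randn(1024, 10), torch.randint(0, 2, (1024,)))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    device = torch.device("cuda", int(os.environ.get("LOCAL_RANK", 0)))
    model = torch.nn.Linear(10, 2).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()

    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x.to(device)), y.to(device))
        loss.backward()
        optimizer.step()

    return loss.item()
```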

I already tried the above, but I'm not sure how to verify that every node is responsible for distinct trials.
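One way I thought of checking this (not sure it's the intended approach): tag each trial with the hostname on rank 0, where the real optuna.Trial is available, and read the user attributes back from the shared storage afterwards. The study name and SQLite URL below are placeholders for whatever storage the nodes share.

```python
import socket

import optuna
import torch.distributed as dist


def tag_trial_with_host(trial):
    # Call inside the objective on rank 0 (the other ranks received None),
    # so every finished trial records which node executed it.
    if dist.get_rank() == 0:
        trial.set_user_attr("hostname", socket.gethostname())


def report_trial_hosts():
    # After the optimization, load the shared study and print which host
    # ran each trial number.
    study = optuna.load_study(
        study_name="example-study", storage="sqlite:///example.db"
    )
    for t in study.trials:
        print(t.number, t.user_attrs.get("hostname"))
```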


StackOverflow crosspost