optuna/optuna

Support distributed training in AllenNLP integration

himkt opened this issue · 2 comments

himkt commented

The current AllenNLPExecutor doesn't support distributed training.
It would be useful if the executor supported distributed training.

Description

AllenNLP supports multi-device distributed training (https://medium.com/ai2-blog/c4d7c17eb6d6).
However, the current AllenNLPExecutor doesn't account for the multi-GPU setting in distributed optimization.

In distributed optimization, multiple processes may try to allocate memory on the same GPUs, which can cause out-of-memory errors. GPUs would need to be scheduled for each process.
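For illustration only, a minimal workaround sketch under the current (non-distributed) executor: each Optuna worker process is confined to one GPU by setting `CUDA_VISIBLE_DEVICES` when the script is launched, so concurrent trials do not contend for the same device. The hyperparameter name, config path, storage URL, and study name below are placeholders, not part of the integration.

```python
import optuna


def objective(trial: optuna.Trial) -> float:
    # Suggested values are passed to the jsonnet config as ext vars,
    # e.g. referenced there via std.extVar("lr"). "lr" is a placeholder name.
    trial.suggest_float("lr", 1e-5, 1e-1, log=True)

    executor = optuna.integration.AllenNLPExecutor(
        trial,
        config_file="config.jsonnet",  # placeholder AllenNLP config path
        serialization_dir=f"result/trial_{trial.number}",
    )
    return executor.run()


if __name__ == "__main__":
    # Each worker process joins the same study through shared storage,
    # so trials are distributed across the launched processes.
    study = optuna.create_study(
        study_name="allennlp-example",
        storage="sqlite:///optuna.db",
        direction="maximize",
        load_if_exists=True,
    )
    study.optimize(objective, n_trials=10)
```

Launching several copies, e.g. `CUDA_VISIBLE_DEVICES=0 python optimize.py` and `CUDA_VISIBLE_DEVICES=1 python optimize.py`, spreads trials across GPUs and avoids the contention described above, but a single trial still cannot use more than one GPU, which is what AllenNLP's distributed mode (and this issue) is about.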

himkt commented

Distributed training in AllenNLP spawns worker processes for training, and AllenNLPExecutor may fail to pass environment variables to them.
(himkt/allennlp-optuna#20)
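As background on that failure mode, a minimal standalone sketch (standard-library multiprocessing with the spawn start method, and a made-up variable name) showing that spawned workers only see environment variables that were exported before they were started:

```python
import multiprocessing as mp
import os


def worker(rank: int) -> None:
    # With the "spawn" start method, a child process inherits the
    # environment only as it existed when the process was started.
    print(f"rank {rank} sees OPTUNA_EXAMPLE_VAR={os.environ.get('OPTUNA_EXAMPLE_VAR')}")


if __name__ == "__main__":
    ctx = mp.get_context("spawn")

    # Set before spawning: visible to the workers.
    os.environ["OPTUNA_EXAMPLE_VAR"] = "set-before-spawn"

    procs = [ctx.Process(target=worker, args=(rank,)) for rank in range(2)]
    for p in procs:
        p.start()

    # Changing the variable now has no effect on the already-spawned workers.
    os.environ["OPTUNA_EXAMPLE_VAR"] = "set-after-spawn"

    for p in procs:
        p.join()
```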

himkt commented

#2977 introduces support for distributed training. This feature will be available in the next release of Optuna.