optuna/optuna

Support distributed training in AllenNLP integration

himkt opened this issue · 2 comments

himkt commented

The current AllenNLPExecutor doesn't support distributed training.
It would be useful if the executor supported distributed training.

Description

AllenNLP supports multi-device distributed training (https://medium.com/ai2-blog/c4d7c17eb6d6).
However, the current AllenNLPExecutor doesn't account for the multi-GPU setting in distributed optimization.

In distributed optimization, multiple processes may try to allocate memory on the same GPUs, which can cause out-of-memory errors. GPUs would need to be scheduled for each process.
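For illustration only, a minimal workaround sketch under the current (non-distributed) executor: each Optuna worker process is confined to one GPU by setting `CUDA_VISIBLE_DEVICES` when the script is launched, so concurrent trials do not contend for the same device. The hyperparameter name, config path, storage URL, and study name below are placeholders, not part of the integration.

```python
import optuna


def objective(trial: optuna.Trial) -> float:
    # Suggested values are passed to the jsonnet config as ext vars,
    # e.g. referenced there via std.extVar("lr"). "lr" is a placeholder name.
    trial.suggest_float("lr", 1e-5, 1e-1, log=True)

    executor = optuna.integration.AllenNLPExecutor(
        trial,
        config_file="config.jsonnet",  # placeholder AllenNLP config path
        serialization_dir=f"result/trial_{trial.number}",
    )
    return executor.run()


if __name__ == "__main__":
    # Each worker process joins the same study through shared storage,
    # so trials are distributed across the launched processes.
    study = optuna.create_study(
        study_name="allennlp-example",
        storage="sqlite:///optuna.db",
        direction="maximize",
        load_if_exists=True,
    )
    study.optimize(objective, n_trials=10)
```

Launching several copies, e.g. `CUDA_VISIBLE_DEVICES=0 python optimize.py` and `CUDA_VISIBLE_DEVICES=1 python optimize.py`, spreads trials across GPUs and avoids the contention described above, but a single trial still cannot use more than one GPU, which is what AllenNLP's distributed mode (and this issue) is about.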

himkt commented

Distributed training in AllenNLP spawns worker processes for training, and AllenNLPExecutor may fail to pass environment variables to them.
(himkt/allennlp-optuna#20)
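As background on that failure mode, a minimal standalone sketch (standard-library multiprocessing with the spawn start method, and a made-up variable name) showing that spawned workers only see environment variables that were exported before they were started:

```python
import multiprocessing as mp
import os


def worker(rank: int) -> None:
    # With the "spawn" start method, a child process inherits the
    # environment only as it existed when the process was started.
    print(f"rank {rank} sees OPTUNA_EXAMPLE_VAR={os.environ.get('OPTUNA_EXAMPLE_VAR')}")


if __name__ == "__main__":
    ctx = mp.get_context("spawn")

    # Set before spawning: visible to the workers.
    os.environ["OPTUNA_EXAMPLE_VAR"] = "set-before-spawn"

    procs = [ctx.Process(target=worker, args=(rank,)) for rank in range(2)]
    for p in procs:
        p.start()

    # Changing the variable now has no effect on the already-spawned workers.
    os.environ["OPTUNA_EXAMPLE_VAR"] = "set-after-spawn"

    for p in procs:
        p.join()
```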

himkt commented

#2977 introduces support for distributed training. This feature will be available in the next release of Optuna.