Multi-GPU support
Closed this issue · 2 comments
🐛 Bug
To Reproduce
Steps to reproduce the behavior:
run python /home/jovyan/lightning-transformers/train.py task=nlp/language_modeling dataset=nlp/language_modeling/wikitext trainer.gpus=2 training.batch_size=8
see the error:
AttributeError: Can't pickle local object 'get_linear_schedule_with_warmup.<locals>.lr_lambda'
Code sample
python /home/jovyan/lightning-transformers/train.py task=nlp/language_modeling dataset=nlp/language_modeling/wikitext trainer.gpus=2 training.batch_size=8
Expected behavior
It should train on 2 GPUs.
Environment
jupyterlab on python 3 using the master branch
Thanks for the issue, marc!
This seems to be due to ddp_spawn being the default for trainer.accelerator when the number of GPUs is greater than 1. A short-term fix is to set trainer=ddp, which will only work from the command line (you must use ddp_spawn in notebooks).
I'll look into a solution regarding ddp_spawn to see if I can keep this as default, but if there are issues with mp.spawn we can default to standard ddp :)
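Concretely, the short-term workaround above would look something like the following from a shell, reusing the reproduction command with the trainer=ddp override suggested above (the exact override name may vary with your config layout):

```shell
# Same reproduction command, but overriding the trainer config to plain DDP
# instead of the ddp_spawn default (works from a shell, not from a notebook).
python /home/jovyan/lightning-transformers/train.py \
  task=nlp/language_modeling \
  dataset=nlp/language_modeling/wikitext \
  trainer.gpus=2 \
  training.batch_size=8 \
  trainer=ddp
```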
Just to track: I've found that the issue is the scheduler defining a local function inside the factory function itself.
Unfortunately this is a difficult problem to resolve. I'll investigate a bit further and see if Flash has resolved this issue.
EDIT: Can confirm Flash also runs into the same issue.
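For reference, the failure mode can be reproduced without torch at all: pickle cannot serialize a function defined locally inside another function, so any object holding a reference to one (as LambdaLR does with the lr_lambda returned by get_linear_schedule_with_warmup) fails exactly this way under ddp_spawn, which pickles state to send it to worker processes. A minimal sketch — LinearWarmupLambda here is a hypothetical picklable alternative for illustration, not the library's API:

```python
import pickle

# Mimics the problem: a factory that returns a *local* function. pickle
# serializes functions by qualified name, and '<locals>' names cannot be
# looked up again on the receiving side, so pickling fails.
def make_scheduler_fn():
    def lr_lambda(step):  # local function -> not picklable
        return 1.0
    return lr_lambda

# A picklable alternative: a top-level callable class carrying the same
# state explicitly (hypothetical sketch of a linear warmup/decay schedule).
class LinearWarmupLambda:
    def __init__(self, num_warmup_steps, num_training_steps):
        self.num_warmup_steps = num_warmup_steps
        self.num_training_steps = num_training_steps

    def __call__(self, current_step):
        if current_step < self.num_warmup_steps:
            return current_step / max(1, self.num_warmup_steps)
        return max(
            0.0,
            (self.num_training_steps - current_step)
            / max(1, self.num_training_steps - self.num_warmup_steps),
        )

try:
    pickle.dumps(make_scheduler_fn())
except AttributeError as e:
    print(e)  # Can't pickle local object 'make_scheduler_fn.<locals>.lr_lambda'

# The class-based version round-trips through pickle without trouble.
restored = pickle.loads(pickle.dumps(LinearWarmupLambda(10, 100)))
print(restored(5))  # mid-warmup -> 0.5
```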