Lightning-Universe/lightning-transformers

Multi-GPU support

Closed this issue · 2 comments

๐Ÿ› Bug

To Reproduce

Steps to reproduce the behavior:

1. Run python /home/jovyan/lightning-transformers/train.py task=nlp/language_modeling dataset=nlp/language_modeling/wikitext trainer.gpus=2 training.batch_size=8
2. See the error:
AttributeError: Can't pickle local object 'get_linear_schedule_with_warmup.<locals>.lr_lambda'
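
For context, here is a minimal standalone sketch of why this fails (get_schedule below is a simplified stand-in for transformers' get_linear_schedule_with_warmup, not the library code): spawning worker processes requires pickling the training state, and a function defined inside another function cannot be pickled.

```python
import pickle

import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import LambdaLR

def get_schedule(optimizer, num_warmup_steps, num_training_steps):
    # Nested function: pickle cannot import it by a module-level name.
    def lr_lambda(current_step):
        if current_step < num_warmup_steps:
            return float(current_step) / float(max(1, num_warmup_steps))
        return max(
            0.0,
            float(num_training_steps - current_step)
            / float(max(1, num_training_steps - num_warmup_steps)),
        )
    return LambdaLR(optimizer, lr_lambda)

optimizer = SGD([torch.zeros(1, requires_grad=True)], lr=0.1)
scheduler = get_schedule(optimizer, num_warmup_steps=10, num_training_steps=100)

# This is the step ddp_spawn performs implicitly, and where it fails:
pickle.dumps(scheduler)
# AttributeError: Can't pickle local object 'get_schedule.<locals>.lr_lambda'
```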

Code sample

python /home/jovyan/lightning-transformers/train.py task=nlp/language_modeling dataset=nlp/language_modeling/wikitext trainer.gpus=2 training.batch_size=8

Expected behavior

It should train on 2 GPUs.

Environment

JupyterLab on Python 3, using the master branch

Thanks for the issue, marc!

This seems to be due to ddp_spawn being the default for trainer.accelerator when the number of GPUs is greater than 1. The short-term fix is to pass trainer.accelerator=ddp, which will only work on the command line (you must use ddp_spawn in notebooks).
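
Assuming the same Hydra override syntax as the reproduction command, and that trainer.accelerator is the config key referenced above, the full invocation would look like:

python /home/jovyan/lightning-transformers/train.py task=nlp/language_modeling dataset=nlp/language_modeling/wikitext trainer.gpus=2 trainer.accelerator=ddp training.batch_size=8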

I'll look into a solution regarding ddp_spawn to see if I can keep it as the default, but if there are issues with mp.spawn we can fall back to standard ddp :)

Just to track: I've found the issue to be the scheduler defining a local function inside the factory function itself (per the traceback, get_linear_schedule_with_warmup builds its lr_lambda as a nested function, which cannot be pickled).
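
For reference, the usual workaround for this class of pickling error (a sketch only; the thread does not confirm this was adopted, and LinearWarmupLambda is a hypothetical name) is to move the schedule logic into a module-level callable class, which pickle can serialize because it is importable by name:

```python
from torch.optim.lr_scheduler import LambdaLR

class LinearWarmupLambda:
    """Picklable replacement for a nested lr_lambda closure."""

    def __init__(self, num_warmup_steps, num_training_steps):
        self.num_warmup_steps = num_warmup_steps
        self.num_training_steps = num_training_steps

    def __call__(self, current_step):
        if current_step < self.num_warmup_steps:
            return float(current_step) / float(max(1, self.num_warmup_steps))
        return max(
            0.0,
            float(self.num_training_steps - current_step)
            / float(max(1, self.num_training_steps - self.num_warmup_steps)),
        )

def get_linear_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps):
    # Instances of top-level classes pickle fine, so ddp_spawn can transfer the scheduler.
    return LambdaLR(optimizer, LinearWarmupLambda(num_warmup_steps, num_training_steps))
```

The complication is that the offending closure lives in the transformers library itself, so applying a fix here would mean wrapping or replacing that function rather than changing lightning-transformers alone.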

Unfortunately this is a difficult problem to resolve. I'll investigate a bit further and see if Flash has resolved this issue.

EDIT: Can confirm Flash also runs into the same issue.