Lightning-Universe/lightning-transformers

trainer deepspeed fails


! pip install git+https://github.com/PytorchLightning/lightning-transformers.git@master --upgrade --quiet
! python train.py task=nlp/language_modeling dataset=nlp/language_modeling/wikitext trainer=deepspeed

traceback:

2021-04-25 13:17:34.274832: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Traceback (most recent call last):
  File "train.py", line 88, in <module>
    hydra_entry()
  File "/usr/local/lib/python3.7/dist-packages/hydra/main.py", line 33, in decorated_main
    config_name=config_name,
  File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/utils.py", line 370, in _run_hydra
    lambda: hydra.run(
  File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/utils.py", line 214, in run_and_report
    raise ex
  File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/utils.py", line 373, in <lambda>
    overrides=args.overrides,
  File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/hydra.py", line 90, in run
    run_mode=RunMode.RUN,
  File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/hydra.py", line 524, in compose_config
    from_shell=from_shell,
  File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/config_loader_impl.py", line 149, in load_configuration
    from_shell=from_shell,
  File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/config_loader_impl.py", line 236, in _load_configuration_impl
    skip_missing=run_mode == RunMode.MULTIRUN,
  File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/defaults_list.py", line 717, in create_defaults_list
    skip_missing=skip_missing,
  File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/defaults_list.py", line 688, in _create_defaults_list
    skip_missing=skip_missing,
  File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/defaults_list.py", line 343, in _create_defaults_tree
    overrides=overrides,
  File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/defaults_list.py", line 420, in _create_defaults_tree_impl
    return _expand_virtual_root(repo, root, overrides, skip_missing)
  File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/defaults_list.py", line 268, in _expand_virtual_root
    overrides=overrides,
  File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/defaults_list.py", line 532, in _create_defaults_tree_impl
    add_child(children, new_root)
  File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/defaults_list.py", line 481, in add_child
    overrides=overrides,
  File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/defaults_list.py", line 532, in _create_defaults_tree_impl
    add_child(children, new_root)
  File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/defaults_list.py", line 481, in add_child
    overrides=overrides,
  File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/defaults_list.py", line 451, in _create_defaults_tree_impl
    config_not_found_error(repo=repo, tree=root)
  File "/usr/local/lib/python3.7/dist-packages/hydra/_internal/defaults_list.py", line 769, in config_not_found_error
    options=options,
hydra.errors.MissingConfigException: In 'trainer/deepspeed': Could not find 'trainer/plugins/zero_offload'

Available options in 'trainer/plugins':
	deepspeed
	deepspeed_offload
	deepspeed_offload_stage_3
	sharded
Config search path:
	provider=hydra, path=pkg://hydra.conf
	provider=main, path=file:///usr/local/lib/python3.7/dist-packages/conf
	provider=schema, path=structured://

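Thanks for the issue! We need to update the DeepSpeed trainer config. Until that is fixed, the preferred approach is to select the DeepSpeed plugin directly with trainer/plugins=deepspeed instead of trainer=deepspeed: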

! pip install git+https://github.com/PytorchLightning/lightning-transformers.git@master --upgrade --quiet
! python train.py task=nlp/language_modeling dataset=nlp/language_modeling/wikitext trainer/plugins=deepspeed
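
For anyone else hitting this error, here is a rough sketch of the kind of lookup that fails. It is not Hydra's actual implementation. It only assumes that each option in a config group is a YAML file named after the option. The helper names (`available_options`, `resolve_option`, `make_demo_conf`) and the temporary directory layout are hypothetical; the four option names are the ones listed in the error message above.

```python
import tempfile
from pathlib import Path


def available_options(conf_root, group):
    """Return the sorted option names (YAML file stems) found in a config group."""
    group_dir = Path(conf_root) / group
    return sorted(p.stem for p in group_dir.glob("*.yaml"))


def resolve_option(conf_root, group, option):
    """Return the YAML path for group/option, or raise listing the valid choices."""
    path = Path(conf_root) / group / f"{option}.yaml"
    if not path.is_file():
        choices = ", ".join(available_options(conf_root, group))
        raise FileNotFoundError(
            f"Could not find '{group}/{option}'. Available options: {choices}"
        )
    return path


def make_demo_conf(root):
    """Build a throwaway config tree with the options named in the error message."""
    group_dir = Path(root) / "trainer" / "plugins"
    group_dir.mkdir(parents=True)
    for name in ("deepspeed", "deepspeed_offload",
                 "deepspeed_offload_stage_3", "sharded"):
        (group_dir / f"{name}.yaml").write_text("# placeholder\n")
    return root


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        make_demo_conf(tmp)
        # An option that exists resolves to its file.
        print(resolve_option(tmp, "trainer/plugins", "deepspeed").name)
        # The stale name referenced by trainer/deepspeed does not.
        try:
            resolve_option(tmp, "trainer/plugins", "zero_offload")
        except FileNotFoundError as err:
            print(err)
```

In this sketch, `zero_offload` fails because no file with that name exists in the group, which matches the "Available options" list in the traceback. Passing trainer/plugins=deepspeed picks one of the options that does exist.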