NeMo 2.0 nemorun llm export ValueError: PyTorch DDP is not enabled for mcore optimizer
lifeiteng commented
Describe the bug
Exporting a Llama-3-8B checkpoint that was just imported from Hugging Face fails: `nemorun llm export` raises `ValueError: PyTorch DDP is not enabled for mcore optimizer` from `MegatronStrategy.connect` while the exporter loads the NeMo checkpoint.
Steps/Code to reproduce bug
nemorun llm import llama3_8b hf://meta-llama/Meta-Llama-3-8B -y
nemorun llm export ~/.cache/nemo/models/meta-llama/Meta-Llama-3-8B/context hf exp/PreTrain/export_llama3_8b -y
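For reference, the failing export can also be invoked from Python. Below is a minimal sketch based on the `export_ckpt` call visible in the traceback (`path, target, output_path, overwrite, load_connector`); the keyword names are taken from the dry-run argument table, and the `from nemo.collections import llm` import path is an assumption:

```python
# Sketch of the equivalent Python call; signature taken from the traceback's
# io.export_ckpt(path, target, output_path, overwrite, load_connector).
from pathlib import Path

from nemo.collections import llm

llm.export_ckpt(
    path=Path("~/.cache/nemo/models/meta-llama/Meta-Llama-3-8B/context").expanduser(),
    target="hf",
    output_path=Path("exp/PreTrain/export_llama3_8b"),
)
```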
Error output
Dry run for task nemo.collections.llm.api:export_ckpt
Resolved Arguments
┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Argument Name ┃ Resolved Value ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ load_connector │ <function load_connector_from_trainer_ckpt at │
│ │ 0x7feaf6adfbe0> │
│ output_path │ PosixPath('exp/PreTrain/export_llama3_8b') │
│ overwrite │ False │
│ path │ PosixPath('/root/.cache/nemo/models/meta-llama/Meta-Llama-3… │
│ target │ 'hf' │
└──────────────────────┴──────────────────────────────────────────────────────────────┘
Launching None...
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
GPU available: True (cuda), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
[NeMo W 2024-10-18 09:16:11 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/setup.py:177: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.
[the three "GPU/TPU/HPU available" lines above are printed four more times]
[ERROR | root ]: An error occurred: PyTorch DDP is not enabled for mcore optimizer
Traceback (most recent call last):
File "/usr/local/bin/nemorun", line 8, in <module>
sys.exit(app())
File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 326, in __call__
raise e
File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 309, in __call__
return get_command(self)(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 723, in main
return _main(
File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 193, in _main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 692, in wrapper
return callback(**use_params)
File "/home/lifeiteng/code/NeMo-Run/src/nemo_run/cli/api.py", line 793, in command
self.cli_execute(fn, ctx.args, type)
File "/home/lifeiteng/code/NeMo-Run/src/nemo_run/cli/api.py", line 845, in cli_execute
self._execute_task(fn, filtered_args)
File "/home/lifeiteng/code/NeMo-Run/src/nemo_run/cli/api.py", line 895, in _execute_task
run_task()
File "/home/lifeiteng/code/NeMo-Run/src/nemo_run/cli/api.py", line 874, in run_task
run.run(
File "/home/lifeiteng/code/NeMo-Run/src/nemo_run/run/api.py", line 65, in run
direct_run_fn(fn_or_script, dryrun=dryrun)
File "/home/lifeiteng/code/NeMo-Run/src/nemo_run/run/task.py", line 77, in direct_run_fn
built_fn()
File "/home/lifeiteng/code/NeMo/nemo/collections/llm/api.py", line 432, in export_ckpt
return io.export_ckpt(path, target, output_path, overwrite, load_connector)
File "/home/lifeiteng/code/NeMo/nemo/lightning/io/api.py", line 197, in export_ckpt
return exporter(overwrite=overwrite, output_path=_output_path)
File "/home/lifeiteng/code/NeMo/nemo/lightning/io/connector.py", line 85, in __call__
to_return = self.apply(_output_path)
File "/home/lifeiteng/code/NeMo/nemo/collections/llm/gpt/model/llama.py", line 301, in apply
source, _ = self.nemo_load(str(self))
File "/home/lifeiteng/code/NeMo/nemo/lightning/io/connector.py", line 216, in nemo_load
_trainer.strategy.connect(model)
File "/home/lifeiteng/code/NeMo/nemo/lightning/pytorch/strategies/megatron_strategy.py", line 286, in connect
raise ValueError("PyTorch DDP is not enabled for mcore optimizer")
ValueError: PyTorch DDP is not enabled for mcore optimizer
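For context, the exception comes from the guard at `megatron_strategy.py:286`: `export_ckpt` ends up in `nemo_load`, which builds a fresh trainer and calls `_trainer.strategy.connect(model)`, and `connect` refuses a model that carries a Megatron-core optimizer config when the strategy has no DDP config. A hypothetical sketch of that check, reconstructed from the error message rather than copied from NeMo source (attribute names are assumptions):

```python
# Hypothetical reconstruction of the failing guard in MegatronStrategy.connect;
# model.optim.config and self.ddp_config are assumed attribute names.
def connect(self, model):
    opt_config = getattr(getattr(model, "optim", None), "config", None)
    if opt_config is not None and self.ddp_config is None:
        # The trainer created inside nemo_load has no DDP config, so loading
        # a model that still holds an mcore optimizer config trips this check.
        raise ValueError("PyTorch DDP is not enabled for mcore optimizer")
```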
Expected behavior
The export of the Meta-Llama-3-8B model should complete without errors, producing a Hugging Face-format checkpoint at the specified output path.
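Once the export works, the output directory should be a standard Hugging Face checkpoint. A quick sanity check, assuming the usual HF file layout:

```python
# Verify the exported checkpoint loads with Hugging Face transformers
# (assumes the export produced a standard HF checkpoint directory).
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("exp/PreTrain/export_llama3_8b")
tokenizer = AutoTokenizer.from_pretrained("exp/PreTrain/export_llama3_8b")
print(model.config.model_type)  # expect "llama"
```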
akoumpa commented
Thanks for reporting this bug, will look into it ASAP and push a fix.