NVIDIA/NeMo-Framework-Launcher

Does NeMo-Megatron-Launcher support training in a bare-metal environment?

zigzagcai opened this issue · 1 comments

Hello, I want to run NeMo-Megatron-Launcher on a Slurm cluster where I do not have root access (so the Docker engine cannot be installed), but I can't find a reference guide for training in a bare-metal environment.
I tried to install the packages listed in the provided Dockerfile, but either some packages failed to install or the code crashed.
Could you please provide some hints for running NeMo-Megatron-Launcher in a bare-metal environment? Thanks!

Update:
I have made some changes, and the main branch code now runs in a bare-metal environment.
For anyone else who wants to run NeMo in a bare-metal environment, FYI, see these branches:
https://github.com/zigzagcai/NeMo-Megatron-Launcher/tree/baremetal_run
https://github.com/zigzagcai/NeMo/tree/baremetal_run
https://github.com/zigzagcai/Megatron-LM/tree/baremetal_run
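
Since the failures described above came from package installation and crashes at import time, a quick sanity check before submitting a job may help. The following is only a minimal sketch, not part of the launcher or of the branches linked above: the helper name `check_imports` is hypothetical, and the module list (`torch`, `apex`, `transformer_engine`, `nemo`, `megatron`) is an assumption about what a bare-metal setup typically needs, so adjust it to match the Dockerfile you are following.

```python
import importlib


def check_imports(module_names):
    """Try to import each module and report the outcome.

    Returns a dict mapping module name -> None on success,
    or a short error string if the import failed.
    """
    results = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            results[name] = None
        except Exception as exc:  # broken builds can raise more than ImportError
            results[name] = f"{type(exc).__name__}: {exc}"
    return results


if __name__ == "__main__":
    # Assumed module list; edit to match your own environment.
    candidates = ["torch", "apex", "transformer_engine", "nemo", "megatron"]
    for module, error in check_imports(candidates).items():
        status = "OK" if error is None else f"FAILED ({error})"
        print(f"{module}: {status}")
```

Running this inside the same Python environment that the Slurm job will use (for example on a compute node) shows which packages still need to be built or fixed before launching training.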