epfLLM/Megatron-LLM

[Megatron Base Version] Would you mind sharing the base version of Megatron?

dumpmemory opened this issue · 7 comments

I have found that the code for get_checkpoint_name(s) and DistributedOptimizer is different. The upstream version has fixed many bugs. Would you mind rebasing?

_copy_model_params_to_main_params is missing in DistributedOptimizer

and the logic for getting the distributed optimizer checkpoint name is also different.

In the current code, if I add --use_distributed_optimizer, there is an error: "data parallel group is not initialized".
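For reference, that message comes from Megatron's parallel-state bookkeeping: the distributed optimizer path needs the data-parallel process group, which only exists after initialize_model_parallel() has run. A minimal sketch of where the assertion fires, assuming this fork keeps upstream Megatron-LM's megatron.core.parallel_state module (the single-process setup below is only for illustration):

```python
import torch
from megatron.core import parallel_state

# Single-process torch.distributed setup, just to make the example runnable.
torch.distributed.init_process_group(
    backend="gloo", init_method="tcp://127.0.0.1:29500", world_size=1, rank=0
)

try:
    # Called before initialize_model_parallel() -> AssertionError:
    # "data parallel group is not initialized"
    parallel_state.get_data_parallel_group()
except AssertionError as e:
    print(e)

# Once the model/data parallel groups are set up, the same call succeeds.
parallel_state.initialize_model_parallel(
    tensor_model_parallel_size=1, pipeline_model_parallel_size=1
)
print(parallel_state.get_data_parallel_group())
```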

#68 add missing function
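As a purely hypothetical illustration of the general pattern (invented names, not the actual #68 patch): in Megatron-style mixed-precision optimizers this kind of hook re-syncs the optimizer's fp32 "main" copies from the model's fp16/bf16 params, e.g. after a checkpoint load overwrites the model weights in place.

```python
import torch

def copy_model_params_to_main_params(model_param_groups, main_param_groups):
    """Refresh the fp32 main params from the model params, group by group.
    Hypothetical sketch only; names do not correspond to the real code."""
    for model_group, main_group in zip(model_param_groups, main_param_groups):
        for model_param, main_param in zip(model_group, main_group):
            # fp16/bf16 -> fp32 copy (dtype conversion handled by copy_)
            main_param.data.copy_(model_param.data)

# Tiny usage example with dummy tensors:
model_groups = [[torch.nn.Parameter(torch.randn(4, dtype=torch.float16))]]
main_groups = [[p.detach().clone().float() for p in g] for g in model_groups]
model_groups[0][0].data.mul_(2)                 # model weights change in place
copy_model_params_to_main_params(model_groups, main_groups)
print(main_groups[0][0])                        # now matches the updated model param
```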

the base version of NVIDIA/Megatron-LM that our code branched off of dates from March 28, 2023.

since then the structure of the base repo code has also been refactored a bit by the NVIDIA team. in terms of functionality, though, not much has changed. or could you point to a concrete bug which is present in our code and not in their updated one, and which impacts usage?

so far we haven't encountered bugs in our training with Llama2 models of all sizes.

of course it would be best to rebase the code on top of the newest megatron-lm, but this would take quite some effort. if anyone would like to help prepare the code, that would be more than welcome

if i have time, i am willing to do so. would you mind providing the commit hash you modified from? the current repo has removed the git history from NVIDIA/Megatron-LM

Hi, sorry, we seem to have lost the actual commit, but we're pretty sure it's 035cae2ef9cc770784a3c3f2f46ecf9cd0d1380c, based on the timing.


I met the same issue: when I use --use_distributed_optimizer, it errors with "data parallel group is not initialized". How can I solve it?

pls see this #68