Issues
How to set up FP8 training?
#817 opened by yangzhipeng1108 - 0
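For #817: Megatron-LM's FP8 path runs through NVIDIA Transformer Engine, and the pretraining scripts normally switch it on via FP8-related command-line arguments (the exact flag names, e.g. an --fp8-format style option, vary by version and are an assumption here). Below is a minimal stand-alone sketch of the underlying Transformer Engine mechanism, not Megatron-LM's own training code:

```python
# Minimal sketch: FP8 forward/backward with Transformer Engine's delayed-scaling
# recipe. This illustrates the mechanism Megatron-LM builds on; it is not the
# repository's training loop, and sizes/hyperparameters are placeholders.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# Hybrid format: E4M3 in the forward pass, E5M2 for gradients in the backward pass.
fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID,
                            amax_history_len=16,
                            amax_compute_algo="max")

layer = te.Linear(1024, 1024, bias=True).cuda()
inp = torch.randn(16, 1024, device="cuda", dtype=torch.bfloat16)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = layer(inp)
out.sum().backward()
```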
[BUG]: If I run test_serialization.py repeatedly, there is a small chance it gets stuck
#825 opened by starkhu - 0
Does Megatron plan to support LLaMA pre-training?
#824 opened by wen020 - 0
Liliti stk 3.6.9 artificial intelligence project 🤖
#822 opened by felipeliliti - 0
Liliti stk 3.6.9 artificial intelligence project
#821 opened by felipeliliti - 1
Suppose I contribute to the project, but I am in Brazil and so far have earned nothing working as a data scientist; how do I earn some money to feed my family?
#808 opened by felipeliliti - 0
Executive MBA | IIT Roorkee | Coursera
#819 opened by felipeliliti - 0
Liliti stk 3.6.9 multimodal artificial intelligence project to bring world peace
#820 opened by felipeliliti - 2
Megatron-LM for LLaMa3
#818 opened by SDsly - 2
[BUG] Typo in drop_policy options in moe_utils.py
#815 opened by Malikeh97 - 1
[BUG] [MoE] Typo in Token Drop policy's default value
#812 opened by passaglia - 2
[QUESTION] RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=2, worker_count=1, timeout=0:10:00)
#782 opened by JanryPei - 4
[BUG] The distributed optimizer doesn't work when the data parallel size is an odd number.
#792 opened by okoge-kaz - 1
[BUG] Environment: Megatron 0.5.0 + TE 1.4. I started pretraining a model without the --use-mcore-models option. Later I needed the CP feature to fine-tune on longer sequences, which required enabling --use-mcore-models; however, the previously pre-trained checkpoint could not be loaded and an error occurred.
#793 opened by liangshaopeng - 0
[core dataset compilation error]
#807 opened by shamanez - 5
[QUESTION] How to pre-build the dataset's index?
#795 opened by etiennemlb - 2
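For #795: depending on what is meant by "index", this may refer to the tokenized .bin/.idx files or to the per-split sample indices that Megatron builds lazily when training starts. A hedged sketch of pre-building the former offline with tools/preprocess_data.py (flags can differ between versions; file names are placeholders):

```python
# Sketch: build Megatron's indexed dataset (.bin/.idx) ahead of time instead of
# on the first training run. Paths and tokenizer files below are placeholders.
import subprocess

subprocess.run(
    [
        "python", "tools/preprocess_data.py",
        "--input", "corpus.jsonl",             # one JSON object per line with a "text" field
        "--output-prefix", "my_corpus",        # yields roughly my_corpus_text_document.bin/.idx
        "--tokenizer-type", "GPT2BPETokenizer",
        "--vocab-file", "gpt2-vocab.json",
        "--merge-file", "gpt2-merges.txt",
        "--append-eod",
        "--workers", "8",
    ],
    check=True,
)
```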
[BUG] Bug of expert model parallel
#766 opened by 1049451037 - 0
[BUG] Example of pretraining BERT does not work
#791 opened by xju2 - 0
[QUESTION] bf16 Parameters and fp32 Gradients
#800 opened by pluiez - 0
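For #800: Megatron keeps model parameters in bf16 while gradients are accumulated into fp32 buffers, and the optimizer updates fp32 master weights. The snippet below only illustrates that pattern in plain PyTorch; the main_grad attribute name follows Megatron's convention, but the code is not the repository's implementation:

```python
# Illustrative sketch: a bf16 parameter whose gradients are accumulated into a
# persistent fp32 buffer via a backward hook (not Megatron's actual code).
import torch

param = torch.nn.Parameter(torch.randn(4, 4, dtype=torch.bfloat16))
param.main_grad = torch.zeros_like(param, dtype=torch.float32)  # fp32 accumulator

def accumulate_into_main_grad(grad):
    # Called when the gradient w.r.t. `param` is computed; add it in fp32.
    param.main_grad += grad.float()
    return grad

param.register_hook(accumulate_into_main_grad)

loss = (param.float() ** 2).sum()
loss.backward()

print(param.grad.dtype, param.main_grad.dtype)  # torch.bfloat16 torch.float32
```

The optimizer step would then read param.main_grad, update the fp32 master weights, and copy the result back into the bf16 parameter.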
Why doesn't M-Core use flash attention?
#799 opened by Life-0-1 - 0
One H100 OOMs when using Megatron to train Llama 2 70B. How can two H100s be used to train Llama 2 70B?
#783 opened by yangzhipeng1108 - 15
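For #783: a rough back-of-the-envelope estimate (ignoring activations and assuming the common ~16 bytes per parameter for mixed-precision Adam) shows why two 80 GB H100s cannot hold a 70B model for full training, and why more GPUs with tensor/pipeline parallelism, or offloading, are needed:

```python
# Back-of-the-envelope memory estimate for full mixed-precision Adam training
# of a 70B-parameter model; activations and framework overhead are ignored.
import math

params = 70e9
bytes_per_param = 16   # 2 (bf16 weights) + 2 (bf16 grads) + 4 (fp32 master) + 8 (fp32 Adam m/v)
total_gib = params * bytes_per_param / 1024**3
h100_gib = 80

print(f"~{total_gib:.0f} GiB of weight/grad/optimizer state")
print(f"needs at least {math.ceil(total_gib / h100_gib)} x 80 GB H100s, before activations")
```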
[QUESTION] Training Mixtral 8x7B on 16 x H100 only achieves a low throughput of 130 TFLOPS
#756 opened by ShinoharaHare - 1
When will the MoE checkpoint conversion script be available?
#790 opened by shamanez - 0
[QUESTION] Validation loss & PPL keep going up
#787 opened by zhentingqi - 2
[QUESTION] found NaN in local grad norm in backward pass before data-parallel communication collective
#780 opened by ftgreat - 1
[QUESTION] Why does megatron-core seem slower and use more GPU memory than legacy for gpt_pretrain?
#770 opened by REIGN12 - 0
[QUESTION] Is it expected that the grad norm is computed separately for the dense optimizer and the MoE optimizer?
#785 opened by ezioliao - 4
[BUG] Bug with the Megatron-core, transformer-impl, and flash-attention options.
#778 opened by Baibaifan - 1
[QUESTION] Is PackedSeqParams still under development?
#771 opened by XLzed - 2
[QUESTION] vicuna-7b-v1.5 weight conversion from huggingface to megatron-lm format
#773 opened by uehara-mech - 2
[BUG] The gradient allreduce/reduce-scatter operation is performed twice when overlap_grad_reduce is False
#775 opened by sandyhouse - 0
[QUESTION] Why should the pipeline-model-parallel size be greater than 2 with the interleaved schedule?
#750 opened by nullnonenilNULL - 2
[BUG] ModuleNotFoundError: No module named 'megatron.training.tokenizer'; 'megatron.training' is not a package
#763 opened by hellangleZ - 4
Loss mask uses torch.float32 instead of bool
#754 opened by pilot7747
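For #754: background on why the mask is float at all — the loss is reduced as a masked average, and a float mask can be multiplied and summed directly, whereas a bool mask needs a cast (or indexing) first. An illustrative sketch of the pattern, not the repository's exact code:

```python
# Masked loss reduction: float mask vs. bool mask (illustrative only).
import torch

losses = torch.rand(4, 8)                      # per-token losses [batch, seq]
loss_mask = (torch.rand(4, 8) > 0.2).float()   # 1.0 where the token contributes, else 0.0

masked_mean = (losses * loss_mask).sum() / loss_mask.sum()

# The same reduction with a bool mask (roughly 1/4 of the mask memory):
bool_mask = loss_mask.bool()
masked_mean_bool = losses[bool_mask].mean()

print(torch.allclose(masked_mean, masked_mean_bool))  # True
```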