NVIDIA/Megatron-LM

[QUESTION] Training Mixtral 8x7B on 16 x H100 only achieves low throughput of 130 TFLOPS

Opened this issue · 22 comments

As the title says, I wonder if this is normal.
If not, how should I optimize it?

Logs
using world size: 16, data-parallel size: 4, context-parallel size: 1 tensor-model-parallel size: 4, pipeline-model-parallel size: 1 
WARNING: overriding default arguments for tokenizer_type:GPT2BPETokenizer                        with tokenizer_type:Llama2Tokenizer
WARNING: Setting args.overlap_p2p_comm to False since non-interleaved schedule does not support overlapping p2p communication
accumulate and all-reduce gradients in fp32 for bfloat16 data type.
using torch.bfloat16 for parameters ...
------------------------ arguments ------------------------
  accumulate_allreduce_grads_in_fp32 .............. True
  adam_beta1 ...................................... 0.9
  adam_beta2 ...................................... 0.999
  adam_eps ........................................ 1e-08
  add_bias_linear ................................. False
  add_position_embedding .......................... False
  add_qkv_bias .................................... False
  adlr_autoresume ................................. False
  adlr_autoresume_interval ........................ 1000
  apply_layernorm_1p .............................. False
  apply_query_key_layer_scaling ................... False
  apply_residual_connection_post_layernorm ........ False
  apply_rope_fusion ............................... True
  async_tensor_model_parallel_allreduce ........... False
  attention_dropout ............................... 0.0
  attention_softmax_in_fp32 ....................... False
  auto_detect_ckpt_format ......................... False
  barrier_with_L1_time ............................ True
  bert_binary_head ................................ True
  bert_embedder_type .............................. megatron
  bert_load ....................................... None
  bf16 ............................................ True
  bias_dropout_fusion ............................. True
  bias_gelu_fusion ................................ False
  bias_swiglu_fusion .............................. True
  biencoder_projection_dim ........................ 0
  biencoder_shared_query_context_model ............ False
  block_data_path ................................. None
  check_for_nan_in_loss_and_grad .................. True
  classes_fraction ................................ 1.0
  clip_grad ....................................... 1.0
  clone_scatter_output_in_embedding ............... True
  consumed_train_samples .......................... 0
  consumed_valid_samples .......................... 0
  context_parallel_size ........................... 1
  create_attention_mask_in_dataloader ............. True
  data_cache_path ................................. None
  data_parallel_random_init ....................... False
  data_parallel_size .............................. 4
  data_path ....................................... []
  data_per_class_fraction ......................... 1.0
  data_sharding ................................... True
  dataloader_type ................................. single
  decoder_num_layers .............................. None
  decoder_seq_length .............................. None
  delay_grad_reduce ............................... True
  delay_param_gather .............................. False
  dino_bottleneck_size ............................ 256
  dino_freeze_last_layer .......................... 1
  dino_head_hidden_size ........................... 2048
  dino_local_crops_number ......................... 10
  dino_local_img_size ............................. 96
  dino_norm_last_layer ............................ False
  dino_teacher_temp ............................... 0.07
  dino_warmup_teacher_temp ........................ 0.04
  dino_warmup_teacher_temp_epochs ................. 30
  dist_ckpt_format ................................ torch_dist
  distribute_saved_activations .................... False
  distributed_backend ............................. nccl
  distributed_timeout_minutes ..................... 10
  embedding_path .................................. None
  empty_unused_memory_level ....................... 0
  enable_one_logger ............................... False
  encoder_num_layers .............................. 32
  encoder_seq_length .............................. 2048
  end_weight_decay ................................ 0.1
  eod_mask_loss ................................... False
  eval_interval ................................... 1000
  eval_iters ...................................... 10
  evidence_data_path .............................. None
  exit_duration_in_mins ........................... None
  exit_interval ................................... None
  exit_on_missing_checkpoint ...................... False
  exit_signal_handler ............................. False
  expert_model_parallel_size ...................... 4
  ffn_hidden_size ................................. 14336
  finetune ........................................ False
  fp16 ............................................ False
  fp16_lm_cross_entropy ........................... False
  fp32_residual_connection ........................ False
  fp8 ............................................. None
  fp8_amax_compute_algo ........................... most_recent
  fp8_amax_history_len ............................ 1
  fp8_interval .................................... 1
  fp8_margin ...................................... 0
  fp8_wgrad ....................................... True
  global_batch_size ............................... 128
  gradient_accumulation_fusion .................... True
  group_query_attention ........................... True
  head_lr_mult .................................... 1.0
  hidden_dropout .................................. 0.0
  hidden_size ..................................... 4096
  hysteresis ...................................... 2
  ict_head_size ................................... None
  ict_load ........................................ None
  img_h ........................................... 224
  img_w ........................................... 224
  indexer_batch_size .............................. 128
  indexer_log_interval ............................ 1000
  inference_batch_times_seqlen_threshold .......... 512
  init_method_std ................................. 0.02
  init_method_xavier_uniform ...................... False
  initial_loss_scale .............................. 4294967296
  iter_per_epoch .................................. 1250
  kv_channels ..................................... 128
  lazy_mpu_init ................................... None
  load ............................................ custom/ckpt/mixtral-8x7b
  local_rank ...................................... None
  log_batch_size_to_tensorboard ................... False
  log_interval .................................... 1
  log_learning_rate_to_tensorboard ................ True
  log_loss_scale_to_tensorboard ................... True
  log_memory_to_tensorboard ....................... False
  log_num_zeros_in_grad ........................... False
  log_params_norm ................................. False
  log_progress .................................... True
  log_throughput .................................. True
  log_timers_to_tensorboard ....................... False
  log_validation_ppl_to_tensorboard ............... False
  log_world_size_to_tensorboard ................... False
  loss_scale ...................................... None
  loss_scale_window ............................... 1000
  lr .............................................. 0.0001
  lr_decay_iters .................................. 320000
  lr_decay_samples ................................ None
  lr_decay_style .................................. cosine
  lr_warmup_fraction .............................. None
  lr_warmup_init .................................. 0.0
  lr_warmup_iters ................................. 500
  lr_warmup_samples ............................... 0
  make_vocab_size_divisible_by .................... 128
  manual_gc ....................................... False
  manual_gc_eval .................................. True
  manual_gc_interval .............................. 0
  mask_factor ..................................... 1.0
  mask_prob ....................................... 0.15
  mask_type ....................................... random
  masked_softmax_fusion ........................... False
  max_position_embeddings ......................... 32768
  max_tokens_to_oom ............................... 12000
  merge_file ...................................... None
  micro_batch_size ................................ 1
  min_loss_scale .................................. 1.0
  min_lr .......................................... 1e-05
  mmap_bin_files .................................. True
  mock_data ....................................... True
  moe_aux_loss_coeff .............................. 0.01
  moe_grouped_gemm ................................ True
  moe_input_jitter_eps ............................ None
  moe_router_load_balancing_type .................. aux_loss
  moe_router_topk ................................. 2
  moe_token_dropping .............................. False
  moe_z_loss_coeff ................................ None
  nccl_communicator_config_path ................... None
  no_load_optim ................................... True
  no_load_rng ..................................... True
  no_persist_layer_norm ........................... False
  no_save_optim ................................... None
  no_save_rng ..................................... None
  norm_epsilon .................................... 1e-05
  normalization ................................... RMSNorm
  num_attention_heads ............................. 32
  num_channels .................................... 3
  num_classes ..................................... 1000
  num_experts ..................................... 8
  num_layers ...................................... 32
  num_layers_per_virtual_pipeline_stage ........... None
  num_query_groups ................................ 8
  num_workers ..................................... 2
  one_logger_entity ............................... hwinf_dcm
  one_logger_project .............................. e2e-tracking
  one_logger_run_name ............................. None
  onnx_safe ....................................... None
  openai_gelu ..................................... False
  optimizer ....................................... adam
  output_bert_embeddings .......................... False
  overlap_grad_reduce ............................. False
  overlap_p2p_comm ................................ False
  overlap_param_gather ............................ False
  override_opt_param_scheduler .................... False
  params_dtype .................................... torch.bfloat16
  patch_dim ....................................... 16
  perform_initialization .......................... True
  pipeline_model_parallel_size .................... 1
  pipeline_model_parallel_split_rank .............. None
  position_embedding_type ......................... rope
  profile ......................................... True
  profile_ranks ................................... [0]
  profile_step_end ................................ 12
  profile_step_start .............................. 10
  qk_layernorm .................................... False
  query_in_block_prob ............................. 0.1
  rampup_batch_size ............................... None
  rank ............................................ 0
  recompute_granularity ........................... None
  recompute_method ................................ None
  recompute_num_layers ............................ None
  reset_attention_mask ............................ False
  reset_position_ids .............................. False
  retriever_report_topk_accuracies ................ []
  retriever_score_scaling ......................... False
  retriever_seq_length ............................ 256
  retro_add_retriever ............................. False
  retro_attention_gate ............................ 1
  retro_cyclic_train_iters ........................ None
  retro_encoder_attention_dropout ................. 0.1
  retro_encoder_hidden_dropout .................... 0.1
  retro_encoder_layers ............................ 2
  retro_num_neighbors ............................. 2
  retro_num_retrieved_chunks ...................... 2
  retro_project_dir ............................... None
  retro_verify_neighbor_count ..................... True
  rotary_interleaved .............................. False
  rotary_percent .................................. 1.0
  rotary_seq_len_interpolation_factor ............. None
  sample_rate ..................................... 1.0
  save ............................................ custom/ckpt/mixtral-8x7b
  save_interval ................................... 10000
  scatter_gather_tensors_in_pipeline .............. True
  seed ............................................ 1234
  seq_length ...................................... 2048
  sequence_parallel ............................... True
  sgd_momentum .................................... 0.9
  short_seq_prob .................................. 0.1
  skip_train ...................................... False
  spec ............................................ None
  split ........................................... 99990,8,2
  squared_relu .................................... False
  standalone_embedding_stage ...................... False
  start_weight_decay .............................. 0.1
  swiglu .......................................... True
  swin_backbone_type .............................. tiny
  tensor_model_parallel_size ...................... 4
  tensorboard_dir ................................. custom/ckpt/mixtral-8x7b/tensorboard
  tensorboard_log_interval ........................ 1
  tensorboard_queue_size .......................... 1000
  test_data_path .................................. None
  test_mode ....................................... False
  timing_log_level ................................ 0
  timing_log_option ............................... minmax
  titles_data_path ................................ None
  tokenizer_model ................................. tokenizer.model
  tokenizer_type .................................. Llama2Tokenizer
  tp_comm_bulk_dgrad .............................. True
  tp_comm_bulk_wgrad .............................. True
  tp_comm_overlap ................................. False
  tp_comm_overlap_cfg ............................. None
  tp_comm_split_ag ................................ True
  tp_comm_split_rs ................................ True
  train_data_path ................................. None
  train_iters ..................................... 500000
  train_samples ................................... None
  transformer_impl ................................ transformer_engine
  transformer_pipeline_model_parallel_size ........ 1
  untie_embeddings_and_output_weights ............. True
  use_checkpoint_args ............................. False
  use_checkpoint_opt_param_scheduler .............. False
  use_cpu_initialization .......................... True
  use_dist_ckpt ................................... False
  use_distributed_optimizer ....................... True
  use_flash_attn .................................. True
  use_gpu_initialization .......................... None
  use_mcore_models ................................ True
  use_one_sent_docs ............................... False
  use_ring_exchange_p2p ........................... False
  use_rotary_position_embeddings .................. False
  valid_data_path ................................. None
  variable_seq_lengths ............................ False
  virtual_pipeline_model_parallel_size ............ None
  vision_backbone_type ............................ vit
  vision_pretraining .............................. False
  vision_pretraining_type ......................... classify
  vocab_extra_ids ................................. 0
  vocab_file ...................................... None
  vocab_size ...................................... None
  wandb_exp_name .................................. mixtral-8x7b
  wandb_project ................................... megatron
  wandb_save_dir .................................. 
  weight_decay .................................... 0.1
  weight_decay_incr_style ......................... constant
  world_size ...................................... 16
  yaml_cfg ........................................ None
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 32
> building Llama2Tokenizer tokenizer ...
 > padded vocab (size: 32000) with 256 dummy tokens (new size: 32256)
> initializing torch distributed ...
> initialized tensor model parallel with size 4
> initialized pipeline model parallel with size 1
> setting random seeds to 1234 ...
> compiling dataset index builder ...
>>> done with dataset index builder. Compilation time: 0.087 seconds
WARNING: constraints for invoking optimized fused softmax kernel are not met. We default back to unfused kernel invocations.
> compiling and loading fused kernels ...
>>> done with compiling and loading fused kernels. Compilation time: 7.672 seconds
[rank0-15]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())  (same warning emitted by all 16 ranks)
time to initialize megatron (seconds): 16.718
[after megatron is initialized] datetime: 2024-03-30 19:59:35 
building GPT model ...
 > number of parameters on (tensor, pipeline) model parallel rank (3, 0): 3221491712
 > number of parameters on (tensor, pipeline) model parallel rank (1, 0): 3221491712
 > number of parameters on (tensor, pipeline) model parallel rank (2, 0): 3221491712
 > number of parameters on (tensor, pipeline) model parallel rank (0, 0): 3221491712
INFO:megatron.core.distributed.param_and_grad_buffer:Number of buckets for gradient all-reduce / reduce-scatter: 1
INFO:megatron.core.distributed.param_and_grad_buffer:Params for bucket 1 (402919424 elements):
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.22.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.14.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.10.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.8.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.6.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.18.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.31.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.14.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.12.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.8.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.26.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.13.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.31.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.28.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.24.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.21.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.9.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.embedding.word_embeddings.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.17.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.14.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.11.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.10.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.6.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.5.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.5.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.4.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.31.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.19.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.6.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.30.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.23.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.20.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.8.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.7.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.21.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.20.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.12.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.9.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.30.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.27.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.10.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.31.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.27.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.23.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.22.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.19.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.18.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.16.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.13.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.11.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.0.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.30.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.18.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.11.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.29.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.21.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.19.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.16.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.15.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.13.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.12.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.9.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.4.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.3.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.26.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.24.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.21.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.15.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.9.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.4.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.29.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.26.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.17.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.14.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.3.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.1.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.17.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.30.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.25.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.22.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.20.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.18.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.1.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.29.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.23.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.20.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.5.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.1.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.final_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.28.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.15.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.14.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.15.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.0.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.28.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.6.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.2.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.29.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.11.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.7.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.6.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.31.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.28.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.13.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.8.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.1.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.18.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.17.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.27.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.10.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.5.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.2.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.10.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.2.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.27.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.25.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.20.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.16.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.8.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.7.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.4.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.3.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.0.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.28.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.23.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.19.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.13.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.12.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.11.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.30.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.27.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.25.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.7.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.4.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.2.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.0.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.output_layer.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.25.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.22.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.21.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.16.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.9.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.23.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.2.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.0.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.26.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.17.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.26.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.25.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.24.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.24.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.19.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.7.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.5.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.3.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.1.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.29.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.24.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.12.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.22.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.16.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.15.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.3.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:Number of buckets for gradient all-reduce / reduce-scatter: 1
INFO:megatron.core.distributed.param_and_grad_buffer:Params for bucket 1 (2818572288 elements):
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.28.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.26.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.24.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.20.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.18.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.7.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.4.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.31.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.29.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.26.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.25.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.15.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.19.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.18.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.11.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.6.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.3.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.1.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.2.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.22.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.17.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.15.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.10.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.5.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.27.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.12.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.8.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.7.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.30.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.28.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.24.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.16.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.13.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.31.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.23.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.19.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.6.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.22.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.16.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.14.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.12.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.10.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.9.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.5.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.21.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.13.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.14.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.8.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.29.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.27.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.25.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.20.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.17.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.0.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.30.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.23.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.3.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.1.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.0.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.11.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.21.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.9.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.2.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.4.mlp.experts.weight1
INFO:megatron.core.optimizer:Setting up optimizer with OptimizerConfig(fp16=False, bf16=True, params_dtype=torch.bfloat16, optimizer='adam', lr=0.0001, loss_scale=None, initial_loss_scale=4294967296, min_loss_scale=1.0, loss_scale_window=1000, hysteresis=2, weight_decay=0.1, adam_beta1=0.9, adam_beta2=0.999, adam_eps=1e-08, sgd_momentum=0.9, use_distributed_optimizer=True, overlap_grad_reduce=False, overlap_param_gather=False, clip_grad=1.0, log_num_zeros_in_grad=False, barrier_with_L1_time=True, timers=<megatron.core.timers.Timers object at 0x2b1d5617b280>)
> learning rate decay style: cosine
WARNING: could not find the metadata file custom/ckpt/mixtral-8x7b/latest_checkpointed_iteration.txt 
    will not load any checkpoints and will start from random
> setting tensorboard ...
(min, max) time across ranks (ms):
    load-checkpoint ................................: (0.65, 0.93)
[after model, optimizer, and learning rate scheduler are built] datetime: 2024-03-30 20:00:43 
> building train, validation, and test datasets ...
 > datasets target sizes (minimum size):
    train:      64000000
    validation: 641280
    test:       1280
INFO:megatron.core.datasets.blended_megatron_dataset_config:mock = True
> building train, validation, and test datasets for GPT ...
> finished creating GPT datasets ...
[after dataloaders are built] datetime: 2024-03-30 20:00:43 
done with setup ...
training ...
(min, max) time across ranks (ms):
    model-and-optimizer-setup ......................: (67432.26, 67517.36)
    train/valid/test-data-iterators-setup ..........: (3.74, 338.85)
[before the start of training step] datetime: 2024-03-30 20:00:43 
[Rank 0] (after 1 iterations) memory (MB) | allocated: 51955.275390625 | max allocated: 51955.291015625 | reserved: 62292.0 | max reserved: 62292.0
[Rank 1] (after 1 iterations) memory (MB) | allocated: 51955.275390625 | max allocated: 51955.291015625 | reserved: 62292.0 | max reserved: 62292.0
 [2024-03-30 20:01:13] iteration        1/  500000 | consumed samples:          128 | elapsed time per iteration (ms): 29428.3 | throughput per GPU (TFLOP/s/GPU): 44.4 | learning rate: 2.000E-07 | global batch size:   128 | lm loss: 1.038043E+01 | loss scale: 1.0 | grad norm: 526.452 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
[Rank 3] (after 1 iterations) memory (MB) | allocated: 51955.275390625 | max allocated: 51955.291015625 | reserved: 62294.0 | max reserved: 62294.0
[Rank 2] (after 1 iterations) memory (MB) | allocated: 51955.275390625 | max allocated: 51955.291015625 | reserved: 62294.0 | max reserved: 62294.0
 [2024-03-30 20:01:22] iteration        2/  500000 | consumed samples:          256 | elapsed time per iteration (ms): 9845.6 | throughput per GPU (TFLOP/s/GPU): 132.6 | learning rate: 4.000E-07 | global batch size:   128 | lm loss: 1.047649E+01 | loss scale: 1.0 | grad norm: 506.118 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-03-30 20:01:32] iteration        3/  500000 | consumed samples:          384 | elapsed time per iteration (ms): 9638.2 | throughput per GPU (TFLOP/s/GPU): 135.5 | learning rate: 6.000E-07 | global batch size:   128 | lm loss: 1.027612E+01 | loss scale: 1.0 | grad norm: 519.891 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-03-30 20:01:42] iteration        4/  500000 | consumed samples:          512 | elapsed time per iteration (ms): 9702.8 | throughput per GPU (TFLOP/s/GPU): 134.6 | learning rate: 8.000E-07 | global batch size:   128 | lm loss: 9.807467E+00 | loss scale: 1.0 | grad norm: 517.413 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-03-30 20:01:51] iteration        5/  500000 | consumed samples:          640 | elapsed time per iteration (ms): 9683.9 | throughput per GPU (TFLOP/s/GPU): 134.9 | learning rate: 1.000E-06 | global batch size:   128 | lm loss: 7.764119E+00 | loss scale: 1.0 | grad norm: 492.510 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-03-30 20:02:01] iteration        6/  500000 | consumed samples:          768 | elapsed time per iteration (ms): 9675.6 | throughput per GPU (TFLOP/s/GPU): 135.0 | learning rate: 1.200E-06 | global batch size:   128 | lm loss: 2.630678E+00 | loss scale: 1.0 | grad norm: 323.002 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-03-30 20:02:11] iteration        7/  500000 | consumed samples:          896 | elapsed time per iteration (ms): 9454.1 | throughput per GPU (TFLOP/s/GPU): 138.1 | learning rate: 1.400E-06 | global batch size:   128 | lm loss: 1.398795E+00 | loss scale: 1.0 | grad norm: 213.771 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-03-30 20:02:20] iteration        8/  500000 | consumed samples:         1024 | elapsed time per iteration (ms): 9471.4 | throughput per GPU (TFLOP/s/GPU): 137.9 | learning rate: 1.600E-06 | global batch size:   128 | lm loss: 1.726107E+00 | loss scale: 1.0 | grad norm: 420.698 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-03-30 20:02:30] iteration        9/  500000 | consumed samples:         1152 | elapsed time per iteration (ms): 10085.2 | throughput per GPU (TFLOP/s/GPU): 129.5 | learning rate: 1.800E-06 | global batch size:   128 | lm loss: 2.890289E-01 | loss scale: 1.0 | grad norm: 83.644 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-03-30 20:02:40] iteration       10/  500000 | consumed samples:         1280 | elapsed time per iteration (ms): 9496.4 | throughput per GPU (TFLOP/s/GPU): 137.5 | learning rate: 2.000E-06 | global batch size:   128 | lm loss: 2.092005E-01 | loss scale: 1.0 | grad norm: 51.010 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-03-30 20:02:50] iteration       11/  500000 | consumed samples:         1408 | elapsed time per iteration (ms): 10036.9 | throughput per GPU (TFLOP/s/GPU): 130.1 | learning rate: 2.200E-06 | global batch size:   128 | lm loss: 2.352597E-01 | loss scale: 1.0 | grad norm: 106.730 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-03-30 20:03:00] iteration       12/  500000 | consumed samples:         1536 | elapsed time per iteration (ms): 10198.4 | throughput per GPU (TFLOP/s/GPU): 128.1 | learning rate: 2.400E-06 | global batch size:   128 | lm loss: 7.243721E-01 | loss scale: 1.0 | grad norm: 163.466 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-03-30 20:03:10] iteration       13/  500000 | consumed samples:         1664 | elapsed time per iteration (ms): 10269.3 | throughput per GPU (TFLOP/s/GPU): 127.2 | learning rate: 2.600E-06 | global batch size:   128 | lm loss: 1.757669E+00 | loss scale: 1.0 | grad norm: 356.809 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-03-30 20:03:21] iteration       14/  500000 | consumed samples:         1792 | elapsed time per iteration (ms): 10330.7 | throughput per GPU (TFLOP/s/GPU): 126.4 | learning rate: 2.800E-06 | global batch size:   128 | lm loss: 2.853365E-01 | loss scale: 1.0 | grad norm: 93.354 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-03-30 20:03:31] iteration       15/  500000 | consumed samples:         1920 | elapsed time per iteration (ms): 10106.2 | throughput per GPU (TFLOP/s/GPU): 129.2 | learning rate: 3.000E-06 | global batch size:   128 | lm loss: 5.018836E-01 | loss scale: 1.0 | grad norm: 165.646 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-03-30 20:03:41] iteration       16/  500000 | consumed samples:         2048 | elapsed time per iteration (ms): 10102.3 | throughput per GPU (TFLOP/s/GPU): 129.3 | learning rate: 3.200E-06 | global batch size:   128 | lm loss: 9.302688E-01 | loss scale: 1.0 | grad norm: 170.065 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-03-30 20:03:51] iteration       17/  500000 | consumed samples:         2176 | elapsed time per iteration (ms): 9946.1 | throughput per GPU (TFLOP/s/GPU): 131.3 | learning rate: 3.400E-06 | global batch size:   128 | lm loss: 8.015128E-02 | loss scale: 1.0 | grad norm: 47.780 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |

Thank you for reporting this issue. 130 TFLOPS is indeed too low for the H100.
I quickly reviewed your script and have some suggestions (a flag sketch follows the list):

  1. Update the code to the latest main branch and upgrade grouped_gemm to v1.0.
  2. Use the alltoall dispatcher: --moe-token-dispatcher-type alltoall.
  3. Use EP=8, TP=2.
  4. Train for a while (at least 400 steps) before checking performance, or load a pretrained checkpoint. In the early stage the router weights are not yet sufficiently trained, which leads to an imbalanced token distribution across experts.
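
For concreteness, here is a minimal sketch of suggestions 1-3 as launch flags. The parallelism and MoE flag names appear in the argument dump above; the launcher invocation, node layout, and remaining arguments are placeholders, not the reporter's actual script:

```bash
# Suggestion 1: upgrade the grouped GEMM kernels (package name per the
# fanshiqing/grouped_gemm repo cited in the follow-up below).
pip install grouped_gemm==1.0

# Suggestions 2 and 3: EP=8, TP=2, alltoall token dispatcher.
# Assumes 2 nodes x 8 GPUs = world size 16; everything besides the
# flags shown is a placeholder.
torchrun --nproc_per_node 8 --nnodes 2 pretrain_gpt.py \
    --tensor-model-parallel-size 2 \
    --expert-model-parallel-size 8 \
    --moe-token-dispatcher-type alltoall \
    --moe-grouped-gemm \
    "${OTHER_ARGS[@]}"   # remaining model/data/training args unchanged
```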

Hi, thanks for the suggestions.
I retested the throughput following your suggestions.
To be more specific:

  1. Updated Megatron-LM to the latest commit (ba77325)
  2. Updated grouped_gemm to v1.0.0 (fanshiqing/grouped_gemm@7a7f018)
  3. Set --moe-token-dispatcher-type alltoall
  4. Switched to EP=8 & TP=2
  5. Used the pretrained weights from Mistral AI (converted from the HF checkpoint)

The throughput has indeed increased significantly, reaching around 230 TFLOP/s.
However, for an H100 that's still pretty low, isn't it?
May I ask what throughput would be reasonable in theory?
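
For reference, one way to frame the question: dividing the reported TFLOP/s by the GPU's dense BF16 peak gives a model FLOPs utilization (MFU). Assuming a dense BF16 peak of roughly 989 TFLOP/s for the H100 SXM (the exact figure depends on SKU and clocks):

```bash
# Back-of-envelope MFU; the ~989 TFLOP/s dense BF16 peak is an assumed
# H100 SXM figure, not a number from this thread.
python -c "print(f'MFU ~ {230/989:.1%}')"   # ~23%
```

By that yardstick 230 TFLOP/s is roughly 23% MFU, whereas well-tuned dense GPT runs on H100 are often reported around 40-50% MFU, with MoE models typically somewhat lower due to the token dispatch/combine communication.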

Here are the logs:
using world size: 16, data-parallel size: 8, context-parallel size: 1 tensor-model-parallel size: 2, pipeline-model-parallel size: 1 
WARNING: overriding default arguments for tokenizer_type:GPT2BPETokenizer                        with tokenizer_type:Llama2Tokenizer
WARNING: Setting args.overlap_p2p_comm to False since non-interleaved schedule does not support overlapping p2p communication
accumulate and all-reduce gradients in fp32 for bfloat16 data type.
using torch.bfloat16 for parameters ...
------------------------ arguments ------------------------
  accumulate_allreduce_grads_in_fp32 .............. True
  adam_beta1 ...................................... 0.9
  adam_beta2 ...................................... 0.999
  adam_eps ........................................ 1e-08
  add_bias_linear ................................. False
  add_position_embedding .......................... True
  add_qkv_bias .................................... False
  adlr_autoresume ................................. False
  adlr_autoresume_interval ........................ 1000
  apply_layernorm_1p .............................. False
  apply_query_key_layer_scaling ................... False
  apply_residual_connection_post_layernorm ........ False
  apply_rope_fusion ............................... True
  async_tensor_model_parallel_allreduce ........... False
  attention_dropout ............................... 0.0
  attention_softmax_in_fp32 ....................... False
  auto_detect_ckpt_format ......................... False
  barrier_with_L1_time ............................ True
  bert_binary_head ................................ True
  bert_embedder_type .............................. megatron
  bert_load ....................................... None
  bf16 ............................................ True
  bias_dropout_fusion ............................. True
  bias_gelu_fusion ................................ False
  bias_swiglu_fusion .............................. True
  biencoder_projection_dim ........................ 0
  biencoder_shared_query_context_model ............ False
  block_data_path ................................. None
  check_for_nan_in_loss_and_grad .................. True
  ckpt_fully_parallel_save ........................ False
  ckpt_step ....................................... None
  classes_fraction ................................ 1.0
  clip_grad ....................................... 1.0
  clone_scatter_output_in_embedding ............... True
  consumed_train_samples .......................... 0
  consumed_valid_samples .......................... 0
  context_parallel_size ........................... 1
  create_attention_mask_in_dataloader ............. True
  data_cache_path ................................. None
  data_parallel_random_init ....................... False
  data_parallel_size .............................. 8
  data_path ....................................... ['custom/data/wudao/wudao_mistralbpe_content_document']
  data_per_class_fraction ......................... 1.0
  data_sharding ................................... True
  dataloader_type ................................. single
  decoder_num_layers .............................. None
  decoder_seq_length .............................. None
  decoupled_lr .................................... None
  decoupled_min_lr ................................ None
  delay_grad_reduce ............................... True
  delay_param_gather .............................. False
  dino_bottleneck_size ............................ 256
  dino_freeze_last_layer .......................... 1
  dino_head_hidden_size ........................... 2048
  dino_local_crops_number ......................... 10
  dino_local_img_size ............................. 96
  dino_norm_last_layer ............................ False
  dino_teacher_temp ............................... 0.07
  dino_warmup_teacher_temp ........................ 0.04
  dino_warmup_teacher_temp_epochs ................. 30
  dist_ckpt_format ................................ torch_dist
  distribute_saved_activations .................... False
  distributed_backend ............................. nccl
  distributed_timeout_minutes ..................... 10
  embedding_path .................................. None
  empty_unused_memory_level ....................... 0
  enable_one_logger ............................... False
  encoder_num_layers .............................. 32
  encoder_seq_length .............................. 2048
  end_weight_decay ................................ 0.1
  eod_mask_loss ................................... False
  eval_interval ................................... 1000
  eval_iters ...................................... 1
  evidence_data_path .............................. None
  exit_duration_in_mins ........................... None
  exit_interval ................................... None
  exit_on_missing_checkpoint ...................... False
  exit_signal_handler ............................. False
  expert_model_parallel_size ...................... 8
  ffn_hidden_size ................................. 14336
  finetune ........................................ False
  fp16 ............................................ False
  fp16_lm_cross_entropy ........................... False
  fp32_residual_connection ........................ False
  fp8 ............................................. None
  fp8_amax_compute_algo ........................... most_recent
  fp8_amax_history_len ............................ 1
  fp8_interval .................................... 1
  fp8_margin ...................................... 0
  fp8_wgrad ....................................... True
  global_batch_size ............................... 128
  gradient_accumulation_fusion .................... True
  group_query_attention ........................... True
  head_lr_mult .................................... 1.0
  hidden_dropout .................................. 0.0
  hidden_size ..................................... 4096
  hysteresis ...................................... 2
  ict_head_size ................................... None
  ict_load ........................................ None
  img_h ........................................... 224
  img_w ........................................... 224
  indexer_batch_size .............................. 128
  indexer_log_interval ............................ 1000
  inference_batch_times_seqlen_threshold .......... 512
  init_method_std ................................. 0.02
  init_method_xavier_uniform ...................... False
  initial_loss_scale .............................. 4294967296
  iter_per_epoch .................................. 1250
  kv_channels ..................................... 128
  lazy_mpu_init ................................... None
  load ............................................ custom/ckpt/mixtral-8x7b-tp2-ep8-mgg
  local_rank ...................................... None
  log_batch_size_to_tensorboard ................... False
  log_interval .................................... 1
  log_learning_rate_to_tensorboard ................ True
  log_loss_scale_to_tensorboard ................... True
  log_memory_to_tensorboard ....................... False
  log_num_zeros_in_grad ........................... False
  log_params_norm ................................. False
  log_progress .................................... True
  log_throughput .................................. True
  log_timers_to_tensorboard ....................... False
  log_validation_ppl_to_tensorboard ............... False
  log_world_size_to_tensorboard ................... False
  loss_scale ...................................... None
  loss_scale_window ............................... 1000
  lr .............................................. 0.0001
  lr_decay_iters .................................. 320000
  lr_decay_samples ................................ None
  lr_decay_style .................................. cosine
  lr_warmup_fraction .............................. None
  lr_warmup_init .................................. 0.0
  lr_warmup_iters ................................. 500
  lr_warmup_samples ............................... 0
  make_vocab_size_divisible_by .................... 128
  manual_gc ....................................... False
  manual_gc_eval .................................. True
  manual_gc_interval .............................. 0
  mask_factor ..................................... 1.0
  mask_prob ....................................... 0.15
  mask_type ....................................... random
  masked_softmax_fusion ........................... False
  max_position_embeddings ......................... 32768
  max_tokens_to_oom ............................... 12000
  merge_file ...................................... None
  micro_batch_size ................................ 1
  min_loss_scale .................................. 1.0
  min_lr .......................................... 1e-05
  mmap_bin_files .................................. True
  mock_data ....................................... False
  moe_aux_loss_coeff .............................. 0.01
  moe_grouped_gemm ................................ True
  moe_input_jitter_eps ............................ None
  moe_per_layer_logging ........................... False
  moe_router_load_balancing_type .................. aux_loss
  moe_router_topk ................................. 2
  moe_token_dispatcher_type ....................... alltoall
  moe_token_dropping .............................. False
  moe_z_loss_coeff ................................ None
  nccl_communicator_config_path ................... None
  no_load_optim ................................... True
  no_load_rng ..................................... True
  no_persist_layer_norm ........................... False
  no_save_optim ................................... None
  no_save_rng ..................................... None
  norm_epsilon .................................... 1e-05
  normalization ................................... RMSNorm
  num_attention_heads ............................. 32
  num_channels .................................... 3
  num_classes ..................................... 1000
  num_experts ..................................... 8
  num_layers ...................................... 32
  num_layers_per_virtual_pipeline_stage ........... None
  num_query_groups ................................ 8
  num_workers ..................................... 2
  one_logger_entity ............................... hwinf_dcm
  one_logger_project .............................. e2e-tracking
  one_logger_run_name ............................. None
  onnx_safe ....................................... None
  openai_gelu ..................................... False
  optimizer ....................................... adam
  output_bert_embeddings .......................... False
  overlap_grad_reduce ............................. False
  overlap_p2p_comm ................................ False
  overlap_param_gather ............................ False
  override_opt_param_scheduler .................... False
  params_dtype .................................... torch.bfloat16
  patch_dim ....................................... 16
  perform_initialization .......................... True
  pipeline_model_parallel_size .................... 1
  pipeline_model_parallel_split_rank .............. None
  position_embedding_type ......................... rope
  pretrained_checkpoint ........................... None
  profile ......................................... True
  profile_ranks ................................... [0]
  profile_step_end ................................ 12
  profile_step_start .............................. 10
  qk_layernorm .................................... False
  query_in_block_prob ............................. 0.1
  rampup_batch_size ............................... None
  rank ............................................ 0
  recompute_granularity ........................... None
  recompute_method ................................ None
  recompute_num_layers ............................ None
  reset_attention_mask ............................ False
  reset_position_ids .............................. False
  retriever_report_topk_accuracies ................ []
  retriever_score_scaling ......................... False
  retriever_seq_length ............................ 256
  retro_add_retriever ............................. False
  retro_attention_gate ............................ 1
  retro_cyclic_train_iters ........................ None
  retro_encoder_attention_dropout ................. 0.1
  retro_encoder_hidden_dropout .................... 0.1
  retro_encoder_layers ............................ 2
  retro_num_neighbors ............................. 2
  retro_num_retrieved_chunks ...................... 2
  retro_project_dir ............................... None
  retro_verify_neighbor_count ..................... True
  rotary_interleaved .............................. False
  rotary_percent .................................. 1.0
  rotary_seq_len_interpolation_factor ............. None
  sample_rate ..................................... 1.0
  save ............................................ custom/ckpt/mixtral-8x7b-tp2-ep8-mgg
  save_interval ................................... 1000
  scatter_gather_tensors_in_pipeline .............. True
  seed ............................................ 1234
  seq_length ...................................... 2048
  sequence_parallel ............................... True
  sgd_momentum .................................... 0.9
  short_seq_prob .................................. 0.1
  skip_train ...................................... False
  spec ............................................ None
  split ........................................... 99990,8,2
  squared_relu .................................... False
  standalone_embedding_stage ...................... False
  start_weight_decay .............................. 0.1
  swiglu .......................................... True
  swin_backbone_type .............................. tiny
  tensor_model_parallel_size ...................... 2
  tensorboard_dir ................................. custom/ckpt/mixtral-8x7b-tp2-ep8-mgg/tensorboard
  tensorboard_log_interval ........................ 1
  tensorboard_queue_size .......................... 1000
  test_data_path .................................. None
  test_mode ....................................... False
  timing_log_level ................................ 0
  timing_log_option ............................... minmax
  titles_data_path ................................ None
  tokenizer_model ................................. custom/ckpt/mixtral-8x7b/tokenizer.model
  tokenizer_type .................................. Llama2Tokenizer
  tp_comm_bulk_dgrad .............................. True
  tp_comm_bulk_wgrad .............................. True
  tp_comm_overlap ................................. False
  tp_comm_overlap_ag .............................. True
  tp_comm_overlap_cfg ............................. None
  tp_comm_overlap_rs .............................. True
  tp_comm_split_ag ................................ True
  tp_comm_split_rs ................................ True
  train_data_path ................................. None
  train_iters ..................................... 100
  train_samples ................................... None
  transformer_impl ................................ transformer_engine
  transformer_pipeline_model_parallel_size ........ 1
  untie_embeddings_and_output_weights ............. True
  use_checkpoint_args ............................. False
  use_checkpoint_opt_param_scheduler .............. False
  use_cpu_initialization .......................... None
  use_dist_ckpt ................................... False
  use_distributed_optimizer ....................... True
  use_flash_attn .................................. True
  use_mcore_models ................................ True
  use_one_sent_docs ............................... False
  use_ring_exchange_p2p ........................... False
  use_rotary_position_embeddings .................. False
  valid_data_path ................................. None
  variable_seq_lengths ............................ False
  virtual_pipeline_model_parallel_size ............ None
  vision_backbone_type ............................ vit
  vision_pretraining .............................. False
  vision_pretraining_type ......................... classify
  vocab_extra_ids ................................. 0
  vocab_file ...................................... None
  vocab_size ...................................... None
  wandb_exp_name .................................. 
  wandb_project ................................... 
  wandb_save_dir .................................. 
  weight_decay .................................... 0.1
  weight_decay_incr_style ......................... constant
  world_size ...................................... 16
  yaml_cfg ........................................ None
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 16
> building Llama2Tokenizer tokenizer ...
 > padded vocab (size: 32000) with 0 dummy tokens (new size: 32000)
> initializing torch distributed ...
make: Entering directory '.../Megatron-LM/megatron/core/datasets'
make: Nothing to be done for 'default'.
make: Leaving directory '.../Megatron-LM/megatron/core/datasets'
> initialized tensor model parallel with size 2
> initialized pipeline model parallel with size 1
> setting random seeds to 1234 ...
> compiling dataset index builder ...
>>> done with dataset index builder. Compilation time: 0.104 seconds
WARNING: constraints for invoking optimized fused softmax kernel are not met. We default back to unfused kernel invocations.
> compiling and loading fused kernels ...
>>> done with compiling and loading fused kernels. Compilation time: 7.866 seconds
[rank1]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank8]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank2]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank9]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank10]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank0]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank3]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank11]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank4]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank12]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank5]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank13]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank6]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank7]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank14]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank15]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
time to initialize megatron (seconds): 14.235
[after megatron is initialized] datetime: 2024-04-06 02:54:57 
building GPT model ...
 > number of parameters on (tensor, pipeline) model parallel rank (1, 0): 3622047744
 > number of parameters on (tensor, pipeline) model parallel rank (0, 0): 3622047744
INFO:megatron.core.distributed.param_and_grad_buffer:Number of buckets for gradient all-reduce / reduce-scatter: 1
INFO:megatron.core.distributed.param_and_grad_buffer:Params for bucket 1 (803475456 elements):
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.29.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.23.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.10.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.embedding.word_embeddings.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.final_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.28.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.27.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.21.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.18.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.2.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.2.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.14.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.5.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.30.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.28.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.14.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.8.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.3.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.31.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.25.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.24.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.19.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.17.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.12.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.3.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.2.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.output_layer.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.17.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.8.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.31.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.25.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.22.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.19.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.16.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.15.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.11.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.10.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.0.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.27.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.21.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.15.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.10.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.2.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.27.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.22.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.21.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.18.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.0.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.30.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.20.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.13.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.4.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.30.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.24.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.7.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.29.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.28.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.22.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.16.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.13.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.12.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.6.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.4.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.0.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.31.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.16.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.15.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.12.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.9.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.8.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.6.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.6.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.18.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.26.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.25.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.12.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.4.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.1.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.1.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.19.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.18.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.12.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.9.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.5.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.4.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.3.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.1.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.28.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.26.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.23.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.20.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.7.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.29.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.23.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.19.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.11.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.7.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.5.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.31.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.19.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.14.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.11.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.8.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.3.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.31.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.29.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.25.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.25.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.24.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.22.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.17.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.16.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.14.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.28.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.22.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.11.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.9.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.26.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.24.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.20.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.17.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.11.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.10.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.27.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.21.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.20.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.15.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.13.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.13.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.0.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.24.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.23.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.13.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.10.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.5.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.4.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.0.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.30.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.29.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.27.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.21.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.15.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.6.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.16.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.6.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.30.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.14.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.9.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.3.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.1.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.26.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.9.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.17.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.26.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.20.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.8.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.7.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.5.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.2.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.23.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.18.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.7.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.1.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:Number of buckets for gradient all-reduce / reduce-scatter: 1
INFO:megatron.core.distributed.param_and_grad_buffer:Params for bucket 1 (2818572288 elements):
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.26.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.13.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.12.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.3.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.20.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.11.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.30.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.24.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.18.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.17.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.12.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.4.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.1.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.29.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.23.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.16.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.6.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.5.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.27.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.21.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.15.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.9.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.26.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.20.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.14.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.8.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.1.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.31.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.30.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.25.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.19.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.18.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.0.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.29.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.23.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.22.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.7.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.5.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.0.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.28.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.15.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.3.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.10.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.27.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.21.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.14.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.13.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.8.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.2.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.31.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.25.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.19.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.24.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.22.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.17.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.11.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.7.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.4.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.28.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.16.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.6.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.2.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.10.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.9.mlp.experts.weight1
INFO:megatron.core.optimizer:Setting up optimizer with OptimizerConfig(optimizer='adam', lr=0.0001, min_lr=1e-05, decoupled_lr=None, decoupled_min_lr=None, weight_decay=0.1, fp16=False, bf16=True, params_dtype=torch.bfloat16, loss_scale=None, initial_loss_scale=4294967296, min_loss_scale=1.0, loss_scale_window=1000, hysteresis=2, adam_beta1=0.9, adam_beta2=0.999, adam_eps=1e-08, sgd_momentum=0.9, use_distributed_optimizer=True, overlap_grad_reduce=False, overlap_param_gather=False, clip_grad=1.0, log_num_zeros_in_grad=False, barrier_with_L1_time=True, timers=<megatron.core.timers.Timers object at 0x2b366837d3f0>)
> learning rate decay style: cosine
 loading release checkpoint from custom/ckpt/mixtral-8x7b-tp2-ep8-mgg
could not find arguments in the checkpoint ...
 checkpoint version 0
 succesfully fixed query-key-values ordering for checkpoint version 0
  successfully loaded checkpoint from custom/ckpt/mixtral-8x7b-tp2-ep8-mgg [ t 0, p 0 ] at iteration 0
> setting tensorboard ...
(min, max) time across ranks (ms):
    load-checkpoint ................................: (8126.15, 8126.65)
[after model, optimizer, and learning rate scheduler are built] datetime: 2024-04-06 02:55:06 
> building train, validation, and test datasets ...
 > datasets target sizes (minimum size):
    train:      12800
    validation: 128
    test:       128
INFO:megatron.core.datasets.blended_megatron_dataset_config:mock = False
INFO:megatron.core.datasets.blended_megatron_dataset_config:Let split_matrix = [(0, 0.9999), (0.9999, 0.99998), (0.99998, 1.0)]
> building train, validation, and test datasets for GPT ...
INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from custom/data/wudao/wudao_mistralbpe_content_document.idx
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 59132211
INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 59132211
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
INFO:megatron.core.datasets.gpt_dataset:	Load the document index from cc3235b81bd7fd0fa07cabe05d15043d-GPTDataset-document_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from cc3235b81bd7fd0fa07cabe05d15043d-GPTDataset-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from cc3235b81bd7fd0fa07cabe05d15043d-GPTDataset-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 40201537
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset valid indices
INFO:megatron.core.datasets.gpt_dataset:	Load the document index from a625518736b8143e22f4f34c6682183e-GPTDataset-document_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from a625518736b8143e22f4f34c6682183e-GPTDataset-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from a625518736b8143e22f4f34c6682183e-GPTDataset-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 6204
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset test indices
INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 052434ed70ae721ed70b2219cf2deb88-GPTDataset-document_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 052434ed70ae721ed70b2219cf2deb88-GPTDataset-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 052434ed70ae721ed70b2219cf2deb88-GPTDataset-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 2332
> finished creating GPT datasets ...
[after dataloaders are built] datetime: 2024-04-06 02:55:07 
done with setup ...
(min, max) time across ranks (ms):
    model-and-optimizer-setup ......................: (8592.94, 8605.02)
    train/valid/test-data-iterators-setup ..........: (569.02, 865.21)
training ...
[before the start of training step] datetime: 2024-04-06 02:55:07 
Number of parameters in transformer layers in billions:  46.44
Number of parameters in embedding layers in billions: 0.26
Total number of parameters in billions: 46.70
Number of parameters in most loaded shard in billions: 23.3510
Theoretical memory footprints: weight and optimizer=167019.40 MB
[Rank 0] (after 1 iterations) memory (MB) | allocated: 54250.97802734375 | max allocated: 54250.98583984375 | reserved: 61470.0 | max reserved: 61470.0
 [2024-04-06 02:55:39] iteration        1/     100 | consumed samples:          128 | elapsed time per iteration (ms): 32269.4 | throughput per GPU (TFLOP/s/GPU): 40.5 | learning rate: 2.000000E-07 | global batch size:   128 | lm loss: 1.985617E+00 | load_balancing_loss: 1.089786E+00 | loss scale: 1.0 | grad norm: 6.396 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
[Rank 1] (after 1 iterations) memory (MB) | allocated: 54250.97802734375 | max allocated: 54250.98583984375 | reserved: 61480.0 | max reserved: 61480.0
 [2024-04-06 02:55:45] iteration        2/     100 | consumed samples:          256 | elapsed time per iteration (ms): 5630.1 | throughput per GPU (TFLOP/s/GPU): 231.9 | learning rate: 4.000000E-07 | global batch size:   128 | lm loss: 2.021530E+00 | load_balancing_loss: 1.087362E+00 | loss scale: 1.0 | grad norm: 6.895 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:55:50] iteration        3/     100 | consumed samples:          384 | elapsed time per iteration (ms): 5410.6 | throughput per GPU (TFLOP/s/GPU): 241.4 | learning rate: 6.000000E-07 | global batch size:   128 | lm loss: 2.003316E+00 | load_balancing_loss: 1.085377E+00 | loss scale: 1.0 | grad norm: 6.603 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:55:55] iteration        4/     100 | consumed samples:          512 | elapsed time per iteration (ms): 5364.1 | throughput per GPU (TFLOP/s/GPU): 243.5 | learning rate: 8.000000E-07 | global batch size:   128 | lm loss: 2.009657E+00 | load_balancing_loss: 1.091695E+00 | loss scale: 1.0 | grad norm: 6.619 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:56:01] iteration        5/     100 | consumed samples:          640 | elapsed time per iteration (ms): 5496.7 | throughput per GPU (TFLOP/s/GPU): 237.6 | learning rate: 1.000000E-06 | global batch size:   128 | lm loss: 2.002326E+00 | load_balancing_loss: 1.091539E+00 | loss scale: 1.0 | grad norm: 6.612 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:56:06] iteration        6/     100 | consumed samples:          768 | elapsed time per iteration (ms): 5364.8 | throughput per GPU (TFLOP/s/GPU): 243.4 | learning rate: 1.200000E-06 | global batch size:   128 | lm loss: 1.933151E+00 | load_balancing_loss: 1.086472E+00 | loss scale: 1.0 | grad norm: 5.765 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:56:12] iteration        7/     100 | consumed samples:          896 | elapsed time per iteration (ms): 5682.7 | throughput per GPU (TFLOP/s/GPU): 229.8 | learning rate: 1.400000E-06 | global batch size:   128 | lm loss: 2.016085E+00 | load_balancing_loss: 1.085193E+00 | loss scale: 1.0 | grad norm: 5.821 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:56:17] iteration        8/     100 | consumed samples:         1024 | elapsed time per iteration (ms): 5408.6 | throughput per GPU (TFLOP/s/GPU): 241.4 | learning rate: 1.600000E-06 | global batch size:   128 | lm loss: 1.965713E+00 | load_balancing_loss: 1.080933E+00 | loss scale: 1.0 | grad norm: 4.774 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:56:23] iteration        9/     100 | consumed samples:         1152 | elapsed time per iteration (ms): 5590.1 | throughput per GPU (TFLOP/s/GPU): 233.6 | learning rate: 1.800000E-06 | global batch size:   128 | lm loss: 1.919308E+00 | load_balancing_loss: 1.089582E+00 | loss scale: 1.0 | grad norm: 4.267 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:56:28] iteration       10/     100 | consumed samples:         1280 | elapsed time per iteration (ms): 5443.7 | throughput per GPU (TFLOP/s/GPU): 239.9 | learning rate: 2.000000E-06 | global batch size:   128 | lm loss: 1.978377E+00 | load_balancing_loss: 1.089948E+00 | loss scale: 1.0 | grad norm: 4.069 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:56:34] iteration       11/     100 | consumed samples:         1408 | elapsed time per iteration (ms): 5984.1 | throughput per GPU (TFLOP/s/GPU): 218.2 | learning rate: 2.200000E-06 | global batch size:   128 | lm loss: 1.889895E+00 | load_balancing_loss: 1.083618E+00 | loss scale: 1.0 | grad norm: 3.361 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:56:40] iteration       12/     100 | consumed samples:         1536 | elapsed time per iteration (ms): 5821.8 | throughput per GPU (TFLOP/s/GPU): 224.3 | learning rate: 2.400000E-06 | global batch size:   128 | lm loss: 1.932808E+00 | load_balancing_loss: 1.085315E+00 | loss scale: 1.0 | grad norm: 3.336 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:56:46] iteration       13/     100 | consumed samples:         1664 | elapsed time per iteration (ms): 5962.2 | throughput per GPU (TFLOP/s/GPU): 219.0 | learning rate: 2.600000E-06 | global batch size:   128 | lm loss: 1.911683E+00 | load_balancing_loss: 1.079515E+00 | loss scale: 1.0 | grad norm: 3.183 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:56:52] iteration       14/     100 | consumed samples:         1792 | elapsed time per iteration (ms): 5927.4 | throughput per GPU (TFLOP/s/GPU): 220.3 | learning rate: 2.800000E-06 | global batch size:   128 | lm loss: 1.913695E+00 | load_balancing_loss: 1.076165E+00 | loss scale: 1.0 | grad norm: 2.994 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:56:58] iteration       15/     100 | consumed samples:         1920 | elapsed time per iteration (ms): 5926.4 | throughput per GPU (TFLOP/s/GPU): 220.4 | learning rate: 3.000000E-06 | global batch size:   128 | lm loss: 1.957101E+00 | load_balancing_loss: 1.069903E+00 | loss scale: 1.0 | grad norm: 2.853 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:57:04] iteration       16/     100 | consumed samples:         2048 | elapsed time per iteration (ms): 5912.7 | throughput per GPU (TFLOP/s/GPU): 220.9 | learning rate: 3.200000E-06 | global batch size:   128 | lm loss: 1.915763E+00 | load_balancing_loss: 1.065748E+00 | loss scale: 1.0 | grad norm: 2.778 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:57:10] iteration       17/     100 | consumed samples:         2176 | elapsed time per iteration (ms): 5706.3 | throughput per GPU (TFLOP/s/GPU): 228.9 | learning rate: 3.400000E-06 | global batch size:   128 | lm loss: 1.918353E+00 | load_balancing_loss: 1.064678E+00 | loss scale: 1.0 | grad norm: 2.911 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:57:15] iteration       18/     100 | consumed samples:         2304 | elapsed time per iteration (ms): 5732.8 | throughput per GPU (TFLOP/s/GPU): 227.8 | learning rate: 3.600000E-06 | global batch size:   128 | lm loss: 1.861051E+00 | load_balancing_loss: 1.058054E+00 | loss scale: 1.0 | grad norm: 2.449 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:57:21] iteration       19/     100 | consumed samples:         2432 | elapsed time per iteration (ms): 5684.9 | throughput per GPU (TFLOP/s/GPU): 229.7 | learning rate: 3.800000E-06 | global batch size:   128 | lm loss: 1.934895E+00 | load_balancing_loss: 1.049081E+00 | loss scale: 1.0 | grad norm: 2.447 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:57:27] iteration       20/     100 | consumed samples:         2560 | elapsed time per iteration (ms): 5770.6 | throughput per GPU (TFLOP/s/GPU): 226.3 | learning rate: 4.000000E-06 | global batch size:   128 | lm loss: 1.932632E+00 | load_balancing_loss: 1.052491E+00 | loss scale: 1.0 | grad norm: 2.456 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:57:32] iteration       21/     100 | consumed samples:         2688 | elapsed time per iteration (ms): 5541.8 | throughput per GPU (TFLOP/s/GPU): 235.6 | learning rate: 4.200000E-06 | global batch size:   128 | lm loss: 1.904877E+00 | load_balancing_loss: 1.047207E+00 | loss scale: 1.0 | grad norm: 2.213 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:57:38] iteration       22/     100 | consumed samples:         2816 | elapsed time per iteration (ms): 5576.7 | throughput per GPU (TFLOP/s/GPU): 234.2 | learning rate: 4.400000E-06 | global batch size:   128 | lm loss: 1.872380E+00 | load_balancing_loss: 1.039512E+00 | loss scale: 1.0 | grad norm: 2.116 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:57:44] iteration       23/     100 | consumed samples:         2944 | elapsed time per iteration (ms): 5807.4 | throughput per GPU (TFLOP/s/GPU): 224.9 | learning rate: 4.600000E-06 | global batch size:   128 | lm loss: 1.835408E+00 | load_balancing_loss: 1.042104E+00 | loss scale: 1.0 | grad norm: 2.034 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:57:50] iteration       24/     100 | consumed samples:         3072 | elapsed time per iteration (ms): 5727.3 | throughput per GPU (TFLOP/s/GPU): 228.0 | learning rate: 4.800000E-06 | global batch size:   128 | lm loss: 1.898657E+00 | load_balancing_loss: 1.029742E+00 | loss scale: 1.0 | grad norm: 1.982 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:57:55] iteration       25/     100 | consumed samples:         3200 | elapsed time per iteration (ms): 5498.4 | throughput per GPU (TFLOP/s/GPU): 237.5 | learning rate: 5.000000E-06 | global batch size:   128 | lm loss: 1.904866E+00 | load_balancing_loss: 1.034888E+00 | loss scale: 1.0 | grad norm: 1.872 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:58:01] iteration       26/     100 | consumed samples:         3328 | elapsed time per iteration (ms): 5531.7 | throughput per GPU (TFLOP/s/GPU): 236.1 | learning rate: 5.200000E-06 | global batch size:   128 | lm loss: 1.889752E+00 | load_balancing_loss: 1.028931E+00 | loss scale: 1.0 | grad norm: 1.793 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:58:06] iteration       27/     100 | consumed samples:         3456 | elapsed time per iteration (ms): 5678.3 | throughput per GPU (TFLOP/s/GPU): 230.0 | learning rate: 5.400000E-06 | global batch size:   128 | lm loss: 1.866109E+00 | load_balancing_loss: 1.031736E+00 | loss scale: 1.0 | grad norm: 1.773 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:58:12] iteration       28/     100 | consumed samples:         3584 | elapsed time per iteration (ms): 5650.6 | throughput per GPU (TFLOP/s/GPU): 231.1 | learning rate: 5.600000E-06 | global batch size:   128 | lm loss: 1.914117E+00 | load_balancing_loss: 1.027364E+00 | loss scale: 1.0 | grad norm: 1.709 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:58:18] iteration       29/     100 | consumed samples:         3712 | elapsed time per iteration (ms): 5912.1 | throughput per GPU (TFLOP/s/GPU): 220.9 | learning rate: 5.800000E-06 | global batch size:   128 | lm loss: 1.867856E+00 | load_balancing_loss: 1.023825E+00 | loss scale: 1.0 | grad norm: 1.769 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:58:23] iteration       30/     100 | consumed samples:         3840 | elapsed time per iteration (ms): 5571.1 | throughput per GPU (TFLOP/s/GPU): 234.4 | learning rate: 6.000000E-06 | global batch size:   128 | lm loss: 1.924535E+00 | load_balancing_loss: 1.025294E+00 | loss scale: 1.0 | grad norm: 1.572 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:58:29] iteration       31/     100 | consumed samples:         3968 | elapsed time per iteration (ms): 5718.9 | throughput per GPU (TFLOP/s/GPU): 228.3 | learning rate: 6.200000E-06 | global batch size:   128 | lm loss: 1.830754E+00 | load_balancing_loss: 1.028048E+00 | loss scale: 1.0 | grad norm: 1.555 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:58:35] iteration       32/     100 | consumed samples:         4096 | elapsed time per iteration (ms): 5629.3 | throughput per GPU (TFLOP/s/GPU): 232.0 | learning rate: 6.400000E-06 | global batch size:   128 | lm loss: 1.848776E+00 | load_balancing_loss: 1.021549E+00 | loss scale: 1.0 | grad norm: 1.592 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:58:40] iteration       33/     100 | consumed samples:         4224 | elapsed time per iteration (ms): 5600.4 | throughput per GPU (TFLOP/s/GPU): 233.2 | learning rate: 6.600000E-06 | global batch size:   128 | lm loss: 1.917658E+00 | load_balancing_loss: 1.032319E+00 | loss scale: 1.0 | grad norm: 1.519 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:58:46] iteration       34/     100 | consumed samples:         4352 | elapsed time per iteration (ms): 5643.8 | throughput per GPU (TFLOP/s/GPU): 231.4 | learning rate: 6.800000E-06 | global batch size:   128 | lm loss: 1.844636E+00 | load_balancing_loss: 1.019185E+00 | loss scale: 1.0 | grad norm: 1.626 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:58:51] iteration       35/     100 | consumed samples:         4480 | elapsed time per iteration (ms): 5367.8 | throughput per GPU (TFLOP/s/GPU): 243.3 | learning rate: 7.000000E-06 | global batch size:   128 | lm loss: 1.853418E+00 | load_balancing_loss: 1.020990E+00 | loss scale: 1.0 | grad norm: 1.760 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:58:57] iteration       36/     100 | consumed samples:         4608 | elapsed time per iteration (ms): 5399.9 | throughput per GPU (TFLOP/s/GPU): 241.8 | learning rate: 7.200000E-06 | global batch size:   128 | lm loss: 1.842918E+00 | load_balancing_loss: 1.023077E+00 | loss scale: 1.0 | grad norm: 1.409 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:59:02] iteration       37/     100 | consumed samples:         4736 | elapsed time per iteration (ms): 5515.8 | throughput per GPU (TFLOP/s/GPU): 236.8 | learning rate: 7.400000E-06 | global batch size:   128 | lm loss: 1.862270E+00 | load_balancing_loss: 1.023782E+00 | loss scale: 1.0 | grad norm: 1.718 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:59:08] iteration       38/     100 | consumed samples:         4864 | elapsed time per iteration (ms): 5477.8 | throughput per GPU (TFLOP/s/GPU): 238.4 | learning rate: 7.600000E-06 | global batch size:   128 | lm loss: 1.862543E+00 | load_balancing_loss: 1.019304E+00 | loss scale: 1.0 | grad norm: 1.722 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:59:13] iteration       39/     100 | consumed samples:         4992 | elapsed time per iteration (ms): 5649.1 | throughput per GPU (TFLOP/s/GPU): 231.2 | learning rate: 7.800000E-06 | global batch size:   128 | lm loss: 1.863421E+00 | load_balancing_loss: 1.017805E+00 | loss scale: 1.0 | grad norm: 1.469 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:59:19] iteration       40/     100 | consumed samples:         5120 | elapsed time per iteration (ms): 5810.4 | throughput per GPU (TFLOP/s/GPU): 224.8 | learning rate: 8.000000E-06 | global batch size:   128 | lm loss: 1.879655E+00 | load_balancing_loss: 1.017568E+00 | loss scale: 1.0 | grad norm: 1.633 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:59:25] iteration       41/     100 | consumed samples:         5248 | elapsed time per iteration (ms): 5462.9 | throughput per GPU (TFLOP/s/GPU): 239.1 | learning rate: 8.200000E-06 | global batch size:   128 | lm loss: 1.812076E+00 | load_balancing_loss: 1.020508E+00 | loss scale: 1.0 | grad norm: 1.419 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:59:30] iteration       42/     100 | consumed samples:         5376 | elapsed time per iteration (ms): 5452.3 | throughput per GPU (TFLOP/s/GPU): 239.5 | learning rate: 8.400000E-06 | global batch size:   128 | lm loss: 1.824542E+00 | load_balancing_loss: 1.017472E+00 | loss scale: 1.0 | grad norm: 1.400 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:59:36] iteration       43/     100 | consumed samples:         5504 | elapsed time per iteration (ms): 5444.9 | throughput per GPU (TFLOP/s/GPU): 239.8 | learning rate: 8.600000E-06 | global batch size:   128 | lm loss: 1.825991E+00 | load_balancing_loss: 1.019746E+00 | loss scale: 1.0 | grad norm: 1.426 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:59:41] iteration       44/     100 | consumed samples:         5632 | elapsed time per iteration (ms): 5533.8 | throughput per GPU (TFLOP/s/GPU): 236.0 | learning rate: 8.800000E-06 | global batch size:   128 | lm loss: 1.875063E+00 | load_balancing_loss: 1.020033E+00 | loss scale: 1.0 | grad norm: 1.327 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:59:47] iteration       45/     100 | consumed samples:         5760 | elapsed time per iteration (ms): 5718.6 | throughput per GPU (TFLOP/s/GPU): 228.4 | learning rate: 9.000000E-06 | global batch size:   128 | lm loss: 1.834162E+00 | load_balancing_loss: 1.018004E+00 | loss scale: 1.0 | grad norm: 1.611 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:59:52] iteration       46/     100 | consumed samples:         5888 | elapsed time per iteration (ms): 5567.2 | throughput per GPU (TFLOP/s/GPU): 234.6 | learning rate: 9.200000E-06 | global batch size:   128 | lm loss: 1.883577E+00 | load_balancing_loss: 1.016062E+00 | loss scale: 1.0 | grad norm: 1.439 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:59:58] iteration       47/     100 | consumed samples:         6016 | elapsed time per iteration (ms): 5692.2 | throughput per GPU (TFLOP/s/GPU): 229.4 | learning rate: 9.400000E-06 | global batch size:   128 | lm loss: 1.836727E+00 | load_balancing_loss: 1.019520E+00 | loss scale: 1.0 | grad norm: 1.372 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:00:04] iteration       48/     100 | consumed samples:         6144 | elapsed time per iteration (ms): 5872.4 | throughput per GPU (TFLOP/s/GPU): 222.4 | learning rate: 9.600000E-06 | global batch size:   128 | lm loss: 1.855191E+00 | load_balancing_loss: 1.017754E+00 | loss scale: 1.0 | grad norm: 1.508 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:00:09] iteration       49/     100 | consumed samples:         6272 | elapsed time per iteration (ms): 5528.7 | throughput per GPU (TFLOP/s/GPU): 236.2 | learning rate: 9.800000E-06 | global batch size:   128 | lm loss: 1.806294E+00 | load_balancing_loss: 1.017504E+00 | loss scale: 1.0 | grad norm: 1.529 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:00:15] iteration       50/     100 | consumed samples:         6400 | elapsed time per iteration (ms): 5531.5 | throughput per GPU (TFLOP/s/GPU): 236.1 | learning rate: 1.000000E-05 | global batch size:   128 | lm loss: 1.887587E+00 | load_balancing_loss: 1.016094E+00 | loss scale: 1.0 | grad norm: 1.439 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:00:21] iteration       51/     100 | consumed samples:         6528 | elapsed time per iteration (ms): 5501.3 | throughput per GPU (TFLOP/s/GPU): 237.4 | learning rate: 1.020000E-05 | global batch size:   128 | lm loss: 1.834414E+00 | load_balancing_loss: 1.015084E+00 | loss scale: 1.0 | grad norm: 1.599 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:00:26] iteration       52/     100 | consumed samples:         6656 | elapsed time per iteration (ms): 5520.9 | throughput per GPU (TFLOP/s/GPU): 236.5 | learning rate: 1.040000E-05 | global batch size:   128 | lm loss: 1.847078E+00 | load_balancing_loss: 1.015950E+00 | loss scale: 1.0 | grad norm: 1.486 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:00:32] iteration       53/     100 | consumed samples:         6784 | elapsed time per iteration (ms): 5711.6 | throughput per GPU (TFLOP/s/GPU): 228.6 | learning rate: 1.060000E-05 | global batch size:   128 | lm loss: 1.862840E+00 | load_balancing_loss: 1.016317E+00 | loss scale: 1.0 | grad norm: 1.522 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:00:37] iteration       54/     100 | consumed samples:         6912 | elapsed time per iteration (ms): 5689.4 | throughput per GPU (TFLOP/s/GPU): 229.5 | learning rate: 1.080000E-05 | global batch size:   128 | lm loss: 1.897956E+00 | load_balancing_loss: 1.017408E+00 | loss scale: 1.0 | grad norm: 1.383 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:00:43] iteration       55/     100 | consumed samples:         7040 | elapsed time per iteration (ms): 5763.8 | throughput per GPU (TFLOP/s/GPU): 226.6 | learning rate: 1.100000E-05 | global batch size:   128 | lm loss: 1.863309E+00 | load_balancing_loss: 1.014457E+00 | loss scale: 1.0 | grad norm: 1.534 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:00:49] iteration       56/     100 | consumed samples:         7168 | elapsed time per iteration (ms): 5742.1 | throughput per GPU (TFLOP/s/GPU): 227.4 | learning rate: 1.120000E-05 | global batch size:   128 | lm loss: 1.899538E+00 | load_balancing_loss: 1.018558E+00 | loss scale: 1.0 | grad norm: 1.470 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:00:54] iteration       57/     100 | consumed samples:         7296 | elapsed time per iteration (ms): 5450.5 | throughput per GPU (TFLOP/s/GPU): 239.6 | learning rate: 1.140000E-05 | global batch size:   128 | lm loss: 1.864605E+00 | load_balancing_loss: 1.015150E+00 | loss scale: 1.0 | grad norm: 1.244 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:01:00] iteration       58/     100 | consumed samples:         7424 | elapsed time per iteration (ms): 5538.9 | throughput per GPU (TFLOP/s/GPU): 235.8 | learning rate: 1.160000E-05 | global batch size:   128 | lm loss: 1.812579E+00 | load_balancing_loss: 1.020851E+00 | loss scale: 1.0 | grad norm: 1.610 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:01:05] iteration       59/     100 | consumed samples:         7552 | elapsed time per iteration (ms): 5410.9 | throughput per GPU (TFLOP/s/GPU): 241.3 | learning rate: 1.180000E-05 | global batch size:   128 | lm loss: 1.848337E+00 | load_balancing_loss: 1.013638E+00 | loss scale: 1.0 | grad norm: 1.351 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:01:11] iteration       60/     100 | consumed samples:         7680 | elapsed time per iteration (ms): 5603.1 | throughput per GPU (TFLOP/s/GPU): 233.1 | learning rate: 1.200000E-05 | global batch size:   128 | lm loss: 1.801180E+00 | load_balancing_loss: 1.019084E+00 | loss scale: 1.0 | grad norm: 1.549 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:01:16] iteration       61/     100 | consumed samples:         7808 | elapsed time per iteration (ms): 5495.5 | throughput per GPU (TFLOP/s/GPU): 237.6 | learning rate: 1.220000E-05 | global batch size:   128 | lm loss: 1.813972E+00 | load_balancing_loss: 1.014779E+00 | loss scale: 1.0 | grad norm: 1.427 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:01:22] iteration       62/     100 | consumed samples:         7936 | elapsed time per iteration (ms): 5753.1 | throughput per GPU (TFLOP/s/GPU): 227.0 | learning rate: 1.240000E-05 | global batch size:   128 | lm loss: 1.808689E+00 | load_balancing_loss: 1.022012E+00 | loss scale: 1.0 | grad norm: 1.398 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:01:28] iteration       63/     100 | consumed samples:         8064 | elapsed time per iteration (ms): 5650.1 | throughput per GPU (TFLOP/s/GPU): 231.1 | learning rate: 1.260000E-05 | global batch size:   128 | lm loss: 1.781526E+00 | load_balancing_loss: 1.013716E+00 | loss scale: 1.0 | grad norm: 1.494 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:01:33] iteration       64/     100 | consumed samples:         8192 | elapsed time per iteration (ms): 5539.7 | throughput per GPU (TFLOP/s/GPU): 235.7 | learning rate: 1.280000E-05 | global batch size:   128 | lm loss: 1.871476E+00 | load_balancing_loss: 1.019044E+00 | loss scale: 1.0 | grad norm: 1.369 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:01:39] iteration       65/     100 | consumed samples:         8320 | elapsed time per iteration (ms): 5493.9 | throughput per GPU (TFLOP/s/GPU): 237.7 | learning rate: 1.300000E-05 | global batch size:   128 | lm loss: 1.846450E+00 | load_balancing_loss: 1.017387E+00 | loss scale: 1.0 | grad norm: 1.308 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:01:44] iteration       66/     100 | consumed samples:         8448 | elapsed time per iteration (ms): 5590.8 | throughput per GPU (TFLOP/s/GPU): 233.6 | learning rate: 1.320000E-05 | global batch size:   128 | lm loss: 1.873755E+00 | load_balancing_loss: 1.014257E+00 | loss scale: 1.0 | grad norm: 1.411 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:01:50] iteration       67/     100 | consumed samples:         8576 | elapsed time per iteration (ms): 5710.3 | throughput per GPU (TFLOP/s/GPU): 228.7 | learning rate: 1.340000E-05 | global batch size:   128 | lm loss: 1.765591E+00 | load_balancing_loss: 1.016482E+00 | loss scale: 1.0 | grad norm: 1.414 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:01:56] iteration       68/     100 | consumed samples:         8704 | elapsed time per iteration (ms): 5734.5 | throughput per GPU (TFLOP/s/GPU): 227.7 | learning rate: 1.360000E-05 | global batch size:   128 | lm loss: 1.839895E+00 | load_balancing_loss: 1.012786E+00 | loss scale: 1.0 | grad norm: 1.371 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:02:01] iteration       69/     100 | consumed samples:         8832 | elapsed time per iteration (ms): 5478.6 | throughput per GPU (TFLOP/s/GPU): 238.4 | learning rate: 1.380000E-05 | global batch size:   128 | lm loss: 1.912256E+00 | load_balancing_loss: 1.013041E+00 | loss scale: 1.0 | grad norm: 1.485 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:02:07] iteration       70/     100 | consumed samples:         8960 | elapsed time per iteration (ms): 5514.8 | throughput per GPU (TFLOP/s/GPU): 236.8 | learning rate: 1.400000E-05 | global batch size:   128 | lm loss: 1.873068E+00 | load_balancing_loss: 1.012509E+00 | loss scale: 1.0 | grad norm: 1.467 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:02:12] iteration       71/     100 | consumed samples:         9088 | elapsed time per iteration (ms): 5361.6 | throughput per GPU (TFLOP/s/GPU): 243.6 | learning rate: 1.420000E-05 | global batch size:   128 | lm loss: 1.818812E+00 | load_balancing_loss: 1.013377E+00 | loss scale: 1.0 | grad norm: 1.300 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:02:18] iteration       72/     100 | consumed samples:         9216 | elapsed time per iteration (ms): 5470.7 | throughput per GPU (TFLOP/s/GPU): 238.7 | learning rate: 1.440000E-05 | global batch size:   128 | lm loss: 1.820313E+00 | load_balancing_loss: 1.019612E+00 | loss scale: 1.0 | grad norm: 1.305 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:02:24] iteration       73/     100 | consumed samples:         9344 | elapsed time per iteration (ms): 5829.9 | throughput per GPU (TFLOP/s/GPU): 224.0 | learning rate: 1.460000E-05 | global batch size:   128 | lm loss: 1.798953E+00 | load_balancing_loss: 1.010977E+00 | loss scale: 1.0 | grad norm: 1.539 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:02:29] iteration       74/     100 | consumed samples:         9472 | elapsed time per iteration (ms): 5702.4 | throughput per GPU (TFLOP/s/GPU): 229.0 | learning rate: 1.480000E-05 | global batch size:   128 | lm loss: 1.774078E+00 | load_balancing_loss: 1.012441E+00 | loss scale: 1.0 | grad norm: 1.471 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:02:35] iteration       75/     100 | consumed samples:         9600 | elapsed time per iteration (ms): 5599.5 | throughput per GPU (TFLOP/s/GPU): 233.2 | learning rate: 1.500000E-05 | global batch size:   128 | lm loss: 1.838492E+00 | load_balancing_loss: 1.015038E+00 | loss scale: 1.0 | grad norm: 1.445 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:02:40] iteration       76/     100 | consumed samples:         9728 | elapsed time per iteration (ms): 5588.2 | throughput per GPU (TFLOP/s/GPU): 233.7 | learning rate: 1.520000E-05 | global batch size:   128 | lm loss: 1.860703E+00 | load_balancing_loss: 1.012689E+00 | loss scale: 1.0 | grad norm: 1.500 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:02:46] iteration       77/     100 | consumed samples:         9856 | elapsed time per iteration (ms): 5425.4 | throughput per GPU (TFLOP/s/GPU): 240.7 | learning rate: 1.540000E-05 | global batch size:   128 | lm loss: 1.827507E+00 | load_balancing_loss: 1.012502E+00 | loss scale: 1.0 | grad norm: 1.491 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:02:52] iteration       78/     100 | consumed samples:         9984 | elapsed time per iteration (ms): 5652.9 | throughput per GPU (TFLOP/s/GPU): 231.0 | learning rate: 1.560000E-05 | global batch size:   128 | lm loss: 1.784492E+00 | load_balancing_loss: 1.013809E+00 | loss scale: 1.0 | grad norm: 1.407 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:02:57] iteration       79/     100 | consumed samples:        10112 | elapsed time per iteration (ms): 5577.0 | throughput per GPU (TFLOP/s/GPU): 234.2 | learning rate: 1.580000E-05 | global batch size:   128 | lm loss: 1.858489E+00 | load_balancing_loss: 1.011662E+00 | loss scale: 1.0 | grad norm: 1.621 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:03:03] iteration       80/     100 | consumed samples:        10240 | elapsed time per iteration (ms): 5712.8 | throughput per GPU (TFLOP/s/GPU): 228.6 | learning rate: 1.600000E-05 | global batch size:   128 | lm loss: 1.842588E+00 | load_balancing_loss: 1.011640E+00 | loss scale: 1.0 | grad norm: 1.631 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:03:09] iteration       81/     100 | consumed samples:        10368 | elapsed time per iteration (ms): 5684.5 | throughput per GPU (TFLOP/s/GPU): 229.7 | learning rate: 1.620000E-05 | global batch size:   128 | lm loss: 1.818980E+00 | load_balancing_loss: 1.012697E+00 | loss scale: 1.0 | grad norm: 1.564 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:03:14] iteration       82/     100 | consumed samples:        10496 | elapsed time per iteration (ms): 5592.0 | throughput per GPU (TFLOP/s/GPU): 233.5 | learning rate: 1.640000E-05 | global batch size:   128 | lm loss: 1.805010E+00 | load_balancing_loss: 1.012805E+00 | loss scale: 1.0 | grad norm: 1.545 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:03:20] iteration       83/     100 | consumed samples:        10624 | elapsed time per iteration (ms): 5641.6 | throughput per GPU (TFLOP/s/GPU): 231.5 | learning rate: 1.660000E-05 | global batch size:   128 | lm loss: 1.812314E+00 | load_balancing_loss: 1.011967E+00 | loss scale: 1.0 | grad norm: 1.530 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:03:25] iteration       84/     100 | consumed samples:        10752 | elapsed time per iteration (ms): 5563.7 | throughput per GPU (TFLOP/s/GPU): 234.7 | learning rate: 1.680000E-05 | global batch size:   128 | lm loss: 1.822110E+00 | load_balancing_loss: 1.009684E+00 | loss scale: 1.0 | grad norm: 1.799 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:03:31] iteration       85/     100 | consumed samples:        10880 | elapsed time per iteration (ms): 5580.9 | throughput per GPU (TFLOP/s/GPU): 234.0 | learning rate: 1.700000E-05 | global batch size:   128 | lm loss: 1.831795E+00 | load_balancing_loss: 1.009344E+00 | loss scale: 1.0 | grad norm: 1.578 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:03:37] iteration       86/     100 | consumed samples:        11008 | elapsed time per iteration (ms): 5695.8 | throughput per GPU (TFLOP/s/GPU): 229.3 | learning rate: 1.720000E-05 | global batch size:   128 | lm loss: 1.831625E+00 | load_balancing_loss: 1.011533E+00 | loss scale: 1.0 | grad norm: 1.515 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:03:42] iteration       87/     100 | consumed samples:        11136 | elapsed time per iteration (ms): 5444.5 | throughput per GPU (TFLOP/s/GPU): 239.9 | learning rate: 1.740000E-05 | global batch size:   128 | lm loss: 1.814374E+00 | load_balancing_loss: 1.010052E+00 | loss scale: 1.0 | grad norm: 1.365 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:03:48] iteration       88/     100 | consumed samples:        11264 | elapsed time per iteration (ms): 5462.7 | throughput per GPU (TFLOP/s/GPU): 239.1 | learning rate: 1.760000E-05 | global batch size:   128 | lm loss: 1.825778E+00 | load_balancing_loss: 1.010838E+00 | loss scale: 1.0 | grad norm: 1.506 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:03:53] iteration       89/     100 | consumed samples:        11392 | elapsed time per iteration (ms): 5633.2 | throughput per GPU (TFLOP/s/GPU): 231.8 | learning rate: 1.780000E-05 | global batch size:   128 | lm loss: 1.818898E+00 | load_balancing_loss: 1.011014E+00 | loss scale: 1.0 | grad norm: 1.358 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:03:59] iteration       90/     100 | consumed samples:        11520 | elapsed time per iteration (ms): 5567.8 | throughput per GPU (TFLOP/s/GPU): 234.5 | learning rate: 1.800000E-05 | global batch size:   128 | lm loss: 1.813602E+00 | load_balancing_loss: 1.022434E+00 | loss scale: 1.0 | grad norm: 1.590 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:04:04] iteration       91/     100 | consumed samples:        11648 | elapsed time per iteration (ms): 5691.9 | throughput per GPU (TFLOP/s/GPU): 229.4 | learning rate: 1.820000E-05 | global batch size:   128 | lm loss: 1.797111E+00 | load_balancing_loss: 1.011964E+00 | loss scale: 1.0 | grad norm: 1.436 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:04:10] iteration       92/     100 | consumed samples:        11776 | elapsed time per iteration (ms): 5451.5 | throughput per GPU (TFLOP/s/GPU): 239.6 | learning rate: 1.840000E-05 | global batch size:   128 | lm loss: 1.809117E+00 | load_balancing_loss: 1.012038E+00 | loss scale: 1.0 | grad norm: 1.577 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:04:15] iteration       93/     100 | consumed samples:        11904 | elapsed time per iteration (ms): 5599.2 | throughput per GPU (TFLOP/s/GPU): 233.2 | learning rate: 1.860000E-05 | global batch size:   128 | lm loss: 1.797812E+00 | load_balancing_loss: 1.011838E+00 | loss scale: 1.0 | grad norm: 1.553 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:04:21] iteration       94/     100 | consumed samples:        12032 | elapsed time per iteration (ms): 5443.7 | throughput per GPU (TFLOP/s/GPU): 239.9 | learning rate: 1.880000E-05 | global batch size:   128 | lm loss: 1.865515E+00 | load_balancing_loss: 1.013109E+00 | loss scale: 1.0 | grad norm: 1.603 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:04:26] iteration       95/     100 | consumed samples:        12160 | elapsed time per iteration (ms): 5540.0 | throughput per GPU (TFLOP/s/GPU): 235.7 | learning rate: 1.900000E-05 | global batch size:   128 | lm loss: 1.845348E+00 | load_balancing_loss: 1.012796E+00 | loss scale: 1.0 | grad norm: 1.599 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:04:32] iteration       96/     100 | consumed samples:        12288 | elapsed time per iteration (ms): 5702.2 | throughput per GPU (TFLOP/s/GPU): 229.0 | learning rate: 1.920000E-05 | global batch size:   128 | lm loss: 1.843516E+00 | load_balancing_loss: 1.010116E+00 | loss scale: 1.0 | grad norm: 1.851 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:04:38] iteration       97/     100 | consumed samples:        12416 | elapsed time per iteration (ms): 5733.2 | throughput per GPU (TFLOP/s/GPU): 227.8 | learning rate: 1.940000E-05 | global batch size:   128 | lm loss: 1.876754E+00 | load_balancing_loss: 1.011542E+00 | loss scale: 1.0 | grad norm: 1.748 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:04:43] iteration       98/     100 | consumed samples:        12544 | elapsed time per iteration (ms): 5556.4 | throughput per GPU (TFLOP/s/GPU): 235.0 | learning rate: 1.960000E-05 | global batch size:   128 | lm loss: 1.810738E+00 | load_balancing_loss: 1.010371E+00 | loss scale: 1.0 | grad norm: 1.472 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:04:49] iteration       99/     100 | consumed samples:        12672 | elapsed time per iteration (ms): 5523.5 | throughput per GPU (TFLOP/s/GPU): 236.4 | learning rate: 1.980000E-05 | global batch size:   128 | lm loss: 1.872008E+00 | load_balancing_loss: 1.008882E+00 | loss scale: 1.0 | grad norm: 1.681 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:04:54] iteration      100/     100 | consumed samples:        12800 | elapsed time per iteration (ms): 5540.0 | throughput per GPU (TFLOP/s/GPU): 235.7 | learning rate: 2.000000E-05 | global batch size:   128 | lm loss: 1.824753E+00 | load_balancing_loss: 1.009905E+00 | loss scale: 1.0 | grad norm: 1.625 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
[after training is done] datetime: 2024-04-06 03:04:54

Thank you for reporting this issue. 130 TFLOPS is indeed too low for the H100. I quickly reviewed your script and have some suggestions (see the flag sketch below):

  1. Update the code to the latest main branch and upgrade grouped_gemm to v1.0.
  2. Use the alltoall dispatcher: --moe-token-dispatcher-type alltoall.
  3. Use EP8TP2 (expert parallelism 8, tensor parallelism 2).
  4. Train for a while (at least 400 steps) before checking performance, or load a pretrained checkpoint. In the early stages the router weights are not yet sufficiently trained, which leads to an imbalanced token distribution across the experts.
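
A minimal sketch of how suggestions 2 and 3 might look as launch flags; the MOE_ARGS array name and the echo are illustrative, not from the original scripts:

    # Hypothetical flag block for EP8TP2 + the alltoall dispatcher; append to
    # the existing pretrain_gpt.py launch command.
    MOE_ARGS=(
        --tensor-model-parallel-size 2
        --expert-model-parallel-size 8
        --moe-grouped-gemm                      # uses the grouped_gemm library
        --moe-token-dispatcher-type alltoall
    )
    echo "${MOE_ARGS[@]}"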

If expert_parallel_size == num_moe_experts, then num_local_experts is 1 and GroupedMLP behaves the same as SequentialMLP, is that right? Also, as far as I know, the communication overhead of PP is lower than that of TP and EP as long as the pipeline bubble fraction is not too high, so does MoE support PP, and would PP make it more efficient?
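
(For reference, a trivial bash illustration of the partitioning arithmetic behind this question; the variable names are mine, not from the code:)

    # With EP == num_experts, each rank holds num_experts / EP = 1 expert, so the
    # grouped GEMM covers a single expert and degenerates to the sequential case.
    num_experts=8; ep_size=8
    echo $(( num_experts / ep_size ))   # -> 1 local expert per rank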

Hi, thanks for the suggestions. I retested the throughput according to them. To be more specific:

  1. Updated Megatron-LM to the latest commit (ba77325)
  2. Updated grouped_gemm to v1.0.0 (fanshiqing/grouped_gemm@7a7f018)
  3. Set --moe-token-dispatcher-type alltoall
  4. Switched to EP=8 & TP=2
  5. Used the pre-trained weights from Mistral AI (converted from the HF checkpoint)

The throughput has indeed increased significantly, reaching around 230 TFLOP/s. However, for H100s, that still seems pretty low, doesn't it? May I ask what a more reasonable throughput would be, theoretically?

Here are the logs.

Apologies for the delayed reply. 230 TFLOPS falls below our expectations; currently we can exceed 330 TFLOPS on the H100, and potentially go higher by switching to EP8TP1 with recomputation.
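
(As a rough sanity check, assuming the commonly cited ~989 TFLOP/s dense BF16 peak of an H100 SXM GPU, the reported throughputs correspond to roughly the following model FLOPS utilization; this is back-of-the-envelope only:)

    # Approximate MFU at each reported throughput, assuming ~989 TFLOP/s
    # dense BF16 peak per H100 SXM GPU.
    awk 'BEGIN { peak = 989
                 printf "130 TFLOP/s -> %.0f%% MFU\n", 100*130/peak
                 printf "230 TFLOP/s -> %.0f%% MFU\n", 100*230/peak
                 printf "330 TFLOP/s -> %.0f%% MFU\n", 100*330/peak }'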

Does that mean you can achieve over 330 TFLOPS in the same or similar software environment and settings?
Should I then suspect hardware-related issues, such as network speeds between nodes?

Hi @ShinoharaHare , our env is:

  1. DGX H100, 64 GPUs.
  2. PyTorch 24.03 NGC image.

I double-checked your scripts and suggest the following modifications (sketched below):

  1. Seq Len: 2048 -> 4096
  2. Enable data-parallel communication overlap: --overlap-grad-reduce --overlap-param-gather

Let's see how performance changes after these modifications ^ ^.
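
(A minimal sketch of these two modifications as flags; the TUNING_ARGS array name is illustrative only:)

    # Hypothetical flag additions for the suggested modifications; note that
    # --overlap-param-gather assumes --use-distributed-optimizer is enabled,
    # which it already is in the posted logs.
    TUNING_ARGS=(
        --seq-length 4096          # raised from 2048
        --overlap-grad-reduce      # overlap grad reduce-scatter with backprop
        --overlap-param-gather     # overlap param all-gather with forward
    )
    echo "${TUNING_ARGS[@]}"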

Hi XLZed, MCore MoE does support PP, but for the Mixtral 8x7B model, we prefer EP and TP.

Does grouped_gemm support a variable number of tokens per local expert on the same rank?

Yes, we support variable lengths for inputs from each local expert.

@yanring
Enabling --overlap-grad-reduce and --overlap-param-gather results in "CUDA error: uncorrectable ECC error encountered", which at its root seems to be caused by OOM.
I had already tried setting the sequence length to 4096, but doing so leads directly to a CUDA OOM.
I've also tried adding --recompute-activations in both scenarios, but still get OOM.

Here are the full logs from the retest:
using world size: 16, data-parallel size: 8, context-parallel size: 1 tensor-model-parallel size: 2, pipeline-model-parallel size: 1 
WARNING: overriding default arguments for tokenizer_type:GPT2BPETokenizer                        with tokenizer_type:Llama2Tokenizer
WARNING: Setting args.overlap_p2p_comm to False since non-interleaved schedule does not support overlapping p2p communication
accumulate and all-reduce gradients in fp32 for bfloat16 data type.
using torch.bfloat16 for parameters ...
------------------------ arguments ------------------------
  accumulate_allreduce_grads_in_fp32 .............. True
  adam_beta1 ...................................... 0.9
  adam_beta2 ...................................... 0.999
  adam_eps ........................................ 1e-08
  add_bias_linear ................................. False
  add_position_embedding .......................... True
  add_qkv_bias .................................... False
  adlr_autoresume ................................. False
  adlr_autoresume_interval ........................ 1000
  apply_layernorm_1p .............................. False
  apply_query_key_layer_scaling ................... False
  apply_residual_connection_post_layernorm ........ False
  apply_rope_fusion ............................... True
  async_tensor_model_parallel_allreduce ........... False
  attention_dropout ............................... 0.0
  attention_softmax_in_fp32 ....................... False
  auto_detect_ckpt_format ......................... False
  barrier_with_L1_time ............................ True
  bert_binary_head ................................ True
  bert_embedder_type .............................. megatron
  bert_load ....................................... None
  bf16 ............................................ True
  bias_dropout_fusion ............................. True
  bias_gelu_fusion ................................ False
  bias_swiglu_fusion .............................. True
  biencoder_projection_dim ........................ 0
  biencoder_shared_query_context_model ............ False
  block_data_path ................................. None
  check_for_nan_in_loss_and_grad .................. True
  ckpt_fully_parallel_save ........................ False
  ckpt_step ....................................... None
  classes_fraction ................................ 1.0
  clip_grad ....................................... 1.0
  clone_scatter_output_in_embedding ............... True
  consumed_train_samples .......................... 0
  consumed_valid_samples .......................... 0
  context_parallel_size ........................... 1
  create_attention_mask_in_dataloader ............. True
  data_cache_path ................................. None
  data_parallel_random_init ....................... False
  data_parallel_size .............................. 8
  data_path ....................................... ['custom/data/wudao/wudao_mistralbpe_content_document']
  data_per_class_fraction ......................... 1.0
  data_sharding ................................... True
  dataloader_type ................................. single
  decoder_num_layers .............................. None
  decoder_seq_length .............................. None
  decoupled_lr .................................... None
  decoupled_min_lr ................................ None
  delay_grad_reduce ............................... True
  delay_param_gather .............................. False
  dino_bottleneck_size ............................ 256
  dino_freeze_last_layer .......................... 1
  dino_head_hidden_size ........................... 2048
  dino_local_crops_number ......................... 10
  dino_local_img_size ............................. 96
  dino_norm_last_layer ............................ False
  dino_teacher_temp ............................... 0.07
  dino_warmup_teacher_temp ........................ 0.04
  dino_warmup_teacher_temp_epochs ................. 30
  dist_ckpt_format ................................ torch_dist
  distribute_saved_activations .................... False
  distributed_backend ............................. nccl
  distributed_timeout_minutes ..................... 10
  embedding_path .................................. None
  empty_unused_memory_level ....................... 0
  enable_one_logger ............................... False
  encoder_num_layers .............................. 32
  encoder_seq_length .............................. 2048
  end_weight_decay ................................ 0.1
  eod_mask_loss ................................... False
  eval_interval ................................... 1000
  eval_iters ...................................... 1
  evidence_data_path .............................. None
  exit_duration_in_mins ........................... None
  exit_interval ................................... None
  exit_on_missing_checkpoint ...................... False
  exit_signal_handler ............................. False
  expert_model_parallel_size ...................... 8
  ffn_hidden_size ................................. 14336
  finetune ........................................ False
  fp16 ............................................ False
  fp16_lm_cross_entropy ........................... False
  fp32_residual_connection ........................ False
  fp8 ............................................. None
  fp8_amax_compute_algo ........................... most_recent
  fp8_amax_history_len ............................ 1
  fp8_interval .................................... 1
  fp8_margin ...................................... 0
  fp8_wgrad ....................................... True
  global_batch_size ............................... 128
  gradient_accumulation_fusion .................... True
  group_query_attention ........................... True
  head_lr_mult .................................... 1.0
  hidden_dropout .................................. 0.0
  hidden_size ..................................... 4096
  hysteresis ...................................... 2
  ict_head_size ................................... None
  ict_load ........................................ None
  img_h ........................................... 224
  img_w ........................................... 224
  indexer_batch_size .............................. 128
  indexer_log_interval ............................ 1000
  inference_batch_times_seqlen_threshold .......... 512
  init_method_std ................................. 0.02
  init_method_xavier_uniform ...................... False
  initial_loss_scale .............................. 4294967296
  iter_per_epoch .................................. 1250
  kv_channels ..................................... 128
  lazy_mpu_init ................................... None
  load ............................................ custom/ckpt/mixtral-8x7b-tp2-ep8-mgg
  local_rank ...................................... None
  log_batch_size_to_tensorboard ................... False
  log_interval .................................... 1
  log_learning_rate_to_tensorboard ................ True
  log_loss_scale_to_tensorboard ................... True
  log_memory_to_tensorboard ....................... False
  log_num_zeros_in_grad ........................... False
  log_params_norm ................................. False
  log_progress .................................... True
  log_throughput .................................. True
  log_timers_to_tensorboard ....................... False
  log_validation_ppl_to_tensorboard ............... False
  log_world_size_to_tensorboard ................... False
  loss_scale ...................................... None
  loss_scale_window ............................... 1000
  lr .............................................. 0.0001
  lr_decay_iters .................................. 320000
  lr_decay_samples ................................ None
  lr_decay_style .................................. cosine
  lr_warmup_fraction .............................. None
  lr_warmup_init .................................. 0.0
  lr_warmup_iters ................................. 500
  lr_warmup_samples ............................... 0
  make_vocab_size_divisible_by .................... 128
  manual_gc ....................................... False
  manual_gc_eval .................................. True
  manual_gc_interval .............................. 0
  mask_factor ..................................... 1.0
  mask_prob ....................................... 0.15
  mask_type ....................................... random
  masked_softmax_fusion ........................... False
  max_position_embeddings ......................... 32768
  max_tokens_to_oom ............................... 12000
  merge_file ...................................... None
  micro_batch_size ................................ 1
  min_loss_scale .................................. 1.0
  min_lr .......................................... 1e-05
  mmap_bin_files .................................. True
  mock_data ....................................... False
  moe_aux_loss_coeff .............................. 0.01
  moe_grouped_gemm ................................ True
  moe_input_jitter_eps ............................ None
  moe_per_layer_logging ........................... False
  moe_router_load_balancing_type .................. aux_loss
  moe_router_topk ................................. 2
  moe_token_dispatcher_type ....................... alltoall
  moe_token_dropping .............................. False
  moe_z_loss_coeff ................................ None
  nccl_communicator_config_path ................... None
  no_load_optim ................................... True
  no_load_rng ..................................... True
  no_persist_layer_norm ........................... False
  no_save_optim ................................... None
  no_save_rng ..................................... None
  norm_epsilon .................................... 1e-05
  normalization ................................... RMSNorm
  num_attention_heads ............................. 32
  num_channels .................................... 3
  num_classes ..................................... 1000
  num_experts ..................................... 8
  num_layers ...................................... 32
  num_layers_per_virtual_pipeline_stage ........... None
  num_query_groups ................................ 8
  num_workers ..................................... 2
  one_logger_entity ............................... hwinf_dcm
  one_logger_project .............................. e2e-tracking
  one_logger_run_name ............................. None
  onnx_safe ....................................... None
  openai_gelu ..................................... False
  optimizer ....................................... adam
  output_bert_embeddings .......................... False
  overlap_grad_reduce ............................. False
  overlap_p2p_comm ................................ False
  overlap_param_gather ............................ False
  override_opt_param_scheduler .................... False
  params_dtype .................................... torch.bfloat16
  patch_dim ....................................... 16
  perform_initialization .......................... True
  pipeline_model_parallel_size .................... 1
  pipeline_model_parallel_split_rank .............. None
  position_embedding_type ......................... rope
  pretrained_checkpoint ........................... None
  profile ......................................... True
  profile_ranks ................................... [0]
  profile_step_end ................................ 12
  profile_step_start .............................. 10
  qk_layernorm .................................... False
  query_in_block_prob ............................. 0.1
  rampup_batch_size ............................... None
  rank ............................................ 0
  recompute_granularity ........................... None
  recompute_method ................................ None
  recompute_num_layers ............................ None
  reset_attention_mask ............................ False
  reset_position_ids .............................. False
  retriever_report_topk_accuracies ................ []
  retriever_score_scaling ......................... False
  retriever_seq_length ............................ 256
  retro_add_retriever ............................. False
  retro_attention_gate ............................ 1
  retro_cyclic_train_iters ........................ None
  retro_encoder_attention_dropout ................. 0.1
  retro_encoder_hidden_dropout .................... 0.1
  retro_encoder_layers ............................ 2
  retro_num_neighbors ............................. 2
  retro_num_retrieved_chunks ...................... 2
  retro_project_dir ............................... None
  retro_verify_neighbor_count ..................... True
  rotary_interleaved .............................. False
  rotary_percent .................................. 1.0
  rotary_seq_len_interpolation_factor ............. None
  sample_rate ..................................... 1.0
  save ............................................ custom/ckpt/mixtral-8x7b-tp2-ep8-mgg
  save_interval ................................... 1000
  scatter_gather_tensors_in_pipeline .............. True
  seed ............................................ 1234
  seq_length ...................................... 2048
  sequence_parallel ............................... True
  sgd_momentum .................................... 0.9
  short_seq_prob .................................. 0.1
  skip_train ...................................... False
  spec ............................................ None
  split ........................................... 99990,8,2
  squared_relu .................................... False
  standalone_embedding_stage ...................... False
  start_weight_decay .............................. 0.1
  swiglu .......................................... True
  swin_backbone_type .............................. tiny
  tensor_model_parallel_size ...................... 2
  tensorboard_dir ................................. custom/ckpt/mixtral-8x7b-tp2-ep8-mgg/tensorboard
  tensorboard_log_interval ........................ 1
  tensorboard_queue_size .......................... 1000
  test_data_path .................................. None
  test_mode ....................................... False
  timing_log_level ................................ 0
  timing_log_option ............................... minmax
  titles_data_path ................................ None
  tokenizer_model ................................. custom/ckpt/mixtral-8x7b/tokenizer.model
  tokenizer_type .................................. Llama2Tokenizer
  tp_comm_bulk_dgrad .............................. True
  tp_comm_bulk_wgrad .............................. True
  tp_comm_overlap ................................. False
  tp_comm_overlap_ag .............................. True
  tp_comm_overlap_cfg ............................. None
  tp_comm_overlap_rs .............................. True
  tp_comm_split_ag ................................ True
  tp_comm_split_rs ................................ True
  train_data_path ................................. None
  train_iters ..................................... 100
  train_samples ................................... None
  transformer_impl ................................ transformer_engine
  transformer_pipeline_model_parallel_size ........ 1
  untie_embeddings_and_output_weights ............. True
  use_checkpoint_args ............................. False
  use_checkpoint_opt_param_scheduler .............. False
  use_cpu_initialization .......................... None
  use_dist_ckpt ................................... False
  use_distributed_optimizer ....................... True
  use_flash_attn .................................. True
  use_mcore_models ................................ True
  use_one_sent_docs ............................... False
  use_ring_exchange_p2p ........................... False
  use_rotary_position_embeddings .................. False
  valid_data_path ................................. None
  variable_seq_lengths ............................ False
  virtual_pipeline_model_parallel_size ............ None
  vision_backbone_type ............................ vit
  vision_pretraining .............................. False
  vision_pretraining_type ......................... classify
  vocab_extra_ids ................................. 0
  vocab_file ...................................... None
  vocab_size ...................................... None
  wandb_exp_name .................................. 
  wandb_project ................................... 
  wandb_save_dir .................................. 
  weight_decay .................................... 0.1
  weight_decay_incr_style ......................... constant
  world_size ...................................... 16
  yaml_cfg ........................................ None
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 16
> building Llama2Tokenizer tokenizer ...
 > padded vocab (size: 32000) with 0 dummy tokens (new size: 32000)
> initializing torch distributed ...
make: Entering directory '.../Megatron-LM/megatron/core/datasets'
make: Nothing to be done for 'default'.
make: Leaving directory '.../Megatron-LM/megatron/core/datasets'
> initialized tensor model parallel with size 2
> initialized pipeline model parallel with size 1
> setting random seeds to 1234 ...
> compiling dataset index builder ...
>>> done with dataset index builder. Compilation time: 0.104 seconds
WARNING: constraints for invoking optimized fused softmax kernel are not met. We default back to unfused kernel invocations.
> compiling and loading fused kernels ...
>>> done with compiling and loading fused kernels. Compilation time: 7.866 seconds
[rank0]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
(identical warning emitted once per rank; ranks 1-15 elided)
time to initialize megatron (seconds): 14.235
[after megatron is initialized] datetime: 2024-04-06 02:54:57 
building GPT model ...
 > number of parameters on (tensor, pipeline) model parallel rank (1, 0): 3622047744
 > number of parameters on (tensor, pipeline) model parallel rank (0, 0): 3622047744
INFO:megatron.core.distributed.param_and_grad_buffer:Number of buckets for gradient all-reduce / reduce-scatter: 1
INFO:megatron.core.distributed.param_and_grad_buffer:Params for bucket 1 (803475456 elements):
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.29.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.23.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.10.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.embedding.word_embeddings.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.final_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.28.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.27.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.21.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.18.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.2.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.2.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.14.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.5.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.30.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.28.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.14.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.8.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.3.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.31.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.25.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.24.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.19.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.17.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.12.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.3.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.2.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.output_layer.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.17.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.8.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.31.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.25.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.22.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.19.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.16.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.15.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.11.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.10.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.0.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.27.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.21.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.15.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.10.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.2.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.27.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.22.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.21.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.18.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.0.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.30.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.20.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.13.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.4.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.30.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.24.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.7.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.29.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.28.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.22.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.16.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.13.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.12.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.6.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.4.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.0.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.31.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.16.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.15.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.12.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.9.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.8.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.6.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.6.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.18.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.26.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.25.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.12.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.4.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.1.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.1.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.19.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.18.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.12.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.9.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.5.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.4.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.3.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.1.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.28.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.26.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.23.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.20.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.7.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.29.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.23.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.19.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.11.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.7.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.5.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.31.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.19.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.14.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.11.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.8.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.3.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.31.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.29.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.25.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.25.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.24.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.22.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.17.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.16.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.14.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.28.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.22.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.11.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.9.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.26.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.24.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.20.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.17.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.11.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.10.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.27.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.21.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.20.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.15.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.13.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.13.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.0.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.24.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.23.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.13.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.10.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.5.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.4.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.0.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.30.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.29.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.27.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.21.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.15.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.6.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.16.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.6.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.30.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.14.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.9.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.3.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.1.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.26.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.9.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.17.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.26.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.20.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.8.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.7.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.5.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.2.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.23.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.18.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.7.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.1.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:Number of buckets for gradient all-reduce / reduce-scatter: 1
INFO:megatron.core.distributed.param_and_grad_buffer:Params for bucket 1 (2818572288 elements):
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.26.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.13.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.12.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.3.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.20.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.11.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.30.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.24.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.18.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.17.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.12.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.4.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.1.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.29.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.23.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.16.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.6.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.5.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.27.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.21.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.15.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.9.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.26.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.20.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.14.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.8.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.1.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.31.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.30.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.25.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.19.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.18.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.0.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.29.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.23.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.22.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.7.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.5.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.0.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.28.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.15.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.3.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.10.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.27.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.21.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.14.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.13.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.8.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.2.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.31.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.25.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.19.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.24.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.22.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.17.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.11.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.7.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.4.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.28.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.16.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.6.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.2.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.10.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer:    module.decoder.layers.9.mlp.experts.weight1
INFO:megatron.core.optimizer:Setting up optimizer with OptimizerConfig(optimizer='adam', lr=0.0001, min_lr=1e-05, decoupled_lr=None, decoupled_min_lr=None, weight_decay=0.1, fp16=False, bf16=True, params_dtype=torch.bfloat16, loss_scale=None, initial_loss_scale=4294967296, min_loss_scale=1.0, loss_scale_window=1000, hysteresis=2, adam_beta1=0.9, adam_beta2=0.999, adam_eps=1e-08, sgd_momentum=0.9, use_distributed_optimizer=True, overlap_grad_reduce=False, overlap_param_gather=False, clip_grad=1.0, log_num_zeros_in_grad=False, barrier_with_L1_time=True, timers=<megatron.core.timers.Timers object at 0x2b366837d3f0>)
> learning rate decay style: cosine
 loading release checkpoint from custom/ckpt/mixtral-8x7b-tp2-ep8-mgg
could not find arguments in the checkpoint ...
 checkpoint version 0
 successfully fixed query-key-values ordering for checkpoint version 0
  successfully loaded checkpoint from custom/ckpt/mixtral-8x7b-tp2-ep8-mgg [ t 0, p 0 ] at iteration 0
> setting tensorboard ...
(min, max) time across ranks (ms):
    load-checkpoint ................................: (8126.15, 8126.65)
[after model, optimizer, and learning rate scheduler are built] datetime: 2024-04-06 02:55:06 
> building train, validation, and test datasets ...
 > datasets target sizes (minimum size):
    train:      12800
    validation: 128
    test:       128
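The train target is simply train_iters times global_batch_size; a quick check (the valid/test targets depend on eval_interval and eval_iters, which are not visible in this excerpt):

```python
# Why the train split targets 12800 samples: Megatron sizes it as
# train_iters * global_batch_size. The valid/test targets additionally
# involve eval_interval / eval_iters, not shown in this excerpt.
train_iters = 100
global_batch_size = 128
print(train_iters * global_batch_size)  # 12800
```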
INFO:megatron.core.datasets.blended_megatron_dataset_config:mock = False
INFO:megatron.core.datasets.blended_megatron_dataset_config:Let split_matrix = [(0, 0.9999), (0.9999, 0.99998), (0.99998, 1.0)]
> building train, validation, and test datasets for GPT ...
INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from custom/data/wudao/wudao_mistralbpe_content_document.idx
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence lengths
INFO:megatron.core.datasets.indexed_dataset:	Extract the sequence pointers
INFO:megatron.core.datasets.indexed_dataset:	Extract the document indices
INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 59132211
INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 59132211
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
INFO:megatron.core.datasets.gpt_dataset:	Load the document index from cc3235b81bd7fd0fa07cabe05d15043d-GPTDataset-document_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from cc3235b81bd7fd0fa07cabe05d15043d-GPTDataset-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from cc3235b81bd7fd0fa07cabe05d15043d-GPTDataset-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 40201537
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset valid indices
INFO:megatron.core.datasets.gpt_dataset:	Load the document index from a625518736b8143e22f4f34c6682183e-GPTDataset-document_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from a625518736b8143e22f4f34c6682183e-GPTDataset-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from a625518736b8143e22f4f34c6682183e-GPTDataset-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 6204
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset test indices
INFO:megatron.core.datasets.gpt_dataset:	Load the document index from 052434ed70ae721ed70b2219cf2deb88-GPTDataset-document_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the sample index from 052434ed70ae721ed70b2219cf2deb88-GPTDataset-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset:	Load the shuffle index from 052434ed70ae721ed70b2219cf2deb88-GPTDataset-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 2332
> finished creating GPT datasets ...
[after dataloaders are built] datetime: 2024-04-06 02:55:07 
done with setup ...
(min, max) time across ranks (ms):
    model-and-optimizer-setup ......................: (8592.94, 8605.02)
    train/valid/test-data-iterators-setup ..........: (569.02, 865.21)
training ...
[before the start of training step] datetime: 2024-04-06 02:55:07 
Number of parameters in transformer layers in billions:  46.44
Number of parameters in embedding layers in billions: 0.26
Total number of parameters in billions: 46.70
Number of parameters in most loaded shard in billions: 23.3510
Theoretical memory footprints: weight and optimizer=167019.40 MB
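That figure is consistent with Megatron's distributed-optimizer accounting, roughly 6 bytes/param resident on every rank (bf16 weight plus fp32 main grad) and 12 bytes/param of fp32 optimizer state sharded over the data-parallel group. A sketch, assuming DP size 8 (world_size 16 over TP 2):

```python
# Reconstructing the "weight and optimizer=167019.40 MB" line. With
# use_distributed_optimizer, Megatron counts ~6 bytes/param on every rank
# plus 12 bytes/param of fp32 optimizer state sharded across DP (DP=8 is
# inferred here from world_size 16 / TP 2, an assumption of this sketch).
params_most_loaded_shard = 23.3510e9   # from the log line above
dp = 8
bytes_per_param = 6 + 12 / dp          # 7.5 bytes/param

footprint_mb = params_most_loaded_shard * bytes_per_param / 2**20
print(f"{footprint_mb:.2f} MB")        # ~167019 MB; tiny gap vs. the log comes from
                                       # the truncated parameter count
```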
[Rank 0] (after 1 iterations) memory (MB) | allocated: 54250.97802734375 | max allocated: 54250.98583984375 | reserved: 61470.0 | max reserved: 61470.0
 [2024-04-06 02:55:39] iteration        1/     100 | consumed samples:          128 | elapsed time per iteration (ms): 32269.4 | throughput per GPU (TFLOP/s/GPU): 40.5 | learning rate: 2.000000E-07 | global batch size:   128 | lm loss: 1.985617E+00 | load_balancing_loss: 1.089786E+00 | loss scale: 1.0 | grad norm: 6.396 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
[Rank 1] (after 1 iterations) memory (MB) | allocated: 54250.97802734375 | max allocated: 54250.98583984375 | reserved: 61480.0 | max reserved: 61480.0
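For readers of the throughput column: Megatron estimates a per-iteration model FLOP count (a num_floating_point_operations() helper in training.py in recent versions, which includes the MoE terms) and normalizes by iteration time and GPU count. A schematic sketch, not the exact source:

```python
# Schematic form of the "throughput per GPU (TFLOP/s/GPU)" column:
# per-iteration FLOPs / elapsed time / world size.
def tflops_per_gpu(flops_per_iteration: float, elapsed_ms: float, world_size: int) -> float:
    return flops_per_iteration / (elapsed_ms / 1e3) / world_size / 1e12

# Working backwards from iteration 2 above: 5630.1 ms at 231.9 TFLOP/s/GPU
# over 16 GPUs implies roughly 5.6301 * 16 * 231.9e12 ~= 2.09e16 FLOPs/iter.
```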
 [2024-04-06 02:55:45] iteration        2/     100 | consumed samples:          256 | elapsed time per iteration (ms): 5630.1 | throughput per GPU (TFLOP/s/GPU): 231.9 | learning rate: 4.000000E-07 | global batch size:   128 | lm loss: 2.021530E+00 | load_balancing_loss: 1.087362E+00 | loss scale: 1.0 | grad norm: 6.895 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:55:50] iteration        3/     100 | consumed samples:          384 | elapsed time per iteration (ms): 5410.6 | throughput per GPU (TFLOP/s/GPU): 241.4 | learning rate: 6.000000E-07 | global batch size:   128 | lm loss: 2.003316E+00 | load_balancing_loss: 1.085377E+00 | loss scale: 1.0 | grad norm: 6.603 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:55:55] iteration        4/     100 | consumed samples:          512 | elapsed time per iteration (ms): 5364.1 | throughput per GPU (TFLOP/s/GPU): 243.5 | learning rate: 8.000000E-07 | global batch size:   128 | lm loss: 2.009657E+00 | load_balancing_loss: 1.091695E+00 | loss scale: 1.0 | grad norm: 6.619 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:56:01] iteration        5/     100 | consumed samples:          640 | elapsed time per iteration (ms): 5496.7 | throughput per GPU (TFLOP/s/GPU): 237.6 | learning rate: 1.000000E-06 | global batch size:   128 | lm loss: 2.002326E+00 | load_balancing_loss: 1.091539E+00 | loss scale: 1.0 | grad norm: 6.612 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:56:06] iteration        6/     100 | consumed samples:          768 | elapsed time per iteration (ms): 5364.8 | throughput per GPU (TFLOP/s/GPU): 243.4 | learning rate: 1.200000E-06 | global batch size:   128 | lm loss: 1.933151E+00 | load_balancing_loss: 1.086472E+00 | loss scale: 1.0 | grad norm: 5.765 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:56:12] iteration        7/     100 | consumed samples:          896 | elapsed time per iteration (ms): 5682.7 | throughput per GPU (TFLOP/s/GPU): 229.8 | learning rate: 1.400000E-06 | global batch size:   128 | lm loss: 2.016085E+00 | load_balancing_loss: 1.085193E+00 | loss scale: 1.0 | grad norm: 5.821 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:56:17] iteration        8/     100 | consumed samples:         1024 | elapsed time per iteration (ms): 5408.6 | throughput per GPU (TFLOP/s/GPU): 241.4 | learning rate: 1.600000E-06 | global batch size:   128 | lm loss: 1.965713E+00 | load_balancing_loss: 1.080933E+00 | loss scale: 1.0 | grad norm: 4.774 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:56:23] iteration        9/     100 | consumed samples:         1152 | elapsed time per iteration (ms): 5590.1 | throughput per GPU (TFLOP/s/GPU): 233.6 | learning rate: 1.800000E-06 | global batch size:   128 | lm loss: 1.919308E+00 | load_balancing_loss: 1.089582E+00 | loss scale: 1.0 | grad norm: 4.267 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:56:28] iteration       10/     100 | consumed samples:         1280 | elapsed time per iteration (ms): 5443.7 | throughput per GPU (TFLOP/s/GPU): 239.9 | learning rate: 2.000000E-06 | global batch size:   128 | lm loss: 1.978377E+00 | load_balancing_loss: 1.089948E+00 | loss scale: 1.0 | grad norm: 4.069 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:56:34] iteration       11/     100 | consumed samples:         1408 | elapsed time per iteration (ms): 5984.1 | throughput per GPU (TFLOP/s/GPU): 218.2 | learning rate: 2.200000E-06 | global batch size:   128 | lm loss: 1.889895E+00 | load_balancing_loss: 1.083618E+00 | loss scale: 1.0 | grad norm: 3.361 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:56:40] iteration       12/     100 | consumed samples:         1536 | elapsed time per iteration (ms): 5821.8 | throughput per GPU (TFLOP/s/GPU): 224.3 | learning rate: 2.400000E-06 | global batch size:   128 | lm loss: 1.932808E+00 | load_balancing_loss: 1.085315E+00 | loss scale: 1.0 | grad norm: 3.336 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:56:46] iteration       13/     100 | consumed samples:         1664 | elapsed time per iteration (ms): 5962.2 | throughput per GPU (TFLOP/s/GPU): 219.0 | learning rate: 2.600000E-06 | global batch size:   128 | lm loss: 1.911683E+00 | load_balancing_loss: 1.079515E+00 | loss scale: 1.0 | grad norm: 3.183 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:56:52] iteration       14/     100 | consumed samples:         1792 | elapsed time per iteration (ms): 5927.4 | throughput per GPU (TFLOP/s/GPU): 220.3 | learning rate: 2.800000E-06 | global batch size:   128 | lm loss: 1.913695E+00 | load_balancing_loss: 1.076165E+00 | loss scale: 1.0 | grad norm: 2.994 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:56:58] iteration       15/     100 | consumed samples:         1920 | elapsed time per iteration (ms): 5926.4 | throughput per GPU (TFLOP/s/GPU): 220.4 | learning rate: 3.000000E-06 | global batch size:   128 | lm loss: 1.957101E+00 | load_balancing_loss: 1.069903E+00 | loss scale: 1.0 | grad norm: 2.853 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:57:04] iteration       16/     100 | consumed samples:         2048 | elapsed time per iteration (ms): 5912.7 | throughput per GPU (TFLOP/s/GPU): 220.9 | learning rate: 3.200000E-06 | global batch size:   128 | lm loss: 1.915763E+00 | load_balancing_loss: 1.065748E+00 | loss scale: 1.0 | grad norm: 2.778 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:57:10] iteration       17/     100 | consumed samples:         2176 | elapsed time per iteration (ms): 5706.3 | throughput per GPU (TFLOP/s/GPU): 228.9 | learning rate: 3.400000E-06 | global batch size:   128 | lm loss: 1.918353E+00 | load_balancing_loss: 1.064678E+00 | loss scale: 1.0 | grad norm: 2.911 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:57:15] iteration       18/     100 | consumed samples:         2304 | elapsed time per iteration (ms): 5732.8 | throughput per GPU (TFLOP/s/GPU): 227.8 | learning rate: 3.600000E-06 | global batch size:   128 | lm loss: 1.861051E+00 | load_balancing_loss: 1.058054E+00 | loss scale: 1.0 | grad norm: 2.449 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:57:21] iteration       19/     100 | consumed samples:         2432 | elapsed time per iteration (ms): 5684.9 | throughput per GPU (TFLOP/s/GPU): 229.7 | learning rate: 3.800000E-06 | global batch size:   128 | lm loss: 1.934895E+00 | load_balancing_loss: 1.049081E+00 | loss scale: 1.0 | grad norm: 2.447 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:57:27] iteration       20/     100 | consumed samples:         2560 | elapsed time per iteration (ms): 5770.6 | throughput per GPU (TFLOP/s/GPU): 226.3 | learning rate: 4.000000E-06 | global batch size:   128 | lm loss: 1.932632E+00 | load_balancing_loss: 1.052491E+00 | loss scale: 1.0 | grad norm: 2.456 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:57:32] iteration       21/     100 | consumed samples:         2688 | elapsed time per iteration (ms): 5541.8 | throughput per GPU (TFLOP/s/GPU): 235.6 | learning rate: 4.200000E-06 | global batch size:   128 | lm loss: 1.904877E+00 | load_balancing_loss: 1.047207E+00 | loss scale: 1.0 | grad norm: 2.213 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:57:38] iteration       22/     100 | consumed samples:         2816 | elapsed time per iteration (ms): 5576.7 | throughput per GPU (TFLOP/s/GPU): 234.2 | learning rate: 4.400000E-06 | global batch size:   128 | lm loss: 1.872380E+00 | load_balancing_loss: 1.039512E+00 | loss scale: 1.0 | grad norm: 2.116 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:57:44] iteration       23/     100 | consumed samples:         2944 | elapsed time per iteration (ms): 5807.4 | throughput per GPU (TFLOP/s/GPU): 224.9 | learning rate: 4.600000E-06 | global batch size:   128 | lm loss: 1.835408E+00 | load_balancing_loss: 1.042104E+00 | loss scale: 1.0 | grad norm: 2.034 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:57:50] iteration       24/     100 | consumed samples:         3072 | elapsed time per iteration (ms): 5727.3 | throughput per GPU (TFLOP/s/GPU): 228.0 | learning rate: 4.800000E-06 | global batch size:   128 | lm loss: 1.898657E+00 | load_balancing_loss: 1.029742E+00 | loss scale: 1.0 | grad norm: 1.982 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:57:55] iteration       25/     100 | consumed samples:         3200 | elapsed time per iteration (ms): 5498.4 | throughput per GPU (TFLOP/s/GPU): 237.5 | learning rate: 5.000000E-06 | global batch size:   128 | lm loss: 1.904866E+00 | load_balancing_loss: 1.034888E+00 | loss scale: 1.0 | grad norm: 1.872 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:58:01] iteration       26/     100 | consumed samples:         3328 | elapsed time per iteration (ms): 5531.7 | throughput per GPU (TFLOP/s/GPU): 236.1 | learning rate: 5.200000E-06 | global batch size:   128 | lm loss: 1.889752E+00 | load_balancing_loss: 1.028931E+00 | loss scale: 1.0 | grad norm: 1.793 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:58:06] iteration       27/     100 | consumed samples:         3456 | elapsed time per iteration (ms): 5678.3 | throughput per GPU (TFLOP/s/GPU): 230.0 | learning rate: 5.400000E-06 | global batch size:   128 | lm loss: 1.866109E+00 | load_balancing_loss: 1.031736E+00 | loss scale: 1.0 | grad norm: 1.773 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:58:12] iteration       28/     100 | consumed samples:         3584 | elapsed time per iteration (ms): 5650.6 | throughput per GPU (TFLOP/s/GPU): 231.1 | learning rate: 5.600000E-06 | global batch size:   128 | lm loss: 1.914117E+00 | load_balancing_loss: 1.027364E+00 | loss scale: 1.0 | grad norm: 1.709 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:58:18] iteration       29/     100 | consumed samples:         3712 | elapsed time per iteration (ms): 5912.1 | throughput per GPU (TFLOP/s/GPU): 220.9 | learning rate: 5.800000E-06 | global batch size:   128 | lm loss: 1.867856E+00 | load_balancing_loss: 1.023825E+00 | loss scale: 1.0 | grad norm: 1.769 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:58:23] iteration       30/     100 | consumed samples:         3840 | elapsed time per iteration (ms): 5571.1 | throughput per GPU (TFLOP/s/GPU): 234.4 | learning rate: 6.000000E-06 | global batch size:   128 | lm loss: 1.924535E+00 | load_balancing_loss: 1.025294E+00 | loss scale: 1.0 | grad norm: 1.572 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:58:29] iteration       31/     100 | consumed samples:         3968 | elapsed time per iteration (ms): 5718.9 | throughput per GPU (TFLOP/s/GPU): 228.3 | learning rate: 6.200000E-06 | global batch size:   128 | lm loss: 1.830754E+00 | load_balancing_loss: 1.028048E+00 | loss scale: 1.0 | grad norm: 1.555 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:58:35] iteration       32/     100 | consumed samples:         4096 | elapsed time per iteration (ms): 5629.3 | throughput per GPU (TFLOP/s/GPU): 232.0 | learning rate: 6.400000E-06 | global batch size:   128 | lm loss: 1.848776E+00 | load_balancing_loss: 1.021549E+00 | loss scale: 1.0 | grad norm: 1.592 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:58:40] iteration       33/     100 | consumed samples:         4224 | elapsed time per iteration (ms): 5600.4 | throughput per GPU (TFLOP/s/GPU): 233.2 | learning rate: 6.600000E-06 | global batch size:   128 | lm loss: 1.917658E+00 | load_balancing_loss: 1.032319E+00 | loss scale: 1.0 | grad norm: 1.519 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:58:46] iteration       34/     100 | consumed samples:         4352 | elapsed time per iteration (ms): 5643.8 | throughput per GPU (TFLOP/s/GPU): 231.4 | learning rate: 6.800000E-06 | global batch size:   128 | lm loss: 1.844636E+00 | load_balancing_loss: 1.019185E+00 | loss scale: 1.0 | grad norm: 1.626 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:58:51] iteration       35/     100 | consumed samples:         4480 | elapsed time per iteration (ms): 5367.8 | throughput per GPU (TFLOP/s/GPU): 243.3 | learning rate: 7.000000E-06 | global batch size:   128 | lm loss: 1.853418E+00 | load_balancing_loss: 1.020990E+00 | loss scale: 1.0 | grad norm: 1.760 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:58:57] iteration       36/     100 | consumed samples:         4608 | elapsed time per iteration (ms): 5399.9 | throughput per GPU (TFLOP/s/GPU): 241.8 | learning rate: 7.200000E-06 | global batch size:   128 | lm loss: 1.842918E+00 | load_balancing_loss: 1.023077E+00 | loss scale: 1.0 | grad norm: 1.409 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:59:02] iteration       37/     100 | consumed samples:         4736 | elapsed time per iteration (ms): 5515.8 | throughput per GPU (TFLOP/s/GPU): 236.8 | learning rate: 7.400000E-06 | global batch size:   128 | lm loss: 1.862270E+00 | load_balancing_loss: 1.023782E+00 | loss scale: 1.0 | grad norm: 1.718 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:59:08] iteration       38/     100 | consumed samples:         4864 | elapsed time per iteration (ms): 5477.8 | throughput per GPU (TFLOP/s/GPU): 238.4 | learning rate: 7.600000E-06 | global batch size:   128 | lm loss: 1.862543E+00 | load_balancing_loss: 1.019304E+00 | loss scale: 1.0 | grad norm: 1.722 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:59:13] iteration       39/     100 | consumed samples:         4992 | elapsed time per iteration (ms): 5649.1 | throughput per GPU (TFLOP/s/GPU): 231.2 | learning rate: 7.800000E-06 | global batch size:   128 | lm loss: 1.863421E+00 | load_balancing_loss: 1.017805E+00 | loss scale: 1.0 | grad norm: 1.469 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:59:19] iteration       40/     100 | consumed samples:         5120 | elapsed time per iteration (ms): 5810.4 | throughput per GPU (TFLOP/s/GPU): 224.8 | learning rate: 8.000000E-06 | global batch size:   128 | lm loss: 1.879655E+00 | load_balancing_loss: 1.017568E+00 | loss scale: 1.0 | grad norm: 1.633 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:59:25] iteration       41/     100 | consumed samples:         5248 | elapsed time per iteration (ms): 5462.9 | throughput per GPU (TFLOP/s/GPU): 239.1 | learning rate: 8.200000E-06 | global batch size:   128 | lm loss: 1.812076E+00 | load_balancing_loss: 1.020508E+00 | loss scale: 1.0 | grad norm: 1.419 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:59:30] iteration       42/     100 | consumed samples:         5376 | elapsed time per iteration (ms): 5452.3 | throughput per GPU (TFLOP/s/GPU): 239.5 | learning rate: 8.400000E-06 | global batch size:   128 | lm loss: 1.824542E+00 | load_balancing_loss: 1.017472E+00 | loss scale: 1.0 | grad norm: 1.400 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:59:36] iteration       43/     100 | consumed samples:         5504 | elapsed time per iteration (ms): 5444.9 | throughput per GPU (TFLOP/s/GPU): 239.8 | learning rate: 8.600000E-06 | global batch size:   128 | lm loss: 1.825991E+00 | load_balancing_loss: 1.019746E+00 | loss scale: 1.0 | grad norm: 1.426 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:59:41] iteration       44/     100 | consumed samples:         5632 | elapsed time per iteration (ms): 5533.8 | throughput per GPU (TFLOP/s/GPU): 236.0 | learning rate: 8.800000E-06 | global batch size:   128 | lm loss: 1.875063E+00 | load_balancing_loss: 1.020033E+00 | loss scale: 1.0 | grad norm: 1.327 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:59:47] iteration       45/     100 | consumed samples:         5760 | elapsed time per iteration (ms): 5718.6 | throughput per GPU (TFLOP/s/GPU): 228.4 | learning rate: 9.000000E-06 | global batch size:   128 | lm loss: 1.834162E+00 | load_balancing_loss: 1.018004E+00 | loss scale: 1.0 | grad norm: 1.611 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:59:52] iteration       46/     100 | consumed samples:         5888 | elapsed time per iteration (ms): 5567.2 | throughput per GPU (TFLOP/s/GPU): 234.6 | learning rate: 9.200000E-06 | global batch size:   128 | lm loss: 1.883577E+00 | load_balancing_loss: 1.016062E+00 | loss scale: 1.0 | grad norm: 1.439 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 02:59:58] iteration       47/     100 | consumed samples:         6016 | elapsed time per iteration (ms): 5692.2 | throughput per GPU (TFLOP/s/GPU): 229.4 | learning rate: 9.400000E-06 | global batch size:   128 | lm loss: 1.836727E+00 | load_balancing_loss: 1.019520E+00 | loss scale: 1.0 | grad norm: 1.372 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:00:04] iteration       48/     100 | consumed samples:         6144 | elapsed time per iteration (ms): 5872.4 | throughput per GPU (TFLOP/s/GPU): 222.4 | learning rate: 9.600000E-06 | global batch size:   128 | lm loss: 1.855191E+00 | load_balancing_loss: 1.017754E+00 | loss scale: 1.0 | grad norm: 1.508 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:00:09] iteration       49/     100 | consumed samples:         6272 | elapsed time per iteration (ms): 5528.7 | throughput per GPU (TFLOP/s/GPU): 236.2 | learning rate: 9.800000E-06 | global batch size:   128 | lm loss: 1.806294E+00 | load_balancing_loss: 1.017504E+00 | loss scale: 1.0 | grad norm: 1.529 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:00:15] iteration       50/     100 | consumed samples:         6400 | elapsed time per iteration (ms): 5531.5 | throughput per GPU (TFLOP/s/GPU): 236.1 | learning rate: 1.000000E-05 | global batch size:   128 | lm loss: 1.887587E+00 | load_balancing_loss: 1.016094E+00 | loss scale: 1.0 | grad norm: 1.439 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:00:21] iteration       51/     100 | consumed samples:         6528 | elapsed time per iteration (ms): 5501.3 | throughput per GPU (TFLOP/s/GPU): 237.4 | learning rate: 1.020000E-05 | global batch size:   128 | lm loss: 1.834414E+00 | load_balancing_loss: 1.015084E+00 | loss scale: 1.0 | grad norm: 1.599 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:00:26] iteration       52/     100 | consumed samples:         6656 | elapsed time per iteration (ms): 5520.9 | throughput per GPU (TFLOP/s/GPU): 236.5 | learning rate: 1.040000E-05 | global batch size:   128 | lm loss: 1.847078E+00 | load_balancing_loss: 1.015950E+00 | loss scale: 1.0 | grad norm: 1.486 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:00:32] iteration       53/     100 | consumed samples:         6784 | elapsed time per iteration (ms): 5711.6 | throughput per GPU (TFLOP/s/GPU): 228.6 | learning rate: 1.060000E-05 | global batch size:   128 | lm loss: 1.862840E+00 | load_balancing_loss: 1.016317E+00 | loss scale: 1.0 | grad norm: 1.522 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:00:37] iteration       54/     100 | consumed samples:         6912 | elapsed time per iteration (ms): 5689.4 | throughput per GPU (TFLOP/s/GPU): 229.5 | learning rate: 1.080000E-05 | global batch size:   128 | lm loss: 1.897956E+00 | load_balancing_loss: 1.017408E+00 | loss scale: 1.0 | grad norm: 1.383 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:00:43] iteration       55/     100 | consumed samples:         7040 | elapsed time per iteration (ms): 5763.8 | throughput per GPU (TFLOP/s/GPU): 226.6 | learning rate: 1.100000E-05 | global batch size:   128 | lm loss: 1.863309E+00 | load_balancing_loss: 1.014457E+00 | loss scale: 1.0 | grad norm: 1.534 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:00:49] iteration       56/     100 | consumed samples:         7168 | elapsed time per iteration (ms): 5742.1 | throughput per GPU (TFLOP/s/GPU): 227.4 | learning rate: 1.120000E-05 | global batch size:   128 | lm loss: 1.899538E+00 | load_balancing_loss: 1.018558E+00 | loss scale: 1.0 | grad norm: 1.470 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:00:54] iteration       57/     100 | consumed samples:         7296 | elapsed time per iteration (ms): 5450.5 | throughput per GPU (TFLOP/s/GPU): 239.6 | learning rate: 1.140000E-05 | global batch size:   128 | lm loss: 1.864605E+00 | load_balancing_loss: 1.015150E+00 | loss scale: 1.0 | grad norm: 1.244 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:01:00] iteration       58/     100 | consumed samples:         7424 | elapsed time per iteration (ms): 5538.9 | throughput per GPU (TFLOP/s/GPU): 235.8 | learning rate: 1.160000E-05 | global batch size:   128 | lm loss: 1.812579E+00 | load_balancing_loss: 1.020851E+00 | loss scale: 1.0 | grad norm: 1.610 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:01:05] iteration       59/     100 | consumed samples:         7552 | elapsed time per iteration (ms): 5410.9 | throughput per GPU (TFLOP/s/GPU): 241.3 | learning rate: 1.180000E-05 | global batch size:   128 | lm loss: 1.848337E+00 | load_balancing_loss: 1.013638E+00 | loss scale: 1.0 | grad norm: 1.351 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:01:11] iteration       60/     100 | consumed samples:         7680 | elapsed time per iteration (ms): 5603.1 | throughput per GPU (TFLOP/s/GPU): 233.1 | learning rate: 1.200000E-05 | global batch size:   128 | lm loss: 1.801180E+00 | load_balancing_loss: 1.019084E+00 | loss scale: 1.0 | grad norm: 1.549 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:01:16] iteration       61/     100 | consumed samples:         7808 | elapsed time per iteration (ms): 5495.5 | throughput per GPU (TFLOP/s/GPU): 237.6 | learning rate: 1.220000E-05 | global batch size:   128 | lm loss: 1.813972E+00 | load_balancing_loss: 1.014779E+00 | loss scale: 1.0 | grad norm: 1.427 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:01:22] iteration       62/     100 | consumed samples:         7936 | elapsed time per iteration (ms): 5753.1 | throughput per GPU (TFLOP/s/GPU): 227.0 | learning rate: 1.240000E-05 | global batch size:   128 | lm loss: 1.808689E+00 | load_balancing_loss: 1.022012E+00 | loss scale: 1.0 | grad norm: 1.398 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:01:28] iteration       63/     100 | consumed samples:         8064 | elapsed time per iteration (ms): 5650.1 | throughput per GPU (TFLOP/s/GPU): 231.1 | learning rate: 1.260000E-05 | global batch size:   128 | lm loss: 1.781526E+00 | load_balancing_loss: 1.013716E+00 | loss scale: 1.0 | grad norm: 1.494 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:01:33] iteration       64/     100 | consumed samples:         8192 | elapsed time per iteration (ms): 5539.7 | throughput per GPU (TFLOP/s/GPU): 235.7 | learning rate: 1.280000E-05 | global batch size:   128 | lm loss: 1.871476E+00 | load_balancing_loss: 1.019044E+00 | loss scale: 1.0 | grad norm: 1.369 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:01:39] iteration       65/     100 | consumed samples:         8320 | elapsed time per iteration (ms): 5493.9 | throughput per GPU (TFLOP/s/GPU): 237.7 | learning rate: 1.300000E-05 | global batch size:   128 | lm loss: 1.846450E+00 | load_balancing_loss: 1.017387E+00 | loss scale: 1.0 | grad norm: 1.308 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:01:44] iteration       66/     100 | consumed samples:         8448 | elapsed time per iteration (ms): 5590.8 | throughput per GPU (TFLOP/s/GPU): 233.6 | learning rate: 1.320000E-05 | global batch size:   128 | lm loss: 1.873755E+00 | load_balancing_loss: 1.014257E+00 | loss scale: 1.0 | grad norm: 1.411 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:01:50] iteration       67/     100 | consumed samples:         8576 | elapsed time per iteration (ms): 5710.3 | throughput per GPU (TFLOP/s/GPU): 228.7 | learning rate: 1.340000E-05 | global batch size:   128 | lm loss: 1.765591E+00 | load_balancing_loss: 1.016482E+00 | loss scale: 1.0 | grad norm: 1.414 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:01:56] iteration       68/     100 | consumed samples:         8704 | elapsed time per iteration (ms): 5734.5 | throughput per GPU (TFLOP/s/GPU): 227.7 | learning rate: 1.360000E-05 | global batch size:   128 | lm loss: 1.839895E+00 | load_balancing_loss: 1.012786E+00 | loss scale: 1.0 | grad norm: 1.371 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:02:01] iteration       69/     100 | consumed samples:         8832 | elapsed time per iteration (ms): 5478.6 | throughput per GPU (TFLOP/s/GPU): 238.4 | learning rate: 1.380000E-05 | global batch size:   128 | lm loss: 1.912256E+00 | load_balancing_loss: 1.013041E+00 | loss scale: 1.0 | grad norm: 1.485 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:02:07] iteration       70/     100 | consumed samples:         8960 | elapsed time per iteration (ms): 5514.8 | throughput per GPU (TFLOP/s/GPU): 236.8 | learning rate: 1.400000E-05 | global batch size:   128 | lm loss: 1.873068E+00 | load_balancing_loss: 1.012509E+00 | loss scale: 1.0 | grad norm: 1.467 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:02:12] iteration       71/     100 | consumed samples:         9088 | elapsed time per iteration (ms): 5361.6 | throughput per GPU (TFLOP/s/GPU): 243.6 | learning rate: 1.420000E-05 | global batch size:   128 | lm loss: 1.818812E+00 | load_balancing_loss: 1.013377E+00 | loss scale: 1.0 | grad norm: 1.300 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:02:18] iteration       72/     100 | consumed samples:         9216 | elapsed time per iteration (ms): 5470.7 | throughput per GPU (TFLOP/s/GPU): 238.7 | learning rate: 1.440000E-05 | global batch size:   128 | lm loss: 1.820313E+00 | load_balancing_loss: 1.019612E+00 | loss scale: 1.0 | grad norm: 1.305 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:02:24] iteration       73/     100 | consumed samples:         9344 | elapsed time per iteration (ms): 5829.9 | throughput per GPU (TFLOP/s/GPU): 224.0 | learning rate: 1.460000E-05 | global batch size:   128 | lm loss: 1.798953E+00 | load_balancing_loss: 1.010977E+00 | loss scale: 1.0 | grad norm: 1.539 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:02:29] iteration       74/     100 | consumed samples:         9472 | elapsed time per iteration (ms): 5702.4 | throughput per GPU (TFLOP/s/GPU): 229.0 | learning rate: 1.480000E-05 | global batch size:   128 | lm loss: 1.774078E+00 | load_balancing_loss: 1.012441E+00 | loss scale: 1.0 | grad norm: 1.471 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:02:35] iteration       75/     100 | consumed samples:         9600 | elapsed time per iteration (ms): 5599.5 | throughput per GPU (TFLOP/s/GPU): 233.2 | learning rate: 1.500000E-05 | global batch size:   128 | lm loss: 1.838492E+00 | load_balancing_loss: 1.015038E+00 | loss scale: 1.0 | grad norm: 1.445 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:02:40] iteration       76/     100 | consumed samples:         9728 | elapsed time per iteration (ms): 5588.2 | throughput per GPU (TFLOP/s/GPU): 233.7 | learning rate: 1.520000E-05 | global batch size:   128 | lm loss: 1.860703E+00 | load_balancing_loss: 1.012689E+00 | loss scale: 1.0 | grad norm: 1.500 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:02:46] iteration       77/     100 | consumed samples:         9856 | elapsed time per iteration (ms): 5425.4 | throughput per GPU (TFLOP/s/GPU): 240.7 | learning rate: 1.540000E-05 | global batch size:   128 | lm loss: 1.827507E+00 | load_balancing_loss: 1.012502E+00 | loss scale: 1.0 | grad norm: 1.491 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:02:52] iteration       78/     100 | consumed samples:         9984 | elapsed time per iteration (ms): 5652.9 | throughput per GPU (TFLOP/s/GPU): 231.0 | learning rate: 1.560000E-05 | global batch size:   128 | lm loss: 1.784492E+00 | load_balancing_loss: 1.013809E+00 | loss scale: 1.0 | grad norm: 1.407 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:02:57] iteration       79/     100 | consumed samples:        10112 | elapsed time per iteration (ms): 5577.0 | throughput per GPU (TFLOP/s/GPU): 234.2 | learning rate: 1.580000E-05 | global batch size:   128 | lm loss: 1.858489E+00 | load_balancing_loss: 1.011662E+00 | loss scale: 1.0 | grad norm: 1.621 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:03:03] iteration       80/     100 | consumed samples:        10240 | elapsed time per iteration (ms): 5712.8 | throughput per GPU (TFLOP/s/GPU): 228.6 | learning rate: 1.600000E-05 | global batch size:   128 | lm loss: 1.842588E+00 | load_balancing_loss: 1.011640E+00 | loss scale: 1.0 | grad norm: 1.631 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:03:09] iteration       81/     100 | consumed samples:        10368 | elapsed time per iteration (ms): 5684.5 | throughput per GPU (TFLOP/s/GPU): 229.7 | learning rate: 1.620000E-05 | global batch size:   128 | lm loss: 1.818980E+00 | load_balancing_loss: 1.012697E+00 | loss scale: 1.0 | grad norm: 1.564 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:03:14] iteration       82/     100 | consumed samples:        10496 | elapsed time per iteration (ms): 5592.0 | throughput per GPU (TFLOP/s/GPU): 233.5 | learning rate: 1.640000E-05 | global batch size:   128 | lm loss: 1.805010E+00 | load_balancing_loss: 1.012805E+00 | loss scale: 1.0 | grad norm: 1.545 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:03:20] iteration       83/     100 | consumed samples:        10624 | elapsed time per iteration (ms): 5641.6 | throughput per GPU (TFLOP/s/GPU): 231.5 | learning rate: 1.660000E-05 | global batch size:   128 | lm loss: 1.812314E+00 | load_balancing_loss: 1.011967E+00 | loss scale: 1.0 | grad norm: 1.530 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:03:25] iteration       84/     100 | consumed samples:        10752 | elapsed time per iteration (ms): 5563.7 | throughput per GPU (TFLOP/s/GPU): 234.7 | learning rate: 1.680000E-05 | global batch size:   128 | lm loss: 1.822110E+00 | load_balancing_loss: 1.009684E+00 | loss scale: 1.0 | grad norm: 1.799 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:03:31] iteration       85/     100 | consumed samples:        10880 | elapsed time per iteration (ms): 5580.9 | throughput per GPU (TFLOP/s/GPU): 234.0 | learning rate: 1.700000E-05 | global batch size:   128 | lm loss: 1.831795E+00 | load_balancing_loss: 1.009344E+00 | loss scale: 1.0 | grad norm: 1.578 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:03:37] iteration       86/     100 | consumed samples:        11008 | elapsed time per iteration (ms): 5695.8 | throughput per GPU (TFLOP/s/GPU): 229.3 | learning rate: 1.720000E-05 | global batch size:   128 | lm loss: 1.831625E+00 | load_balancing_loss: 1.011533E+00 | loss scale: 1.0 | grad norm: 1.515 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:03:42] iteration       87/     100 | consumed samples:        11136 | elapsed time per iteration (ms): 5444.5 | throughput per GPU (TFLOP/s/GPU): 239.9 | learning rate: 1.740000E-05 | global batch size:   128 | lm loss: 1.814374E+00 | load_balancing_loss: 1.010052E+00 | loss scale: 1.0 | grad norm: 1.365 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:03:48] iteration       88/     100 | consumed samples:        11264 | elapsed time per iteration (ms): 5462.7 | throughput per GPU (TFLOP/s/GPU): 239.1 | learning rate: 1.760000E-05 | global batch size:   128 | lm loss: 1.825778E+00 | load_balancing_loss: 1.010838E+00 | loss scale: 1.0 | grad norm: 1.506 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:03:53] iteration       89/     100 | consumed samples:        11392 | elapsed time per iteration (ms): 5633.2 | throughput per GPU (TFLOP/s/GPU): 231.8 | learning rate: 1.780000E-05 | global batch size:   128 | lm loss: 1.818898E+00 | load_balancing_loss: 1.011014E+00 | loss scale: 1.0 | grad norm: 1.358 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:03:59] iteration       90/     100 | consumed samples:        11520 | elapsed time per iteration (ms): 5567.8 | throughput per GPU (TFLOP/s/GPU): 234.5 | learning rate: 1.800000E-05 | global batch size:   128 | lm loss: 1.813602E+00 | load_balancing_loss: 1.022434E+00 | loss scale: 1.0 | grad norm: 1.590 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:04:04] iteration       91/     100 | consumed samples:        11648 | elapsed time per iteration (ms): 5691.9 | throughput per GPU (TFLOP/s/GPU): 229.4 | learning rate: 1.820000E-05 | global batch size:   128 | lm loss: 1.797111E+00 | load_balancing_loss: 1.011964E+00 | loss scale: 1.0 | grad norm: 1.436 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:04:10] iteration       92/     100 | consumed samples:        11776 | elapsed time per iteration (ms): 5451.5 | throughput per GPU (TFLOP/s/GPU): 239.6 | learning rate: 1.840000E-05 | global batch size:   128 | lm loss: 1.809117E+00 | load_balancing_loss: 1.012038E+00 | loss scale: 1.0 | grad norm: 1.577 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:04:15] iteration       93/     100 | consumed samples:        11904 | elapsed time per iteration (ms): 5599.2 | throughput per GPU (TFLOP/s/GPU): 233.2 | learning rate: 1.860000E-05 | global batch size:   128 | lm loss: 1.797812E+00 | load_balancing_loss: 1.011838E+00 | loss scale: 1.0 | grad norm: 1.553 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:04:21] iteration       94/     100 | consumed samples:        12032 | elapsed time per iteration (ms): 5443.7 | throughput per GPU (TFLOP/s/GPU): 239.9 | learning rate: 1.880000E-05 | global batch size:   128 | lm loss: 1.865515E+00 | load_balancing_loss: 1.013109E+00 | loss scale: 1.0 | grad norm: 1.603 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:04:26] iteration       95/     100 | consumed samples:        12160 | elapsed time per iteration (ms): 5540.0 | throughput per GPU (TFLOP/s/GPU): 235.7 | learning rate: 1.900000E-05 | global batch size:   128 | lm loss: 1.845348E+00 | load_balancing_loss: 1.012796E+00 | loss scale: 1.0 | grad norm: 1.599 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:04:32] iteration       96/     100 | consumed samples:        12288 | elapsed time per iteration (ms): 5702.2 | throughput per GPU (TFLOP/s/GPU): 229.0 | learning rate: 1.920000E-05 | global batch size:   128 | lm loss: 1.843516E+00 | load_balancing_loss: 1.010116E+00 | loss scale: 1.0 | grad norm: 1.851 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:04:38] iteration       97/     100 | consumed samples:        12416 | elapsed time per iteration (ms): 5733.2 | throughput per GPU (TFLOP/s/GPU): 227.8 | learning rate: 1.940000E-05 | global batch size:   128 | lm loss: 1.876754E+00 | load_balancing_loss: 1.011542E+00 | loss scale: 1.0 | grad norm: 1.748 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:04:43] iteration       98/     100 | consumed samples:        12544 | elapsed time per iteration (ms): 5556.4 | throughput per GPU (TFLOP/s/GPU): 235.0 | learning rate: 1.960000E-05 | global batch size:   128 | lm loss: 1.810738E+00 | load_balancing_loss: 1.010371E+00 | loss scale: 1.0 | grad norm: 1.472 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:04:49] iteration       99/     100 | consumed samples:        12672 | elapsed time per iteration (ms): 5523.5 | throughput per GPU (TFLOP/s/GPU): 236.4 | learning rate: 1.980000E-05 | global batch size:   128 | lm loss: 1.872008E+00 | load_balancing_loss: 1.008882E+00 | loss scale: 1.0 | grad norm: 1.681 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2024-04-06 03:04:54] iteration      100/     100 | consumed samples:        12800 | elapsed time per iteration (ms): 5540.0 | throughput per GPU (TFLOP/s/GPU): 235.7 | learning rate: 2.000000E-05 | global batch size:   128 | lm loss: 1.824753E+00 | load_balancing_loss: 1.009905E+00 | loss scale: 1.0 | grad norm: 1.625 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
[after training is done] datetime: 2024-04-06 03:04:54

Which modification brings the most speed improvement?
By the way, I ran into some errors when converting Mixtral from Transformers to Megatron with grouped-gemm enabled. Could you share a conversion script?

@ShinoharaHare Could you please share your checkpoint conversion script?

> Which modification brings the most speed improvement? By the way, I ran into some errors when converting Mixtral from Transformers to Megatron with grouped-gemm enabled. Could you share a conversion script?

The most significant speedup comes from resuming from an already-trained checkpoint. If you do not have pretrained weights, you can train from scratch for about 500 steps; we observed that after several hundred steps the token distribution across experts becomes quite balanced.
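You can verify this in the log itself: the load_balancing_loss column trends toward 1.0 as routing balances. Here is a minimal sketch that parses Megatron log lines of the form shown above; the filename train.log is an assumption, and a value of 1.0 corresponds to perfectly uniform token-to-expert routing under the standard aux-loss formulation.

```python
import re

# Track how load_balancing_loss evolves over training by parsing
# Megatron log lines of the form shown above.
pattern = re.compile(r"iteration\s+(\d+)/\s*\d+.*load_balancing_loss: ([0-9.Ee+-]+)")

with open("train.log") as f:  # hypothetical filename; point at your own log
    for line in f:
        match = pattern.search(line)
        if match:
            step, lb = int(match.group(1)), float(match.group(2))
            print(f"step {step:5d}  load_balancing_loss {lb:.6f}")
```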

@yanring, @ShinoharaHare, can you please share a conversion script for Mixtral from HF weights?

> @yanring, @ShinoharaHare, can you please share a conversion script for Mixtral from HF weights?

Hi Vlad, we are working on the converter; it is already in the review process.

hwdef commented

@yanring @ShinoharaHare

Hi, I'm in a similar situation to this issue, with some differences: we use 8 x H800, 64 experts, ep=8, tp=1, pp=1. I also ran into some training-efficiency issues, but they are not my top priority.

What bothers me now is that after I enabled ep=8 and grouped-gemm, my model structure changed.
When I merge the ep=8 model into an ep=1 model, the result loads normally in the inference program, which suggests the merged shapes are correct.

But the inference output is wrong. I would like to know whether Megatron-LM will provide a conversion tool that can merge an ep=8 model into an ep=1 model.

Or could you provide some information on how to merge a grouped-gemm enabled model?

Hello @hwdef, thank you for the update. Currently, GroupedGEMM stores each expert's weight as [input_size, output_size], which differs from the [output_size, input_size] layout used by SequentialMLP's ParallelLinear. Did you transpose the weights during your conversion? @cb521 can take a look if the issue persists.
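For anyone hitting the same mismatch, a minimal sketch of that transpose, assuming hypothetical checkpoint filenames and that expert weights can be identified by an "experts" substring in their key names (adapt both to your actual layout):

```python
import torch

# GroupedGEMM stores each expert weight as [input_size, output_size];
# SequentialMLP's ParallelLinear expects [output_size, input_size],
# so every 2-D expert weight must be transposed during conversion.
state = torch.load("grouped_gemm_model.pt", map_location="cpu")  # hypothetical name
converted = {}
for key, tensor in state.items():
    if "experts" in key and key.endswith("weight"):
        converted[key] = tensor.t().contiguous()  # [in, out] -> [out, in]
    else:
        converted[key] = tensor
torch.save(converted, "sequential_mlp_model.pt")
```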

By the way, we are also working on supporting distributed checkpointing with Grouped GEMM.

hwdef commented

> By the way, we are also working on supporting distributed checkpointing with Grouped GEMM.

Yes, we have accounted for the order of output_size and input_size.
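For reference, our merge does roughly the following; file and key names are made up for illustration, and we assume each rank's checkpoint stacks its 64/8 = 8 local experts along the leading dimension of each expert weight:

```python
import torch

# Merge ep=8 grouped-gemm shards into a single ep=1 checkpoint.
EP_SIZE = 8
shards = [
    torch.load(f"model_rank{rank}.pt", map_location="cpu")  # hypothetical names
    for rank in range(EP_SIZE)
]

merged = {}
for key, tensor in shards[0].items():
    if "experts" in key:
        # Concatenate local experts in rank order: rank 0 holds experts
        # 0-7, rank 1 holds 8-15, and so on.
        merged[key] = torch.cat([s[key] for s in shards], dim=0)
    else:
        merged[key] = tensor  # non-expert weights are replicated across ranks
torch.save(merged, "model_ep1.pt")
```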

hwdef commented

@yanring
Hi, could you please help me check my conversion tool?

> @yanring, @ShinoharaHare, can you please share a conversion script for Mixtral from HF weights?

> Hi Vlad, we are working on the converter; it is already in the review process.

I’m excited about this. When do you plan to merge it into the main branch?

> @yanring Hi, could you please help me check my conversion tool?

@hwdef Hi, I'm running into the same problem. Is there a solution yet?

hwdef commented

> @yanring Hi, could you please help me check my conversion tool?

> @hwdef Hi, I'm running into the same problem. Is there a solution yet?

No, not yet.