[QUESTION] Training Mixtral 8x7B on 16 x H100 only achieves low throughput of 130 TFLOPS
As the title says, I would like to know whether this throughput is normal for this setup.
If it is not, how should I optimize the configuration?
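For reference, the derived values in the logs below can be cross-checked with a few lines of arithmetic. This is only a minimal sketch, not Megatron-LM code: the variable names simply mirror the argument dump. The utilization estimate rests on two assumptions that the logs do not confirm: that the 130 TFLOPS in the title is a per-GPU figure, and that the GPUs are the H100 SXM variant with a dense BF16 peak of roughly 989 TFLOPS.

```python
# Values copied from the argument dump in the logs below.
world_size = 16
tensor_model_parallel_size = 4
pipeline_model_parallel_size = 1
context_parallel_size = 1
global_batch_size = 128
micro_batch_size = 1
make_vocab_size_divisible_by = 128
tokenizer_vocab_size = 32000

# Data-parallel size is what remains of the world size after
# tensor, pipeline, and context parallelism are accounted for.
model_parallel_size = (
    tensor_model_parallel_size
    * pipeline_model_parallel_size
    * context_parallel_size
)
data_parallel_size = world_size // model_parallel_size

# Each data-parallel replica processes its share of the global batch
# in micro-batches of size micro_batch_size.
num_micro_batches = global_batch_size // (micro_batch_size * data_parallel_size)

# The vocabulary is padded up to a multiple of
# make_vocab_size_divisible_by * tensor_model_parallel_size.
multiple = make_vocab_size_divisible_by * tensor_model_parallel_size
padded_vocab_size = -(-tokenizer_vocab_size // multiple) * multiple  # ceiling
dummy_tokens = padded_vocab_size - tokenizer_vocab_size

# ASSUMPTION: 130 TFLOPS is per GPU (taken from the issue title).
reported_tflops_per_gpu = 130.0
# ASSUMPTION: H100 SXM, dense BF16 peak of about 989 TFLOPS.
assumed_peak_bf16_tflops = 989.0
utilization = reported_tflops_per_gpu / assumed_peak_bf16_tflops

print(f"data_parallel_size = {data_parallel_size}")
print(f"num_micro_batches  = {num_micro_batches}")
print(f"padded_vocab_size  = {padded_vocab_size} ({dummy_tokens} dummy tokens)")
print(f"approx. utilization = {utilization:.1%}")
```

The first three values agree with the log lines (`data-parallel size: 4`, `setting number of micro-batches to constant 32`, and the padded vocabulary of 32256). Under the stated assumptions the utilization comes out at roughly 13% of peak, which is why I suspect the configuration rather than the hardware.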
Logs
using world size: 16, data-parallel size: 4, context-parallel size: 1 tensor-model-parallel size: 4, pipeline-model-parallel size: 1
WARNING: overriding default arguments for tokenizer_type:GPT2BPETokenizer with tokenizer_type:Llama2Tokenizer
WARNING: Setting args.overlap_p2p_comm to False since non-interleaved schedule does not support overlapping p2p communication
accumulate and all-reduce gradients in fp32 for bfloat16 data type.
using torch.bfloat16 for parameters ...
------------------------ arguments ------------------------
accumulate_allreduce_grads_in_fp32 .............. True
adam_beta1 ...................................... 0.9
adam_beta2 ...................................... 0.999
adam_eps ........................................ 1e-08
add_bias_linear ................................. False
add_position_embedding .......................... False
add_qkv_bias .................................... False
adlr_autoresume ................................. False
adlr_autoresume_interval ........................ 1000
apply_layernorm_1p .............................. False
apply_query_key_layer_scaling ................... False
apply_residual_connection_post_layernorm ........ False
apply_rope_fusion ............................... True
async_tensor_model_parallel_allreduce ........... False
attention_dropout ............................... 0.0
attention_softmax_in_fp32 ....................... False
auto_detect_ckpt_format ......................... False
barrier_with_L1_time ............................ True
bert_binary_head ................................ True
bert_embedder_type .............................. megatron
bert_load ....................................... None
bf16 ............................................ True
bias_dropout_fusion ............................. True
bias_gelu_fusion ................................ False
bias_swiglu_fusion .............................. True
biencoder_projection_dim ........................ 0
biencoder_shared_query_context_model ............ False
block_data_path ................................. None
check_for_nan_in_loss_and_grad .................. True
classes_fraction ................................ 1.0
clip_grad ....................................... 1.0
clone_scatter_output_in_embedding ............... True
consumed_train_samples .......................... 0
consumed_valid_samples .......................... 0
context_parallel_size ........................... 1
create_attention_mask_in_dataloader ............. True
data_cache_path ................................. None
data_parallel_random_init ....................... False
data_parallel_size .............................. 4
data_path ....................................... []
data_per_class_fraction ......................... 1.0
data_sharding ................................... True
dataloader_type ................................. single
decoder_num_layers .............................. None
decoder_seq_length .............................. None
delay_grad_reduce ............................... True
delay_param_gather .............................. False
dino_bottleneck_size ............................ 256
dino_freeze_last_layer .......................... 1
dino_head_hidden_size ........................... 2048
dino_local_crops_number ......................... 10
dino_local_img_size ............................. 96
dino_norm_last_layer ............................ False
dino_teacher_temp ............................... 0.07
dino_warmup_teacher_temp ........................ 0.04
dino_warmup_teacher_temp_epochs ................. 30
dist_ckpt_format ................................ torch_dist
distribute_saved_activations .................... False
distributed_backend ............................. nccl
distributed_timeout_minutes ..................... 10
embedding_path .................................. None
empty_unused_memory_level ....................... 0
enable_one_logger ............................... False
encoder_num_layers .............................. 32
encoder_seq_length .............................. 2048
end_weight_decay ................................ 0.1
eod_mask_loss ................................... False
eval_interval ................................... 1000
eval_iters ...................................... 10
evidence_data_path .............................. None
exit_duration_in_mins ........................... None
exit_interval ................................... None
exit_on_missing_checkpoint ...................... False
exit_signal_handler ............................. False
expert_model_parallel_size ...................... 4
ffn_hidden_size ................................. 14336
finetune ........................................ False
fp16 ............................................ False
fp16_lm_cross_entropy ........................... False
fp32_residual_connection ........................ False
fp8 ............................................. None
fp8_amax_compute_algo ........................... most_recent
fp8_amax_history_len ............................ 1
fp8_interval .................................... 1
fp8_margin ...................................... 0
fp8_wgrad ....................................... True
global_batch_size ............................... 128
gradient_accumulation_fusion .................... True
group_query_attention ........................... True
head_lr_mult .................................... 1.0
hidden_dropout .................................. 0.0
hidden_size ..................................... 4096
hysteresis ...................................... 2
ict_head_size ................................... None
ict_load ........................................ None
img_h ........................................... 224
img_w ........................................... 224
indexer_batch_size .............................. 128
indexer_log_interval ............................ 1000
inference_batch_times_seqlen_threshold .......... 512
init_method_std ................................. 0.02
init_method_xavier_uniform ...................... False
initial_loss_scale .............................. 4294967296
iter_per_epoch .................................. 1250
kv_channels ..................................... 128
lazy_mpu_init ................................... None
load ............................................ custom/ckpt/mixtral-8x7b
local_rank ...................................... None
log_batch_size_to_tensorboard ................... False
log_interval .................................... 1
log_learning_rate_to_tensorboard ................ True
log_loss_scale_to_tensorboard ................... True
log_memory_to_tensorboard ....................... False
log_num_zeros_in_grad ........................... False
log_params_norm ................................. False
log_progress .................................... True
log_throughput .................................. True
log_timers_to_tensorboard ....................... False
log_validation_ppl_to_tensorboard ............... False
log_world_size_to_tensorboard ................... False
loss_scale ...................................... None
loss_scale_window ............................... 1000
lr .............................................. 0.0001
lr_decay_iters .................................. 320000
lr_decay_samples ................................ None
lr_decay_style .................................. cosine
lr_warmup_fraction .............................. None
lr_warmup_init .................................. 0.0
lr_warmup_iters ................................. 500
lr_warmup_samples ............................... 0
make_vocab_size_divisible_by .................... 128
manual_gc ....................................... False
manual_gc_eval .................................. True
manual_gc_interval .............................. 0
mask_factor ..................................... 1.0
mask_prob ....................................... 0.15
mask_type ....................................... random
masked_softmax_fusion ........................... False
max_position_embeddings ......................... 32768
max_tokens_to_oom ............................... 12000
merge_file ...................................... None
micro_batch_size ................................ 1
min_loss_scale .................................. 1.0
min_lr .......................................... 1e-05
mmap_bin_files .................................. True
mock_data ....................................... True
moe_aux_loss_coeff .............................. 0.01
moe_grouped_gemm ................................ True
moe_input_jitter_eps ............................ None
moe_router_load_balancing_type .................. aux_loss
moe_router_topk ................................. 2
moe_token_dropping .............................. False
moe_z_loss_coeff ................................ None
nccl_communicator_config_path ................... None
no_load_optim ................................... True
no_load_rng ..................................... True
no_persist_layer_norm ........................... False
no_save_optim ................................... None
no_save_rng ..................................... None
norm_epsilon .................................... 1e-05
normalization ................................... RMSNorm
num_attention_heads ............................. 32
num_channels .................................... 3
num_classes ..................................... 1000
num_experts ..................................... 8
num_layers ...................................... 32
num_layers_per_virtual_pipeline_stage ........... None
num_query_groups ................................ 8
num_workers ..................................... 2
one_logger_entity ............................... hwinf_dcm
one_logger_project .............................. e2e-tracking
one_logger_run_name ............................. None
onnx_safe ....................................... None
openai_gelu ..................................... False
optimizer ....................................... adam
output_bert_embeddings .......................... False
overlap_grad_reduce ............................. False
overlap_p2p_comm ................................ False
overlap_param_gather ............................ False
override_opt_param_scheduler .................... False
params_dtype .................................... torch.bfloat16
patch_dim ....................................... 16
perform_initialization .......................... True
pipeline_model_parallel_size .................... 1
pipeline_model_parallel_split_rank .............. None
position_embedding_type ......................... rope
profile ......................................... True
profile_ranks ................................... [0]
profile_step_end ................................ 12
profile_step_start .............................. 10
qk_layernorm .................................... False
query_in_block_prob ............................. 0.1
rampup_batch_size ............................... None
rank ............................................ 0
recompute_granularity ........................... None
recompute_method ................................ None
recompute_num_layers ............................ None
reset_attention_mask ............................ False
reset_position_ids .............................. False
retriever_report_topk_accuracies ................ []
retriever_score_scaling ......................... False
retriever_seq_length ............................ 256
retro_add_retriever ............................. False
retro_attention_gate ............................ 1
retro_cyclic_train_iters ........................ None
retro_encoder_attention_dropout ................. 0.1
retro_encoder_hidden_dropout .................... 0.1
retro_encoder_layers ............................ 2
retro_num_neighbors ............................. 2
retro_num_retrieved_chunks ...................... 2
retro_project_dir ............................... None
retro_verify_neighbor_count ..................... True
rotary_interleaved .............................. False
rotary_percent .................................. 1.0
rotary_seq_len_interpolation_factor ............. None
sample_rate ..................................... 1.0
save ............................................ custom/ckpt/mixtral-8x7b
save_interval ................................... 10000
scatter_gather_tensors_in_pipeline .............. True
seed ............................................ 1234
seq_length ...................................... 2048
sequence_parallel ............................... True
sgd_momentum .................................... 0.9
short_seq_prob .................................. 0.1
skip_train ...................................... False
spec ............................................ None
split ........................................... 99990,8,2
squared_relu .................................... False
standalone_embedding_stage ...................... False
start_weight_decay .............................. 0.1
swiglu .......................................... True
swin_backbone_type .............................. tiny
tensor_model_parallel_size ...................... 4
tensorboard_dir ................................. custom/ckpt/mixtral-8x7b/tensorboard
tensorboard_log_interval ........................ 1
tensorboard_queue_size .......................... 1000
test_data_path .................................. None
test_mode ....................................... False
timing_log_level ................................ 0
timing_log_option ............................... minmax
titles_data_path ................................ None
tokenizer_model ................................. tokenizer.model
tokenizer_type .................................. Llama2Tokenizer
tp_comm_bulk_dgrad .............................. True
tp_comm_bulk_wgrad .............................. True
tp_comm_overlap ................................. False
tp_comm_overlap_cfg ............................. None
tp_comm_split_ag ................................ True
tp_comm_split_rs ................................ True
train_data_path ................................. None
train_iters ..................................... 500000
train_samples ................................... None
transformer_impl ................................ transformer_engine
transformer_pipeline_model_parallel_size ........ 1
untie_embeddings_and_output_weights ............. True
use_checkpoint_args ............................. False
use_checkpoint_opt_param_scheduler .............. False
use_cpu_initialization .......................... True
use_dist_ckpt ................................... False
use_distributed_optimizer ....................... True
use_flash_attn .................................. True
use_gpu_initialization .......................... None
use_mcore_models ................................ True
use_one_sent_docs ............................... False
use_ring_exchange_p2p ........................... False
use_rotary_position_embeddings .................. False
valid_data_path ................................. None
variable_seq_lengths ............................ False
virtual_pipeline_model_parallel_size ............ None
vision_backbone_type ............................ vit
vision_pretraining .............................. False
vision_pretraining_type ......................... classify
vocab_extra_ids ................................. 0
vocab_file ...................................... None
vocab_size ...................................... None
wandb_exp_name .................................. mixtral-8x7b
wandb_project ................................... megatron
wandb_save_dir ..................................
weight_decay .................................... 0.1
weight_decay_incr_style ......................... constant
world_size ...................................... 16
yaml_cfg ........................................ None
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 32
> building Llama2Tokenizer tokenizer ...
> padded vocab (size: 32000) with 256 dummy tokens (new size: 32256)
> initializing torch distributed ...
> initialized tensor model parallel with size 4
> initialized pipeline model parallel with size 1
> setting random seeds to 1234 ...
> compiling dataset index builder ...
>>> done with dataset index builder. Compilation time: 0.087 seconds
WARNING: constraints for invoking optimized fused softmax kernel are not met. We default back to unfused kernel invocations.
> compiling and loading fused kernels ...
>>> done with compiling and loading fused kernels. Compilation time: 7.672 seconds
[rank1]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank8]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank2]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank9]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank10]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank3]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank11]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank4]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank12]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank13]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank14]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank5]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank15]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank6]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank7]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank0]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
time to initialize megatron (seconds): 16.718
[after megatron is initialized] datetime: 2024-03-30 19:59:35
building GPT model ...
> number of parameters on (tensor, pipeline) model parallel rank (3, 0): 3221491712
> number of parameters on (tensor, pipeline) model parallel rank (1, 0): 3221491712
> number of parameters on (tensor, pipeline) model parallel rank (2, 0): 3221491712
> number of parameters on (tensor, pipeline) model parallel rank (0, 0): 3221491712
INFO:megatron.core.distributed.param_and_grad_buffer:Number of buckets for gradient all-reduce / reduce-scatter: 1
INFO:megatron.core.distributed.param_and_grad_buffer:Params for bucket 1 (402919424 elements):
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.22.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.14.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.10.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.8.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.6.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.18.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.31.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.14.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.12.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.8.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.26.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.13.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.31.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.28.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.24.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.21.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.9.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.embedding.word_embeddings.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.17.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.14.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.11.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.10.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.6.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.5.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.5.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.4.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.31.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.19.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.6.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.30.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.23.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.20.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.8.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.7.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.21.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.20.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.12.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.9.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.30.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.27.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.10.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.31.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.27.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.23.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.22.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.19.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.18.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.16.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.13.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.11.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.0.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.30.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.18.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.11.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.29.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.21.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.19.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.16.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.15.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.13.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.12.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.9.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.4.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.3.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.26.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.24.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.21.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.15.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.9.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.4.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.29.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.26.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.17.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.14.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.3.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.1.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.17.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.30.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.25.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.22.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.20.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.18.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.1.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.29.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.23.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.20.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.5.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.1.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.final_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.28.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.15.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.14.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.15.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.0.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.28.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.6.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.2.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.29.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.11.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.7.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.6.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.31.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.28.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.13.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.8.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.1.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.18.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.17.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.27.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.10.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.5.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.2.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.10.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.2.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.27.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.25.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.20.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.16.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.8.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.7.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.4.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.3.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.0.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.28.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.23.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.19.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.13.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.12.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.11.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.30.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.27.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.25.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.7.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.4.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.2.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.0.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.output_layer.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.25.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.22.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.21.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.16.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.9.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.23.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.2.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.0.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.26.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.17.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.26.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.25.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.24.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.24.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.19.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.7.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.5.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.3.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.1.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.29.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.24.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.12.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.22.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.16.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.15.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.3.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer:Number of buckets for gradient all-reduce / reduce-scatter: 1
INFO:megatron.core.distributed.param_and_grad_buffer:Params for bucket 1 (2818572288 elements):
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.28.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.26.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.24.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.20.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.18.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.7.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.4.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.31.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.29.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.26.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.25.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.15.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.19.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.18.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.11.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.6.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.3.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.1.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.2.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.22.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.17.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.15.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.10.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.5.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.27.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.12.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.8.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.7.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.30.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.28.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.24.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.16.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.13.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.31.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.23.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.19.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.6.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.22.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.16.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.14.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.12.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.10.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.9.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.5.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.21.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.13.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.14.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.8.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.29.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.27.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.25.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.20.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.17.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.0.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.30.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.23.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.3.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.1.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.0.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.11.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.21.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.9.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.2.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.4.mlp.experts.weight1
INFO:megatron.core.optimizer:Setting up optimizer with OptimizerConfig(fp16=False, bf16=True, params_dtype=torch.bfloat16, optimizer='adam', lr=0.0001, loss_scale=None, initial_loss_scale=4294967296, min_loss_scale=1.0, loss_scale_window=1000, hysteresis=2, weight_decay=0.1, adam_beta1=0.9, adam_beta2=0.999, adam_eps=1e-08, sgd_momentum=0.9, use_distributed_optimizer=True, overlap_grad_reduce=False, overlap_param_gather=False, clip_grad=1.0, log_num_zeros_in_grad=False, barrier_with_L1_time=True, timers=<megatron.core.timers.Timers object at 0x2b1d5617b280>)
> learning rate decay style: cosine
WARNING: could not find the metadata file custom/ckpt/mixtral-8x7b/latest_checkpointed_iteration.txt
will not load any checkpoints and will start from random
> setting tensorboard ...
(min, max) time across ranks (ms):
load-checkpoint ................................: (0.65, 0.93)
[after model, optimizer, and learning rate scheduler are built] datetime: 2024-03-30 20:00:43
> building train, validation, and test datasets ...
> datasets target sizes (minimum size):
train: 64000000
validation: 641280
test: 1280
INFO:megatron.core.datasets.blended_megatron_dataset_config:mock = True
> building train, validation, and test datasets for GPT ...
> finished creating GPT datasets ...
[after dataloaders are built] datetime: 2024-03-30 20:00:43
done with setup ...
training ...
(min, max) time across ranks (ms):
model-and-optimizer-setup ......................: (67432.26, 67517.36)
train/valid/test-data-iterators-setup ..........: (3.74, 338.85)
[before the start of training step] datetime: 2024-03-30 20:00:43
[Rank 0] (after 1 iterations) memory (MB) | allocated: 51955.275390625 | max allocated: 51955.291015625 | reserved: 62292.0 | max reserved: 62292.0
[Rank 1] (after 1 iterations) memory (MB) | allocated: 51955.275390625 | max allocated: 51955.291015625 | reserved: 62292.0 | max reserved: 62292.0
[2024-03-30 20:01:13] iteration 1/ 500000 | consumed samples: 128 | elapsed time per iteration (ms): 29428.3 | throughput per GPU (TFLOP/s/GPU): 44.4 | learning rate: 2.000E-07 | global batch size: 128 | lm loss: 1.038043E+01 | loss scale: 1.0 | grad norm: 526.452 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[Rank 3] (after 1 iterations) memory (MB) | allocated: 51955.275390625 | max allocated: 51955.291015625 | reserved: 62294.0 | max reserved: 62294.0
[Rank 2] (after 1 iterations) memory (MB) | allocated: 51955.275390625 | max allocated: 51955.291015625 | reserved: 62294.0 | max reserved: 62294.0
[2024-03-30 20:01:22] iteration 2/ 500000 | consumed samples: 256 | elapsed time per iteration (ms): 9845.6 | throughput per GPU (TFLOP/s/GPU): 132.6 | learning rate: 4.000E-07 | global batch size: 128 | lm loss: 1.047649E+01 | loss scale: 1.0 | grad norm: 506.118 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-03-30 20:01:32] iteration 3/ 500000 | consumed samples: 384 | elapsed time per iteration (ms): 9638.2 | throughput per GPU (TFLOP/s/GPU): 135.5 | learning rate: 6.000E-07 | global batch size: 128 | lm loss: 1.027612E+01 | loss scale: 1.0 | grad norm: 519.891 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-03-30 20:01:42] iteration 4/ 500000 | consumed samples: 512 | elapsed time per iteration (ms): 9702.8 | throughput per GPU (TFLOP/s/GPU): 134.6 | learning rate: 8.000E-07 | global batch size: 128 | lm loss: 9.807467E+00 | loss scale: 1.0 | grad norm: 517.413 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-03-30 20:01:51] iteration 5/ 500000 | consumed samples: 640 | elapsed time per iteration (ms): 9683.9 | throughput per GPU (TFLOP/s/GPU): 134.9 | learning rate: 1.000E-06 | global batch size: 128 | lm loss: 7.764119E+00 | loss scale: 1.0 | grad norm: 492.510 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-03-30 20:02:01] iteration 6/ 500000 | consumed samples: 768 | elapsed time per iteration (ms): 9675.6 | throughput per GPU (TFLOP/s/GPU): 135.0 | learning rate: 1.200E-06 | global batch size: 128 | lm loss: 2.630678E+00 | loss scale: 1.0 | grad norm: 323.002 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-03-30 20:02:11] iteration 7/ 500000 | consumed samples: 896 | elapsed time per iteration (ms): 9454.1 | throughput per GPU (TFLOP/s/GPU): 138.1 | learning rate: 1.400E-06 | global batch size: 128 | lm loss: 1.398795E+00 | loss scale: 1.0 | grad norm: 213.771 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-03-30 20:02:20] iteration 8/ 500000 | consumed samples: 1024 | elapsed time per iteration (ms): 9471.4 | throughput per GPU (TFLOP/s/GPU): 137.9 | learning rate: 1.600E-06 | global batch size: 128 | lm loss: 1.726107E+00 | loss scale: 1.0 | grad norm: 420.698 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-03-30 20:02:30] iteration 9/ 500000 | consumed samples: 1152 | elapsed time per iteration (ms): 10085.2 | throughput per GPU (TFLOP/s/GPU): 129.5 | learning rate: 1.800E-06 | global batch size: 128 | lm loss: 2.890289E-01 | loss scale: 1.0 | grad norm: 83.644 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-03-30 20:02:40] iteration 10/ 500000 | consumed samples: 1280 | elapsed time per iteration (ms): 9496.4 | throughput per GPU (TFLOP/s/GPU): 137.5 | learning rate: 2.000E-06 | global batch size: 128 | lm loss: 2.092005E-01 | loss scale: 1.0 | grad norm: 51.010 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-03-30 20:02:50] iteration 11/ 500000 | consumed samples: 1408 | elapsed time per iteration (ms): 10036.9 | throughput per GPU (TFLOP/s/GPU): 130.1 | learning rate: 2.200E-06 | global batch size: 128 | lm loss: 2.352597E-01 | loss scale: 1.0 | grad norm: 106.730 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-03-30 20:03:00] iteration 12/ 500000 | consumed samples: 1536 | elapsed time per iteration (ms): 10198.4 | throughput per GPU (TFLOP/s/GPU): 128.1 | learning rate: 2.400E-06 | global batch size: 128 | lm loss: 7.243721E-01 | loss scale: 1.0 | grad norm: 163.466 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-03-30 20:03:10] iteration 13/ 500000 | consumed samples: 1664 | elapsed time per iteration (ms): 10269.3 | throughput per GPU (TFLOP/s/GPU): 127.2 | learning rate: 2.600E-06 | global batch size: 128 | lm loss: 1.757669E+00 | loss scale: 1.0 | grad norm: 356.809 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-03-30 20:03:21] iteration 14/ 500000 | consumed samples: 1792 | elapsed time per iteration (ms): 10330.7 | throughput per GPU (TFLOP/s/GPU): 126.4 | learning rate: 2.800E-06 | global batch size: 128 | lm loss: 2.853365E-01 | loss scale: 1.0 | grad norm: 93.354 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-03-30 20:03:31] iteration 15/ 500000 | consumed samples: 1920 | elapsed time per iteration (ms): 10106.2 | throughput per GPU (TFLOP/s/GPU): 129.2 | learning rate: 3.000E-06 | global batch size: 128 | lm loss: 5.018836E-01 | loss scale: 1.0 | grad norm: 165.646 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-03-30 20:03:41] iteration 16/ 500000 | consumed samples: 2048 | elapsed time per iteration (ms): 10102.3 | throughput per GPU (TFLOP/s/GPU): 129.3 | learning rate: 3.200E-06 | global batch size: 128 | lm loss: 9.302688E-01 | loss scale: 1.0 | grad norm: 170.065 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-03-30 20:03:51] iteration 17/ 500000 | consumed samples: 2176 | elapsed time per iteration (ms): 9946.1 | throughput per GPU (TFLOP/s/GPU): 131.3 | learning rate: 3.400E-06 | global batch size: 128 | lm loss: 8.015128E-02 | loss scale: 1.0 | grad norm: 47.780 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
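For readers who want to check the "130 TFLOPS" figure against a log like the one above, here is a minimal sketch that averages the per-GPU throughput while skipping the first (warm-up) iteration. The helper name `steady_state_tflops` and the `skip_iters` parameter are made up for illustration; the regular expression only assumes the log line layout shown in this issue.

```python
import re
from statistics import mean

# Captures the iteration number and the per-GPU throughput from one training log line.
LINE_RE = re.compile(
    r"iteration\s+(\d+)/.*?throughput per GPU \(TFLOP/s/GPU\):\s*([0-9.]+)"
)


def steady_state_tflops(log_lines, skip_iters=1):
    """Mean per-GPU throughput, ignoring the first `skip_iters` warm-up iterations."""
    values = []
    for line in log_lines:
        match = LINE_RE.search(line)
        if match and int(match.group(1)) > skip_iters:
            values.append(float(match.group(2)))
    if not values:
        raise ValueError("no throughput entries found after warm-up")
    return mean(values)


# Three lines shortened from the log above (trailing fields dropped).
sample = [
    "[2024-03-30 20:01:13] iteration 1/ 500000 | consumed samples: 128 | elapsed time per iteration (ms): 29428.3 | throughput per GPU (TFLOP/s/GPU): 44.4 |",
    "[2024-03-30 20:01:22] iteration 2/ 500000 | consumed samples: 256 | elapsed time per iteration (ms): 9845.6 | throughput per GPU (TFLOP/s/GPU): 132.6 |",
    "[2024-03-30 20:01:32] iteration 3/ 500000 | consumed samples: 384 | elapsed time per iteration (ms): 9638.2 | throughput per GPU (TFLOP/s/GPU): 135.5 |",
]
print(round(steady_state_tflops(sample), 2))
```

Iteration 1 is excluded because its 29 s step time includes one-off startup cost and would drag the average down.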
Thank you for reporting this issue. 130 TFLOPS is indeed too low for the H100.
I quickly reviewed your script and have some suggestions:
- Update the code to the latest main branch and upgrade grouped_gemm to v1.0.
- Use the alltoall dispatcher: --moe-token-dispatcher-type alltoall.
- Use EP=8 with TP=2 (expert-model-parallel size 8, tensor-model-parallel size 2).
- Train for a while (at least 400 steps) before checking performance, or load a pretrained checkpoint. Early in training the router weights are not yet sufficiently trained, which leads to an imbalanced token distribution across experts.
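As a rough sanity check on the EP=8, TP=2 suggestion, the sketch below derives the data-parallel size from the world size and verifies that the expert count splits evenly. The function `parallel_layout` is hypothetical, not part of Megatron-LM. The rule that EP must divide the data-parallel size is an assumption about how expert-parallel groups were formed at the time, so check it against the version you run.

```python
def parallel_layout(world_size, tp, pp=1, cp=1, ep=1, num_experts=8):
    """Derive the data-parallel size and sanity-check an EP/TP combination."""
    model_parallel = tp * pp * cp
    if world_size % model_parallel:
        raise ValueError("world size must be divisible by TP * PP * CP")
    dp = world_size // model_parallel
    if num_experts % ep:
        raise ValueError("number of experts must be divisible by EP")
    # Assumption: expert-parallel groups are carved out of the data-parallel ranks,
    # so EP has to divide the data-parallel size.
    if dp % ep:
        raise ValueError("data-parallel size must be divisible by EP")
    return {"dp": dp, "experts_per_rank": num_experts // ep}


# Original run: TP=4 on 16 GPUs (EP is not shown here, so it stays at the default of 1).
print(parallel_layout(16, tp=4))
# Suggested run: EP=8 and TP=2 on 16 GPUs.
print(parallel_layout(16, tp=2, ep=8))
```

With 16 GPUs, TP=4 gives a data-parallel size of 4, while TP=2 gives 8, which leaves room for EP=8 with one expert per rank.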
Hi, thanks for the suggestions.
I retested the throughput following your suggestions.
To be more specific:
- Updated Megatron-LM to the latest commit (ba77325)
- Updated grouped_gemm to v1.0.0 (fanshiqing/grouped_gemm@7a7f018)
- Set --moe-token-dispatcher-type alltoall
- Switched to EP=8 and TP=2
- Used the pre-trained weights from Mistral AI (converted from the Hugging Face checkpoint)
The throughput has indeed increased significantly, reaching around 230 TFLOP/s.
However, for H100, it's still pretty low, isn't it?
May I ask, theoretically, what would be a more reasonable throughput?
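One way to put these numbers in context is model FLOPs utilization (MFU): the reported per-GPU throughput divided by the hardware peak. The sketch below assumes the H100 SXM variant and NVIDIA's published dense BF16 peak of roughly 989 TFLOP/s; a PCIe card has a lower peak, so adjust the constant for your hardware. The names `mfu` and `H100_SXM_BF16_DENSE_PEAK_TFLOPS` are illustrative only.

```python
# Assumption: dense (no sparsity) BF16 tensor-core peak of an H100 SXM, in TFLOP/s.
H100_SXM_BF16_DENSE_PEAK_TFLOPS = 989.0


def mfu(achieved_tflops_per_gpu, peak_tflops=H100_SXM_BF16_DENSE_PEAK_TFLOPS):
    """Fraction of the hardware peak reached by the reported per-GPU throughput."""
    if peak_tflops <= 0:
        raise ValueError("peak_tflops must be positive")
    return achieved_tflops_per_gpu / peak_tflops


# Approximate throughput figures quoted in this thread.
for label, tflops in [("TP=4 run", 130.0), ("EP=8, TP=2 run", 230.0)]:
    print(f"{label}: {mfu(tflops):.1%} of peak")
```

Under that assumption the two runs sit at roughly 13% and 23% of peak, which is one way to frame the question of how much headroom remains.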
Here are the logs:
using world size: 16, data-parallel size: 8, context-parallel size: 1 tensor-model-parallel size: 2, pipeline-model-parallel size: 1
WARNING: overriding default arguments for tokenizer_type:GPT2BPETokenizer with tokenizer_type:Llama2Tokenizer
WARNING: Setting args.overlap_p2p_comm to False since non-interleaved schedule does not support overlapping p2p communication
accumulate and all-reduce gradients in fp32 for bfloat16 data type.
using torch.bfloat16 for parameters ...
------------------------ arguments ------------------------
accumulate_allreduce_grads_in_fp32 .............. True
adam_beta1 ...................................... 0.9
adam_beta2 ...................................... 0.999
adam_eps ........................................ 1e-08
add_bias_linear ................................. False
add_position_embedding .......................... True
add_qkv_bias .................................... False
adlr_autoresume ................................. False
adlr_autoresume_interval ........................ 1000
apply_layernorm_1p .............................. False
apply_query_key_layer_scaling ................... False
apply_residual_connection_post_layernorm ........ False
apply_rope_fusion ............................... True
async_tensor_model_parallel_allreduce ........... False
attention_dropout ............................... 0.0
attention_softmax_in_fp32 ....................... False
auto_detect_ckpt_format ......................... False
barrier_with_L1_time ............................ True
bert_binary_head ................................ True
bert_embedder_type .............................. megatron
bert_load ....................................... None
bf16 ............................................ True
bias_dropout_fusion ............................. True
bias_gelu_fusion ................................ False
bias_swiglu_fusion .............................. True
biencoder_projection_dim ........................ 0
biencoder_shared_query_context_model ............ False
block_data_path ................................. None
check_for_nan_in_loss_and_grad .................. True
ckpt_fully_parallel_save ........................ False
ckpt_step ....................................... None
classes_fraction ................................ 1.0
clip_grad ....................................... 1.0
clone_scatter_output_in_embedding ............... True
consumed_train_samples .......................... 0
consumed_valid_samples .......................... 0
context_parallel_size ........................... 1
create_attention_mask_in_dataloader ............. True
data_cache_path ................................. None
data_parallel_random_init ....................... False
data_parallel_size .............................. 8
data_path ....................................... ['custom/data/wudao/wudao_mistralbpe_content_document']
data_per_class_fraction ......................... 1.0
data_sharding ................................... True
dataloader_type ................................. single
decoder_num_layers .............................. None
decoder_seq_length .............................. None
decoupled_lr .................................... None
decoupled_min_lr ................................ None
delay_grad_reduce ............................... True
delay_param_gather .............................. False
dino_bottleneck_size ............................ 256
dino_freeze_last_layer .......................... 1
dino_head_hidden_size ........................... 2048
dino_local_crops_number ......................... 10
dino_local_img_size ............................. 96
dino_norm_last_layer ............................ False
dino_teacher_temp ............................... 0.07
dino_warmup_teacher_temp ........................ 0.04
dino_warmup_teacher_temp_epochs ................. 30
dist_ckpt_format ................................ torch_dist
distribute_saved_activations .................... False
distributed_backend ............................. nccl
distributed_timeout_minutes ..................... 10
embedding_path .................................. None
empty_unused_memory_level ....................... 0
enable_one_logger ............................... False
encoder_num_layers .............................. 32
encoder_seq_length .............................. 2048
end_weight_decay ................................ 0.1
eod_mask_loss ................................... False
eval_interval ................................... 1000
eval_iters ...................................... 1
evidence_data_path .............................. None
exit_duration_in_mins ........................... None
exit_interval ................................... None
exit_on_missing_checkpoint ...................... False
exit_signal_handler ............................. False
expert_model_parallel_size ...................... 8
ffn_hidden_size ................................. 14336
finetune ........................................ False
fp16 ............................................ False
fp16_lm_cross_entropy ........................... False
fp32_residual_connection ........................ False
fp8 ............................................. None
fp8_amax_compute_algo ........................... most_recent
fp8_amax_history_len ............................ 1
fp8_interval .................................... 1
fp8_margin ...................................... 0
fp8_wgrad ....................................... True
global_batch_size ............................... 128
gradient_accumulation_fusion .................... True
group_query_attention ........................... True
head_lr_mult .................................... 1.0
hidden_dropout .................................. 0.0
hidden_size ..................................... 4096
hysteresis ...................................... 2
ict_head_size ................................... None
ict_load ........................................ None
img_h ........................................... 224
img_w ........................................... 224
indexer_batch_size .............................. 128
indexer_log_interval ............................ 1000
inference_batch_times_seqlen_threshold .......... 512
init_method_std ................................. 0.02
init_method_xavier_uniform ...................... False
initial_loss_scale .............................. 4294967296
iter_per_epoch .................................. 1250
kv_channels ..................................... 128
lazy_mpu_init ................................... None
load ............................................ custom/ckpt/mixtral-8x7b-tp2-ep8-mgg
local_rank ...................................... None
log_batch_size_to_tensorboard ................... False
log_interval .................................... 1
log_learning_rate_to_tensorboard ................ True
log_loss_scale_to_tensorboard ................... True
log_memory_to_tensorboard ....................... False
log_num_zeros_in_grad ........................... False
log_params_norm ................................. False
log_progress .................................... True
log_throughput .................................. True
log_timers_to_tensorboard ....................... False
log_validation_ppl_to_tensorboard ............... False
log_world_size_to_tensorboard ................... False
loss_scale ...................................... None
loss_scale_window ............................... 1000
lr .............................................. 0.0001
lr_decay_iters .................................. 320000
lr_decay_samples ................................ None
lr_decay_style .................................. cosine
lr_warmup_fraction .............................. None
lr_warmup_init .................................. 0.0
lr_warmup_iters ................................. 500
lr_warmup_samples ............................... 0
make_vocab_size_divisible_by .................... 128
manual_gc ....................................... False
manual_gc_eval .................................. True
manual_gc_interval .............................. 0
mask_factor ..................................... 1.0
mask_prob ....................................... 0.15
mask_type ....................................... random
masked_softmax_fusion ........................... False
max_position_embeddings ......................... 32768
max_tokens_to_oom ............................... 12000
merge_file ...................................... None
micro_batch_size ................................ 1
min_loss_scale .................................. 1.0
min_lr .......................................... 1e-05
mmap_bin_files .................................. True
mock_data ....................................... False
moe_aux_loss_coeff .............................. 0.01
moe_grouped_gemm ................................ True
moe_input_jitter_eps ............................ None
moe_per_layer_logging ........................... False
moe_router_load_balancing_type .................. aux_loss
moe_router_topk ................................. 2
moe_token_dispatcher_type ....................... alltoall
moe_token_dropping .............................. False
moe_z_loss_coeff ................................ None
nccl_communicator_config_path ................... None
no_load_optim ................................... True
no_load_rng ..................................... True
no_persist_layer_norm ........................... False
no_save_optim ................................... None
no_save_rng ..................................... None
norm_epsilon .................................... 1e-05
normalization ................................... RMSNorm
num_attention_heads ............................. 32
num_channels .................................... 3
num_classes ..................................... 1000
num_experts ..................................... 8
num_layers ...................................... 32
num_layers_per_virtual_pipeline_stage ........... None
num_query_groups ................................ 8
num_workers ..................................... 2
one_logger_entity ............................... hwinf_dcm
one_logger_project .............................. e2e-tracking
one_logger_run_name ............................. None
onnx_safe ....................................... None
openai_gelu ..................................... False
optimizer ....................................... adam
output_bert_embeddings .......................... False
overlap_grad_reduce ............................. False
overlap_p2p_comm ................................ False
overlap_param_gather ............................ False
override_opt_param_scheduler .................... False
params_dtype .................................... torch.bfloat16
patch_dim ....................................... 16
perform_initialization .......................... True
pipeline_model_parallel_size .................... 1
pipeline_model_parallel_split_rank .............. None
position_embedding_type ......................... rope
pretrained_checkpoint ........................... None
profile ......................................... True
profile_ranks ................................... [0]
profile_step_end ................................ 12
profile_step_start .............................. 10
qk_layernorm .................................... False
query_in_block_prob ............................. 0.1
rampup_batch_size ............................... None
rank ............................................ 0
recompute_granularity ........................... None
recompute_method ................................ None
recompute_num_layers ............................ None
reset_attention_mask ............................ False
reset_position_ids .............................. False
retriever_report_topk_accuracies ................ []
retriever_score_scaling ......................... False
retriever_seq_length ............................ 256
retro_add_retriever ............................. False
retro_attention_gate ............................ 1
retro_cyclic_train_iters ........................ None
retro_encoder_attention_dropout ................. 0.1
retro_encoder_hidden_dropout .................... 0.1
retro_encoder_layers ............................ 2
retro_num_neighbors ............................. 2
retro_num_retrieved_chunks ...................... 2
retro_project_dir ............................... None
retro_verify_neighbor_count ..................... True
rotary_interleaved .............................. False
rotary_percent .................................. 1.0
rotary_seq_len_interpolation_factor ............. None
sample_rate ..................................... 1.0
save ............................................ custom/ckpt/mixtral-8x7b-tp2-ep8-mgg
save_interval ................................... 1000
scatter_gather_tensors_in_pipeline .............. True
seed ............................................ 1234
seq_length ...................................... 2048
sequence_parallel ............................... True
sgd_momentum .................................... 0.9
short_seq_prob .................................. 0.1
skip_train ...................................... False
spec ............................................ None
split ........................................... 99990,8,2
squared_relu .................................... False
standalone_embedding_stage ...................... False
start_weight_decay .............................. 0.1
swiglu .......................................... True
swin_backbone_type .............................. tiny
tensor_model_parallel_size ...................... 2
tensorboard_dir ................................. custom/ckpt/mixtral-8x7b-tp2-ep8-mgg/tensorboard
tensorboard_log_interval ........................ 1
tensorboard_queue_size .......................... 1000
test_data_path .................................. None
test_mode ....................................... False
timing_log_level ................................ 0
timing_log_option ............................... minmax
titles_data_path ................................ None
tokenizer_model ................................. custom/ckpt/mixtral-8x7b/tokenizer.model
tokenizer_type .................................. Llama2Tokenizer
tp_comm_bulk_dgrad .............................. True
tp_comm_bulk_wgrad .............................. True
tp_comm_overlap ................................. False
tp_comm_overlap_ag .............................. True
tp_comm_overlap_cfg ............................. None
tp_comm_overlap_rs .............................. True
tp_comm_split_ag ................................ True
tp_comm_split_rs ................................ True
train_data_path ................................. None
train_iters ..................................... 100
train_samples ................................... None
transformer_impl ................................ transformer_engine
transformer_pipeline_model_parallel_size ........ 1
untie_embeddings_and_output_weights ............. True
use_checkpoint_args ............................. False
use_checkpoint_opt_param_scheduler .............. False
use_cpu_initialization .......................... None
use_dist_ckpt ................................... False
use_distributed_optimizer ....................... True
use_flash_attn .................................. True
use_mcore_models ................................ True
use_one_sent_docs ............................... False
use_ring_exchange_p2p ........................... False
use_rotary_position_embeddings .................. False
valid_data_path ................................. None
variable_seq_lengths ............................ False
virtual_pipeline_model_parallel_size ............ None
vision_backbone_type ............................ vit
vision_pretraining .............................. False
vision_pretraining_type ......................... classify
vocab_extra_ids ................................. 0
vocab_file ...................................... None
vocab_size ...................................... None
wandb_exp_name ..................................
wandb_project ...................................
wandb_save_dir ..................................
weight_decay .................................... 0.1
weight_decay_incr_style ......................... constant
world_size ...................................... 16
yaml_cfg ........................................ None
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 16
> building Llama2Tokenizer tokenizer ...
> padded vocab (size: 32000) with 0 dummy tokens (new size: 32000)
> initializing torch distributed ...
make: Entering directory '.../Megatron-LM/megatron/core/datasets'
make: Nothing to be done for 'default'.
make: Leaving directory '.../Megatron-LM/megatron/core/datasets'
> initialized tensor model parallel with size 2
> initialized pipeline model parallel with size 1
> setting random seeds to 1234 ...
> compiling dataset index builder ...
>>> done with dataset index builder. Compilation time: 0.104 seconds
WARNING: constraints for invoking optimized fused softmax kernel are not met. We default back to unfused kernel invocations.
> compiling and loading fused kernels ...
>>> done with compiling and loading fused kernels. Compilation time: 7.866 seconds
[rank1]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank8]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank2]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank9]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank10]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank0]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank3]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank11]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank4]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank12]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank5]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank13]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank6]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank7]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank14]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
[rank15]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator())
time to initialize megatron (seconds): 14.235
[after megatron is initialized] datetime: 2024-04-06 02:54:57
building GPT model ...
> number of parameters on (tensor, pipeline) model parallel rank (1, 0): 3622047744
> number of parameters on (tensor, pipeline) model parallel rank (0, 0): 3622047744
INFO:megatron.core.distributed.param_and_grad_buffer:Number of buckets for gradient all-reduce / reduce-scatter: 1
INFO:megatron.core.distributed.param_and_grad_buffer:Params for bucket 1 (803475456 elements):
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.29.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.23.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.10.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.embedding.word_embeddings.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.final_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.28.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.27.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.21.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.18.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.2.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.2.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.14.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.5.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.30.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.28.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.14.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.8.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.3.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.31.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.25.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.24.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.19.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.17.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.12.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.3.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.2.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.output_layer.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.17.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.8.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.31.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.25.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.22.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.19.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.16.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.15.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.11.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.10.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.0.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.27.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.21.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.15.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.10.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.2.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.27.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.22.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.21.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.18.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.0.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.30.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.20.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.13.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.4.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.30.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.24.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.7.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.29.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.28.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.22.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.16.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.13.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.12.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.6.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.4.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.0.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.31.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.16.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.15.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.12.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.9.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.8.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.6.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.6.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.18.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.26.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.25.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.12.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.4.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.1.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.1.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.19.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.18.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.12.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.9.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.5.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.4.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.3.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.1.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.28.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.26.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.23.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.20.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.7.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.29.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.23.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.19.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.11.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.7.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.5.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.31.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.19.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.14.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.11.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.8.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.3.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.31.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.29.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.25.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.25.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.24.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.22.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.17.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.16.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.14.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.28.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.22.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.11.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.9.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.26.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.24.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.20.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.17.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.11.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.10.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.27.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.21.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.20.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.15.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.13.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.13.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.0.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.24.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.23.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.13.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.10.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.5.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.4.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.0.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.30.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.29.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.27.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.21.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.15.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.6.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.16.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.6.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.30.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.14.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.9.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.3.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.1.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.26.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.9.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.17.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.26.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.20.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.8.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.7.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.5.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.2.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.23.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.18.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.7.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.1.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:Number of buckets for gradient all-reduce / reduce-scatter: 1
INFO:megatron.core.distributed.param_and_grad_buffer:Params for bucket 1 (2818572288 elements):
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.26.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.13.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.12.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.3.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.20.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.11.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.30.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.24.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.18.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.17.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.12.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.4.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.1.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.29.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.23.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.16.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.6.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.5.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.27.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.21.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.15.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.9.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.26.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.20.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.14.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.8.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.1.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.31.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.30.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.25.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.19.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.18.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.0.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.29.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.23.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.22.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.7.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.5.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.0.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.28.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.15.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.3.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.10.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.27.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.21.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.14.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.13.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.8.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.2.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.31.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.25.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.19.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.24.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.22.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.17.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.11.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.7.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.4.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.28.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.16.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.6.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.2.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.10.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.9.mlp.experts.weight1
INFO:megatron.core.optimizer:Setting up optimizer with OptimizerConfig(optimizer='adam', lr=0.0001, min_lr=1e-05, decoupled_lr=None, decoupled_min_lr=None, weight_decay=0.1, fp16=False, bf16=True, params_dtype=torch.bfloat16, loss_scale=None, initial_loss_scale=4294967296, min_loss_scale=1.0, loss_scale_window=1000, hysteresis=2, adam_beta1=0.9, adam_beta2=0.999, adam_eps=1e-08, sgd_momentum=0.9, use_distributed_optimizer=True, overlap_grad_reduce=False, overlap_param_gather=False, clip_grad=1.0, log_num_zeros_in_grad=False, barrier_with_L1_time=True, timers=<megatron.core.timers.Timers object at 0x2b366837d3f0>)
> learning rate decay style: cosine
loading release checkpoint from custom/ckpt/mixtral-8x7b-tp2-ep8-mgg
could not find arguments in the checkpoint ...
checkpoint version 0
succesfully fixed query-key-values ordering for checkpoint version 0
successfully loaded checkpoint from custom/ckpt/mixtral-8x7b-tp2-ep8-mgg [ t 0, p 0 ] at iteration 0
> setting tensorboard ...
(min, max) time across ranks (ms):
load-checkpoint ................................: (8126.15, 8126.65)
[after model, optimizer, and learning rate scheduler are built] datetime: 2024-04-06 02:55:06
> building train, validation, and test datasets ...
> datasets target sizes (minimum size):
train: 12800
validation: 128
test: 128
INFO:megatron.core.datasets.blended_megatron_dataset_config:mock = False
INFO:megatron.core.datasets.blended_megatron_dataset_config:Let split_matrix = [(0, 0.9999), (0.9999, 0.99998), (0.99998, 1.0)]
> building train, validation, and test datasets for GPT ...
INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from custom/data/wudao/wudao_mistralbpe_content_document.idx
INFO:megatron.core.datasets.indexed_dataset: Extract the sequence lengths
INFO:megatron.core.datasets.indexed_dataset: Extract the sequence pointers
INFO:megatron.core.datasets.indexed_dataset: Extract the document indices
INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 59132211
INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 59132211
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
INFO:megatron.core.datasets.gpt_dataset: Load the document index from cc3235b81bd7fd0fa07cabe05d15043d-GPTDataset-document_index.npy
INFO:megatron.core.datasets.gpt_dataset: Load the sample index from cc3235b81bd7fd0fa07cabe05d15043d-GPTDataset-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset: Load the shuffle index from cc3235b81bd7fd0fa07cabe05d15043d-GPTDataset-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 40201537
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset valid indices
INFO:megatron.core.datasets.gpt_dataset: Load the document index from a625518736b8143e22f4f34c6682183e-GPTDataset-document_index.npy
INFO:megatron.core.datasets.gpt_dataset: Load the sample index from a625518736b8143e22f4f34c6682183e-GPTDataset-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset: Load the shuffle index from a625518736b8143e22f4f34c6682183e-GPTDataset-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 6204
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset test indices
INFO:megatron.core.datasets.gpt_dataset: Load the document index from 052434ed70ae721ed70b2219cf2deb88-GPTDataset-document_index.npy
INFO:megatron.core.datasets.gpt_dataset: Load the sample index from 052434ed70ae721ed70b2219cf2deb88-GPTDataset-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset: Load the shuffle index from 052434ed70ae721ed70b2219cf2deb88-GPTDataset-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 2332
> finished creating GPT datasets ...
[after dataloaders are built] datetime: 2024-04-06 02:55:07
done with setup ...
(min, max) time across ranks (ms):
model-and-optimizer-setup ......................: (8592.94, 8605.02)
train/valid/test-data-iterators-setup ..........: (569.02, 865.21)
training ...
[before the start of training step] datetime: 2024-04-06 02:55:07
Number of parameters in transformer layers in billions: 46.44
Number of parameters in embedding layers in billions: 0.26
Total number of parameters in billions: 46.70
Number of parameters in most loaded shard in billions: 23.3510
Theoretical memory footprints: weight and optimizer=167019.40 MB
[Rank 0] (after 1 iterations) memory (MB) | allocated: 54250.97802734375 | max allocated: 54250.98583984375 | reserved: 61470.0 | max reserved: 61470.0
[2024-04-06 02:55:39] iteration 1/ 100 | consumed samples: 128 | elapsed time per iteration (ms): 32269.4 | throughput per GPU (TFLOP/s/GPU): 40.5 | learning rate: 2.000000E-07 | global batch size: 128 | lm loss: 1.985617E+00 | load_balancing_loss: 1.089786E+00 | loss scale: 1.0 | grad norm: 6.396 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[Rank 1] (after 1 iterations) memory (MB) | allocated: 54250.97802734375 | max allocated: 54250.98583984375 | reserved: 61480.0 | max reserved: 61480.0
[2024-04-06 02:55:45] iteration 2/ 100 | consumed samples: 256 | elapsed time per iteration (ms): 5630.1 | throughput per GPU (TFLOP/s/GPU): 231.9 | learning rate: 4.000000E-07 | global batch size: 128 | lm loss: 2.021530E+00 | load_balancing_loss: 1.087362E+00 | loss scale: 1.0 | grad norm: 6.895 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:55:50] iteration 3/ 100 | consumed samples: 384 | elapsed time per iteration (ms): 5410.6 | throughput per GPU (TFLOP/s/GPU): 241.4 | learning rate: 6.000000E-07 | global batch size: 128 | lm loss: 2.003316E+00 | load_balancing_loss: 1.085377E+00 | loss scale: 1.0 | grad norm: 6.603 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:55:55] iteration 4/ 100 | consumed samples: 512 | elapsed time per iteration (ms): 5364.1 | throughput per GPU (TFLOP/s/GPU): 243.5 | learning rate: 8.000000E-07 | global batch size: 128 | lm loss: 2.009657E+00 | load_balancing_loss: 1.091695E+00 | loss scale: 1.0 | grad norm: 6.619 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:56:01] iteration 5/ 100 | consumed samples: 640 | elapsed time per iteration (ms): 5496.7 | throughput per GPU (TFLOP/s/GPU): 237.6 | learning rate: 1.000000E-06 | global batch size: 128 | lm loss: 2.002326E+00 | load_balancing_loss: 1.091539E+00 | loss scale: 1.0 | grad norm: 6.612 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:56:06] iteration 6/ 100 | consumed samples: 768 | elapsed time per iteration (ms): 5364.8 | throughput per GPU (TFLOP/s/GPU): 243.4 | learning rate: 1.200000E-06 | global batch size: 128 | lm loss: 1.933151E+00 | load_balancing_loss: 1.086472E+00 | loss scale: 1.0 | grad norm: 5.765 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:56:12] iteration 7/ 100 | consumed samples: 896 | elapsed time per iteration (ms): 5682.7 | throughput per GPU (TFLOP/s/GPU): 229.8 | learning rate: 1.400000E-06 | global batch size: 128 | lm loss: 2.016085E+00 | load_balancing_loss: 1.085193E+00 | loss scale: 1.0 | grad norm: 5.821 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:56:17] iteration 8/ 100 | consumed samples: 1024 | elapsed time per iteration (ms): 5408.6 | throughput per GPU (TFLOP/s/GPU): 241.4 | learning rate: 1.600000E-06 | global batch size: 128 | lm loss: 1.965713E+00 | load_balancing_loss: 1.080933E+00 | loss scale: 1.0 | grad norm: 4.774 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:56:23] iteration 9/ 100 | consumed samples: 1152 | elapsed time per iteration (ms): 5590.1 | throughput per GPU (TFLOP/s/GPU): 233.6 | learning rate: 1.800000E-06 | global batch size: 128 | lm loss: 1.919308E+00 | load_balancing_loss: 1.089582E+00 | loss scale: 1.0 | grad norm: 4.267 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:56:28] iteration 10/ 100 | consumed samples: 1280 | elapsed time per iteration (ms): 5443.7 | throughput per GPU (TFLOP/s/GPU): 239.9 | learning rate: 2.000000E-06 | global batch size: 128 | lm loss: 1.978377E+00 | load_balancing_loss: 1.089948E+00 | loss scale: 1.0 | grad norm: 4.069 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:56:34] iteration 11/ 100 | consumed samples: 1408 | elapsed time per iteration (ms): 5984.1 | throughput per GPU (TFLOP/s/GPU): 218.2 | learning rate: 2.200000E-06 | global batch size: 128 | lm loss: 1.889895E+00 | load_balancing_loss: 1.083618E+00 | loss scale: 1.0 | grad norm: 3.361 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:56:40] iteration 12/ 100 | consumed samples: 1536 | elapsed time per iteration (ms): 5821.8 | throughput per GPU (TFLOP/s/GPU): 224.3 | learning rate: 2.400000E-06 | global batch size: 128 | lm loss: 1.932808E+00 | load_balancing_loss: 1.085315E+00 | loss scale: 1.0 | grad norm: 3.336 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:56:46] iteration 13/ 100 | consumed samples: 1664 | elapsed time per iteration (ms): 5962.2 | throughput per GPU (TFLOP/s/GPU): 219.0 | learning rate: 2.600000E-06 | global batch size: 128 | lm loss: 1.911683E+00 | load_balancing_loss: 1.079515E+00 | loss scale: 1.0 | grad norm: 3.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:56:52] iteration 14/ 100 | consumed samples: 1792 | elapsed time per iteration (ms): 5927.4 | throughput per GPU (TFLOP/s/GPU): 220.3 | learning rate: 2.800000E-06 | global batch size: 128 | lm loss: 1.913695E+00 | load_balancing_loss: 1.076165E+00 | loss scale: 1.0 | grad norm: 2.994 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:56:58] iteration 15/ 100 | consumed samples: 1920 | elapsed time per iteration (ms): 5926.4 | throughput per GPU (TFLOP/s/GPU): 220.4 | learning rate: 3.000000E-06 | global batch size: 128 | lm loss: 1.957101E+00 | load_balancing_loss: 1.069903E+00 | loss scale: 1.0 | grad norm: 2.853 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:57:04] iteration 16/ 100 | consumed samples: 2048 | elapsed time per iteration (ms): 5912.7 | throughput per GPU (TFLOP/s/GPU): 220.9 | learning rate: 3.200000E-06 | global batch size: 128 | lm loss: 1.915763E+00 | load_balancing_loss: 1.065748E+00 | loss scale: 1.0 | grad norm: 2.778 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:57:10] iteration 17/ 100 | consumed samples: 2176 | elapsed time per iteration (ms): 5706.3 | throughput per GPU (TFLOP/s/GPU): 228.9 | learning rate: 3.400000E-06 | global batch size: 128 | lm loss: 1.918353E+00 | load_balancing_loss: 1.064678E+00 | loss scale: 1.0 | grad norm: 2.911 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:57:15] iteration 18/ 100 | consumed samples: 2304 | elapsed time per iteration (ms): 5732.8 | throughput per GPU (TFLOP/s/GPU): 227.8 | learning rate: 3.600000E-06 | global batch size: 128 | lm loss: 1.861051E+00 | load_balancing_loss: 1.058054E+00 | loss scale: 1.0 | grad norm: 2.449 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:57:21] iteration 19/ 100 | consumed samples: 2432 | elapsed time per iteration (ms): 5684.9 | throughput per GPU (TFLOP/s/GPU): 229.7 | learning rate: 3.800000E-06 | global batch size: 128 | lm loss: 1.934895E+00 | load_balancing_loss: 1.049081E+00 | loss scale: 1.0 | grad norm: 2.447 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:57:27] iteration 20/ 100 | consumed samples: 2560 | elapsed time per iteration (ms): 5770.6 | throughput per GPU (TFLOP/s/GPU): 226.3 | learning rate: 4.000000E-06 | global batch size: 128 | lm loss: 1.932632E+00 | load_balancing_loss: 1.052491E+00 | loss scale: 1.0 | grad norm: 2.456 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:57:32] iteration 21/ 100 | consumed samples: 2688 | elapsed time per iteration (ms): 5541.8 | throughput per GPU (TFLOP/s/GPU): 235.6 | learning rate: 4.200000E-06 | global batch size: 128 | lm loss: 1.904877E+00 | load_balancing_loss: 1.047207E+00 | loss scale: 1.0 | grad norm: 2.213 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:57:38] iteration 22/ 100 | consumed samples: 2816 | elapsed time per iteration (ms): 5576.7 | throughput per GPU (TFLOP/s/GPU): 234.2 | learning rate: 4.400000E-06 | global batch size: 128 | lm loss: 1.872380E+00 | load_balancing_loss: 1.039512E+00 | loss scale: 1.0 | grad norm: 2.116 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:57:44] iteration 23/ 100 | consumed samples: 2944 | elapsed time per iteration (ms): 5807.4 | throughput per GPU (TFLOP/s/GPU): 224.9 | learning rate: 4.600000E-06 | global batch size: 128 | lm loss: 1.835408E+00 | load_balancing_loss: 1.042104E+00 | loss scale: 1.0 | grad norm: 2.034 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:57:50] iteration 24/ 100 | consumed samples: 3072 | elapsed time per iteration (ms): 5727.3 | throughput per GPU (TFLOP/s/GPU): 228.0 | learning rate: 4.800000E-06 | global batch size: 128 | lm loss: 1.898657E+00 | load_balancing_loss: 1.029742E+00 | loss scale: 1.0 | grad norm: 1.982 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:57:55] iteration 25/ 100 | consumed samples: 3200 | elapsed time per iteration (ms): 5498.4 | throughput per GPU (TFLOP/s/GPU): 237.5 | learning rate: 5.000000E-06 | global batch size: 128 | lm loss: 1.904866E+00 | load_balancing_loss: 1.034888E+00 | loss scale: 1.0 | grad norm: 1.872 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:58:01] iteration 26/ 100 | consumed samples: 3328 | elapsed time per iteration (ms): 5531.7 | throughput per GPU (TFLOP/s/GPU): 236.1 | learning rate: 5.200000E-06 | global batch size: 128 | lm loss: 1.889752E+00 | load_balancing_loss: 1.028931E+00 | loss scale: 1.0 | grad norm: 1.793 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:58:06] iteration 27/ 100 | consumed samples: 3456 | elapsed time per iteration (ms): 5678.3 | throughput per GPU (TFLOP/s/GPU): 230.0 | learning rate: 5.400000E-06 | global batch size: 128 | lm loss: 1.866109E+00 | load_balancing_loss: 1.031736E+00 | loss scale: 1.0 | grad norm: 1.773 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:58:12] iteration 28/ 100 | consumed samples: 3584 | elapsed time per iteration (ms): 5650.6 | throughput per GPU (TFLOP/s/GPU): 231.1 | learning rate: 5.600000E-06 | global batch size: 128 | lm loss: 1.914117E+00 | load_balancing_loss: 1.027364E+00 | loss scale: 1.0 | grad norm: 1.709 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:58:18] iteration 29/ 100 | consumed samples: 3712 | elapsed time per iteration (ms): 5912.1 | throughput per GPU (TFLOP/s/GPU): 220.9 | learning rate: 5.800000E-06 | global batch size: 128 | lm loss: 1.867856E+00 | load_balancing_loss: 1.023825E+00 | loss scale: 1.0 | grad norm: 1.769 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:58:23] iteration 30/ 100 | consumed samples: 3840 | elapsed time per iteration (ms): 5571.1 | throughput per GPU (TFLOP/s/GPU): 234.4 | learning rate: 6.000000E-06 | global batch size: 128 | lm loss: 1.924535E+00 | load_balancing_loss: 1.025294E+00 | loss scale: 1.0 | grad norm: 1.572 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:58:29] iteration 31/ 100 | consumed samples: 3968 | elapsed time per iteration (ms): 5718.9 | throughput per GPU (TFLOP/s/GPU): 228.3 | learning rate: 6.200000E-06 | global batch size: 128 | lm loss: 1.830754E+00 | load_balancing_loss: 1.028048E+00 | loss scale: 1.0 | grad norm: 1.555 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:58:35] iteration 32/ 100 | consumed samples: 4096 | elapsed time per iteration (ms): 5629.3 | throughput per GPU (TFLOP/s/GPU): 232.0 | learning rate: 6.400000E-06 | global batch size: 128 | lm loss: 1.848776E+00 | load_balancing_loss: 1.021549E+00 | loss scale: 1.0 | grad norm: 1.592 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:58:40] iteration 33/ 100 | consumed samples: 4224 | elapsed time per iteration (ms): 5600.4 | throughput per GPU (TFLOP/s/GPU): 233.2 | learning rate: 6.600000E-06 | global batch size: 128 | lm loss: 1.917658E+00 | load_balancing_loss: 1.032319E+00 | loss scale: 1.0 | grad norm: 1.519 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:58:46] iteration 34/ 100 | consumed samples: 4352 | elapsed time per iteration (ms): 5643.8 | throughput per GPU (TFLOP/s/GPU): 231.4 | learning rate: 6.800000E-06 | global batch size: 128 | lm loss: 1.844636E+00 | load_balancing_loss: 1.019185E+00 | loss scale: 1.0 | grad norm: 1.626 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:58:51] iteration 35/ 100 | consumed samples: 4480 | elapsed time per iteration (ms): 5367.8 | throughput per GPU (TFLOP/s/GPU): 243.3 | learning rate: 7.000000E-06 | global batch size: 128 | lm loss: 1.853418E+00 | load_balancing_loss: 1.020990E+00 | loss scale: 1.0 | grad norm: 1.760 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:58:57] iteration 36/ 100 | consumed samples: 4608 | elapsed time per iteration (ms): 5399.9 | throughput per GPU (TFLOP/s/GPU): 241.8 | learning rate: 7.200000E-06 | global batch size: 128 | lm loss: 1.842918E+00 | load_balancing_loss: 1.023077E+00 | loss scale: 1.0 | grad norm: 1.409 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:59:02] iteration 37/ 100 | consumed samples: 4736 | elapsed time per iteration (ms): 5515.8 | throughput per GPU (TFLOP/s/GPU): 236.8 | learning rate: 7.400000E-06 | global batch size: 128 | lm loss: 1.862270E+00 | load_balancing_loss: 1.023782E+00 | loss scale: 1.0 | grad norm: 1.718 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:59:08] iteration 38/ 100 | consumed samples: 4864 | elapsed time per iteration (ms): 5477.8 | throughput per GPU (TFLOP/s/GPU): 238.4 | learning rate: 7.600000E-06 | global batch size: 128 | lm loss: 1.862543E+00 | load_balancing_loss: 1.019304E+00 | loss scale: 1.0 | grad norm: 1.722 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:59:13] iteration 39/ 100 | consumed samples: 4992 | elapsed time per iteration (ms): 5649.1 | throughput per GPU (TFLOP/s/GPU): 231.2 | learning rate: 7.800000E-06 | global batch size: 128 | lm loss: 1.863421E+00 | load_balancing_loss: 1.017805E+00 | loss scale: 1.0 | grad norm: 1.469 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:59:19] iteration 40/ 100 | consumed samples: 5120 | elapsed time per iteration (ms): 5810.4 | throughput per GPU (TFLOP/s/GPU): 224.8 | learning rate: 8.000000E-06 | global batch size: 128 | lm loss: 1.879655E+00 | load_balancing_loss: 1.017568E+00 | loss scale: 1.0 | grad norm: 1.633 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:59:25] iteration 41/ 100 | consumed samples: 5248 | elapsed time per iteration (ms): 5462.9 | throughput per GPU (TFLOP/s/GPU): 239.1 | learning rate: 8.200000E-06 | global batch size: 128 | lm loss: 1.812076E+00 | load_balancing_loss: 1.020508E+00 | loss scale: 1.0 | grad norm: 1.419 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:59:30] iteration 42/ 100 | consumed samples: 5376 | elapsed time per iteration (ms): 5452.3 | throughput per GPU (TFLOP/s/GPU): 239.5 | learning rate: 8.400000E-06 | global batch size: 128 | lm loss: 1.824542E+00 | load_balancing_loss: 1.017472E+00 | loss scale: 1.0 | grad norm: 1.400 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:59:36] iteration 43/ 100 | consumed samples: 5504 | elapsed time per iteration (ms): 5444.9 | throughput per GPU (TFLOP/s/GPU): 239.8 | learning rate: 8.600000E-06 | global batch size: 128 | lm loss: 1.825991E+00 | load_balancing_loss: 1.019746E+00 | loss scale: 1.0 | grad norm: 1.426 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:59:41] iteration 44/ 100 | consumed samples: 5632 | elapsed time per iteration (ms): 5533.8 | throughput per GPU (TFLOP/s/GPU): 236.0 | learning rate: 8.800000E-06 | global batch size: 128 | lm loss: 1.875063E+00 | load_balancing_loss: 1.020033E+00 | loss scale: 1.0 | grad norm: 1.327 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:59:47] iteration 45/ 100 | consumed samples: 5760 | elapsed time per iteration (ms): 5718.6 | throughput per GPU (TFLOP/s/GPU): 228.4 | learning rate: 9.000000E-06 | global batch size: 128 | lm loss: 1.834162E+00 | load_balancing_loss: 1.018004E+00 | loss scale: 1.0 | grad norm: 1.611 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:59:52] iteration 46/ 100 | consumed samples: 5888 | elapsed time per iteration (ms): 5567.2 | throughput per GPU (TFLOP/s/GPU): 234.6 | learning rate: 9.200000E-06 | global batch size: 128 | lm loss: 1.883577E+00 | load_balancing_loss: 1.016062E+00 | loss scale: 1.0 | grad norm: 1.439 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:59:58] iteration 47/ 100 | consumed samples: 6016 | elapsed time per iteration (ms): 5692.2 | throughput per GPU (TFLOP/s/GPU): 229.4 | learning rate: 9.400000E-06 | global batch size: 128 | lm loss: 1.836727E+00 | load_balancing_loss: 1.019520E+00 | loss scale: 1.0 | grad norm: 1.372 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:00:04] iteration 48/ 100 | consumed samples: 6144 | elapsed time per iteration (ms): 5872.4 | throughput per GPU (TFLOP/s/GPU): 222.4 | learning rate: 9.600000E-06 | global batch size: 128 | lm loss: 1.855191E+00 | load_balancing_loss: 1.017754E+00 | loss scale: 1.0 | grad norm: 1.508 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:00:09] iteration 49/ 100 | consumed samples: 6272 | elapsed time per iteration (ms): 5528.7 | throughput per GPU (TFLOP/s/GPU): 236.2 | learning rate: 9.800000E-06 | global batch size: 128 | lm loss: 1.806294E+00 | load_balancing_loss: 1.017504E+00 | loss scale: 1.0 | grad norm: 1.529 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:00:15] iteration 50/ 100 | consumed samples: 6400 | elapsed time per iteration (ms): 5531.5 | throughput per GPU (TFLOP/s/GPU): 236.1 | learning rate: 1.000000E-05 | global batch size: 128 | lm loss: 1.887587E+00 | load_balancing_loss: 1.016094E+00 | loss scale: 1.0 | grad norm: 1.439 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:00:21] iteration 51/ 100 | consumed samples: 6528 | elapsed time per iteration (ms): 5501.3 | throughput per GPU (TFLOP/s/GPU): 237.4 | learning rate: 1.020000E-05 | global batch size: 128 | lm loss: 1.834414E+00 | load_balancing_loss: 1.015084E+00 | loss scale: 1.0 | grad norm: 1.599 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:00:26] iteration 52/ 100 | consumed samples: 6656 | elapsed time per iteration (ms): 5520.9 | throughput per GPU (TFLOP/s/GPU): 236.5 | learning rate: 1.040000E-05 | global batch size: 128 | lm loss: 1.847078E+00 | load_balancing_loss: 1.015950E+00 | loss scale: 1.0 | grad norm: 1.486 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:00:32] iteration 53/ 100 | consumed samples: 6784 | elapsed time per iteration (ms): 5711.6 | throughput per GPU (TFLOP/s/GPU): 228.6 | learning rate: 1.060000E-05 | global batch size: 128 | lm loss: 1.862840E+00 | load_balancing_loss: 1.016317E+00 | loss scale: 1.0 | grad norm: 1.522 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:00:37] iteration 54/ 100 | consumed samples: 6912 | elapsed time per iteration (ms): 5689.4 | throughput per GPU (TFLOP/s/GPU): 229.5 | learning rate: 1.080000E-05 | global batch size: 128 | lm loss: 1.897956E+00 | load_balancing_loss: 1.017408E+00 | loss scale: 1.0 | grad norm: 1.383 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:00:43] iteration 55/ 100 | consumed samples: 7040 | elapsed time per iteration (ms): 5763.8 | throughput per GPU (TFLOP/s/GPU): 226.6 | learning rate: 1.100000E-05 | global batch size: 128 | lm loss: 1.863309E+00 | load_balancing_loss: 1.014457E+00 | loss scale: 1.0 | grad norm: 1.534 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:00:49] iteration 56/ 100 | consumed samples: 7168 | elapsed time per iteration (ms): 5742.1 | throughput per GPU (TFLOP/s/GPU): 227.4 | learning rate: 1.120000E-05 | global batch size: 128 | lm loss: 1.899538E+00 | load_balancing_loss: 1.018558E+00 | loss scale: 1.0 | grad norm: 1.470 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:00:54] iteration 57/ 100 | consumed samples: 7296 | elapsed time per iteration (ms): 5450.5 | throughput per GPU (TFLOP/s/GPU): 239.6 | learning rate: 1.140000E-05 | global batch size: 128 | lm loss: 1.864605E+00 | load_balancing_loss: 1.015150E+00 | loss scale: 1.0 | grad norm: 1.244 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:01:00] iteration 58/ 100 | consumed samples: 7424 | elapsed time per iteration (ms): 5538.9 | throughput per GPU (TFLOP/s/GPU): 235.8 | learning rate: 1.160000E-05 | global batch size: 128 | lm loss: 1.812579E+00 | load_balancing_loss: 1.020851E+00 | loss scale: 1.0 | grad norm: 1.610 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:01:05] iteration 59/ 100 | consumed samples: 7552 | elapsed time per iteration (ms): 5410.9 | throughput per GPU (TFLOP/s/GPU): 241.3 | learning rate: 1.180000E-05 | global batch size: 128 | lm loss: 1.848337E+00 | load_balancing_loss: 1.013638E+00 | loss scale: 1.0 | grad norm: 1.351 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:01:11] iteration 60/ 100 | consumed samples: 7680 | elapsed time per iteration (ms): 5603.1 | throughput per GPU (TFLOP/s/GPU): 233.1 | learning rate: 1.200000E-05 | global batch size: 128 | lm loss: 1.801180E+00 | load_balancing_loss: 1.019084E+00 | loss scale: 1.0 | grad norm: 1.549 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:01:16] iteration 61/ 100 | consumed samples: 7808 | elapsed time per iteration (ms): 5495.5 | throughput per GPU (TFLOP/s/GPU): 237.6 | learning rate: 1.220000E-05 | global batch size: 128 | lm loss: 1.813972E+00 | load_balancing_loss: 1.014779E+00 | loss scale: 1.0 | grad norm: 1.427 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:01:22] iteration 62/ 100 | consumed samples: 7936 | elapsed time per iteration (ms): 5753.1 | throughput per GPU (TFLOP/s/GPU): 227.0 | learning rate: 1.240000E-05 | global batch size: 128 | lm loss: 1.808689E+00 | load_balancing_loss: 1.022012E+00 | loss scale: 1.0 | grad norm: 1.398 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:01:28] iteration 63/ 100 | consumed samples: 8064 | elapsed time per iteration (ms): 5650.1 | throughput per GPU (TFLOP/s/GPU): 231.1 | learning rate: 1.260000E-05 | global batch size: 128 | lm loss: 1.781526E+00 | load_balancing_loss: 1.013716E+00 | loss scale: 1.0 | grad norm: 1.494 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:01:33] iteration 64/ 100 | consumed samples: 8192 | elapsed time per iteration (ms): 5539.7 | throughput per GPU (TFLOP/s/GPU): 235.7 | learning rate: 1.280000E-05 | global batch size: 128 | lm loss: 1.871476E+00 | load_balancing_loss: 1.019044E+00 | loss scale: 1.0 | grad norm: 1.369 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:01:39] iteration 65/ 100 | consumed samples: 8320 | elapsed time per iteration (ms): 5493.9 | throughput per GPU (TFLOP/s/GPU): 237.7 | learning rate: 1.300000E-05 | global batch size: 128 | lm loss: 1.846450E+00 | load_balancing_loss: 1.017387E+00 | loss scale: 1.0 | grad norm: 1.308 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:01:44] iteration 66/ 100 | consumed samples: 8448 | elapsed time per iteration (ms): 5590.8 | throughput per GPU (TFLOP/s/GPU): 233.6 | learning rate: 1.320000E-05 | global batch size: 128 | lm loss: 1.873755E+00 | load_balancing_loss: 1.014257E+00 | loss scale: 1.0 | grad norm: 1.411 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:01:50] iteration 67/ 100 | consumed samples: 8576 | elapsed time per iteration (ms): 5710.3 | throughput per GPU (TFLOP/s/GPU): 228.7 | learning rate: 1.340000E-05 | global batch size: 128 | lm loss: 1.765591E+00 | load_balancing_loss: 1.016482E+00 | loss scale: 1.0 | grad norm: 1.414 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:01:56] iteration 68/ 100 | consumed samples: 8704 | elapsed time per iteration (ms): 5734.5 | throughput per GPU (TFLOP/s/GPU): 227.7 | learning rate: 1.360000E-05 | global batch size: 128 | lm loss: 1.839895E+00 | load_balancing_loss: 1.012786E+00 | loss scale: 1.0 | grad norm: 1.371 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:02:01] iteration 69/ 100 | consumed samples: 8832 | elapsed time per iteration (ms): 5478.6 | throughput per GPU (TFLOP/s/GPU): 238.4 | learning rate: 1.380000E-05 | global batch size: 128 | lm loss: 1.912256E+00 | load_balancing_loss: 1.013041E+00 | loss scale: 1.0 | grad norm: 1.485 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:02:07] iteration 70/ 100 | consumed samples: 8960 | elapsed time per iteration (ms): 5514.8 | throughput per GPU (TFLOP/s/GPU): 236.8 | learning rate: 1.400000E-05 | global batch size: 128 | lm loss: 1.873068E+00 | load_balancing_loss: 1.012509E+00 | loss scale: 1.0 | grad norm: 1.467 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:02:12] iteration 71/ 100 | consumed samples: 9088 | elapsed time per iteration (ms): 5361.6 | throughput per GPU (TFLOP/s/GPU): 243.6 | learning rate: 1.420000E-05 | global batch size: 128 | lm loss: 1.818812E+00 | load_balancing_loss: 1.013377E+00 | loss scale: 1.0 | grad norm: 1.300 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:02:18] iteration 72/ 100 | consumed samples: 9216 | elapsed time per iteration (ms): 5470.7 | throughput per GPU (TFLOP/s/GPU): 238.7 | learning rate: 1.440000E-05 | global batch size: 128 | lm loss: 1.820313E+00 | load_balancing_loss: 1.019612E+00 | loss scale: 1.0 | grad norm: 1.305 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:02:24] iteration 73/ 100 | consumed samples: 9344 | elapsed time per iteration (ms): 5829.9 | throughput per GPU (TFLOP/s/GPU): 224.0 | learning rate: 1.460000E-05 | global batch size: 128 | lm loss: 1.798953E+00 | load_balancing_loss: 1.010977E+00 | loss scale: 1.0 | grad norm: 1.539 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:02:29] iteration 74/ 100 | consumed samples: 9472 | elapsed time per iteration (ms): 5702.4 | throughput per GPU (TFLOP/s/GPU): 229.0 | learning rate: 1.480000E-05 | global batch size: 128 | lm loss: 1.774078E+00 | load_balancing_loss: 1.012441E+00 | loss scale: 1.0 | grad norm: 1.471 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:02:35] iteration 75/ 100 | consumed samples: 9600 | elapsed time per iteration (ms): 5599.5 | throughput per GPU (TFLOP/s/GPU): 233.2 | learning rate: 1.500000E-05 | global batch size: 128 | lm loss: 1.838492E+00 | load_balancing_loss: 1.015038E+00 | loss scale: 1.0 | grad norm: 1.445 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:02:40] iteration 76/ 100 | consumed samples: 9728 | elapsed time per iteration (ms): 5588.2 | throughput per GPU (TFLOP/s/GPU): 233.7 | learning rate: 1.520000E-05 | global batch size: 128 | lm loss: 1.860703E+00 | load_balancing_loss: 1.012689E+00 | loss scale: 1.0 | grad norm: 1.500 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:02:46] iteration 77/ 100 | consumed samples: 9856 | elapsed time per iteration (ms): 5425.4 | throughput per GPU (TFLOP/s/GPU): 240.7 | learning rate: 1.540000E-05 | global batch size: 128 | lm loss: 1.827507E+00 | load_balancing_loss: 1.012502E+00 | loss scale: 1.0 | grad norm: 1.491 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:02:52] iteration 78/ 100 | consumed samples: 9984 | elapsed time per iteration (ms): 5652.9 | throughput per GPU (TFLOP/s/GPU): 231.0 | learning rate: 1.560000E-05 | global batch size: 128 | lm loss: 1.784492E+00 | load_balancing_loss: 1.013809E+00 | loss scale: 1.0 | grad norm: 1.407 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:02:57] iteration 79/ 100 | consumed samples: 10112 | elapsed time per iteration (ms): 5577.0 | throughput per GPU (TFLOP/s/GPU): 234.2 | learning rate: 1.580000E-05 | global batch size: 128 | lm loss: 1.858489E+00 | load_balancing_loss: 1.011662E+00 | loss scale: 1.0 | grad norm: 1.621 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:03:03] iteration 80/ 100 | consumed samples: 10240 | elapsed time per iteration (ms): 5712.8 | throughput per GPU (TFLOP/s/GPU): 228.6 | learning rate: 1.600000E-05 | global batch size: 128 | lm loss: 1.842588E+00 | load_balancing_loss: 1.011640E+00 | loss scale: 1.0 | grad norm: 1.631 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:03:09] iteration 81/ 100 | consumed samples: 10368 | elapsed time per iteration (ms): 5684.5 | throughput per GPU (TFLOP/s/GPU): 229.7 | learning rate: 1.620000E-05 | global batch size: 128 | lm loss: 1.818980E+00 | load_balancing_loss: 1.012697E+00 | loss scale: 1.0 | grad norm: 1.564 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:03:14] iteration 82/ 100 | consumed samples: 10496 | elapsed time per iteration (ms): 5592.0 | throughput per GPU (TFLOP/s/GPU): 233.5 | learning rate: 1.640000E-05 | global batch size: 128 | lm loss: 1.805010E+00 | load_balancing_loss: 1.012805E+00 | loss scale: 1.0 | grad norm: 1.545 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:03:20] iteration 83/ 100 | consumed samples: 10624 | elapsed time per iteration (ms): 5641.6 | throughput per GPU (TFLOP/s/GPU): 231.5 | learning rate: 1.660000E-05 | global batch size: 128 | lm loss: 1.812314E+00 | load_balancing_loss: 1.011967E+00 | loss scale: 1.0 | grad norm: 1.530 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:03:25] iteration 84/ 100 | consumed samples: 10752 | elapsed time per iteration (ms): 5563.7 | throughput per GPU (TFLOP/s/GPU): 234.7 | learning rate: 1.680000E-05 | global batch size: 128 | lm loss: 1.822110E+00 | load_balancing_loss: 1.009684E+00 | loss scale: 1.0 | grad norm: 1.799 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:03:31] iteration 85/ 100 | consumed samples: 10880 | elapsed time per iteration (ms): 5580.9 | throughput per GPU (TFLOP/s/GPU): 234.0 | learning rate: 1.700000E-05 | global batch size: 128 | lm loss: 1.831795E+00 | load_balancing_loss: 1.009344E+00 | loss scale: 1.0 | grad norm: 1.578 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:03:37] iteration 86/ 100 | consumed samples: 11008 | elapsed time per iteration (ms): 5695.8 | throughput per GPU (TFLOP/s/GPU): 229.3 | learning rate: 1.720000E-05 | global batch size: 128 | lm loss: 1.831625E+00 | load_balancing_loss: 1.011533E+00 | loss scale: 1.0 | grad norm: 1.515 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:03:42] iteration 87/ 100 | consumed samples: 11136 | elapsed time per iteration (ms): 5444.5 | throughput per GPU (TFLOP/s/GPU): 239.9 | learning rate: 1.740000E-05 | global batch size: 128 | lm loss: 1.814374E+00 | load_balancing_loss: 1.010052E+00 | loss scale: 1.0 | grad norm: 1.365 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:03:48] iteration 88/ 100 | consumed samples: 11264 | elapsed time per iteration (ms): 5462.7 | throughput per GPU (TFLOP/s/GPU): 239.1 | learning rate: 1.760000E-05 | global batch size: 128 | lm loss: 1.825778E+00 | load_balancing_loss: 1.010838E+00 | loss scale: 1.0 | grad norm: 1.506 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:03:53] iteration 89/ 100 | consumed samples: 11392 | elapsed time per iteration (ms): 5633.2 | throughput per GPU (TFLOP/s/GPU): 231.8 | learning rate: 1.780000E-05 | global batch size: 128 | lm loss: 1.818898E+00 | load_balancing_loss: 1.011014E+00 | loss scale: 1.0 | grad norm: 1.358 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:03:59] iteration 90/ 100 | consumed samples: 11520 | elapsed time per iteration (ms): 5567.8 | throughput per GPU (TFLOP/s/GPU): 234.5 | learning rate: 1.800000E-05 | global batch size: 128 | lm loss: 1.813602E+00 | load_balancing_loss: 1.022434E+00 | loss scale: 1.0 | grad norm: 1.590 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:04:04] iteration 91/ 100 | consumed samples: 11648 | elapsed time per iteration (ms): 5691.9 | throughput per GPU (TFLOP/s/GPU): 229.4 | learning rate: 1.820000E-05 | global batch size: 128 | lm loss: 1.797111E+00 | load_balancing_loss: 1.011964E+00 | loss scale: 1.0 | grad norm: 1.436 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:04:10] iteration 92/ 100 | consumed samples: 11776 | elapsed time per iteration (ms): 5451.5 | throughput per GPU (TFLOP/s/GPU): 239.6 | learning rate: 1.840000E-05 | global batch size: 128 | lm loss: 1.809117E+00 | load_balancing_loss: 1.012038E+00 | loss scale: 1.0 | grad norm: 1.577 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:04:15] iteration 93/ 100 | consumed samples: 11904 | elapsed time per iteration (ms): 5599.2 | throughput per GPU (TFLOP/s/GPU): 233.2 | learning rate: 1.860000E-05 | global batch size: 128 | lm loss: 1.797812E+00 | load_balancing_loss: 1.011838E+00 | loss scale: 1.0 | grad norm: 1.553 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:04:21] iteration 94/ 100 | consumed samples: 12032 | elapsed time per iteration (ms): 5443.7 | throughput per GPU (TFLOP/s/GPU): 239.9 | learning rate: 1.880000E-05 | global batch size: 128 | lm loss: 1.865515E+00 | load_balancing_loss: 1.013109E+00 | loss scale: 1.0 | grad norm: 1.603 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:04:26] iteration 95/ 100 | consumed samples: 12160 | elapsed time per iteration (ms): 5540.0 | throughput per GPU (TFLOP/s/GPU): 235.7 | learning rate: 1.900000E-05 | global batch size: 128 | lm loss: 1.845348E+00 | load_balancing_loss: 1.012796E+00 | loss scale: 1.0 | grad norm: 1.599 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:04:32] iteration 96/ 100 | consumed samples: 12288 | elapsed time per iteration (ms): 5702.2 | throughput per GPU (TFLOP/s/GPU): 229.0 | learning rate: 1.920000E-05 | global batch size: 128 | lm loss: 1.843516E+00 | load_balancing_loss: 1.010116E+00 | loss scale: 1.0 | grad norm: 1.851 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:04:38] iteration 97/ 100 | consumed samples: 12416 | elapsed time per iteration (ms): 5733.2 | throughput per GPU (TFLOP/s/GPU): 227.8 | learning rate: 1.940000E-05 | global batch size: 128 | lm loss: 1.876754E+00 | load_balancing_loss: 1.011542E+00 | loss scale: 1.0 | grad norm: 1.748 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:04:43] iteration 98/ 100 | consumed samples: 12544 | elapsed time per iteration (ms): 5556.4 | throughput per GPU (TFLOP/s/GPU): 235.0 | learning rate: 1.960000E-05 | global batch size: 128 | lm loss: 1.810738E+00 | load_balancing_loss: 1.010371E+00 | loss scale: 1.0 | grad norm: 1.472 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:04:49] iteration 99/ 100 | consumed samples: 12672 | elapsed time per iteration (ms): 5523.5 | throughput per GPU (TFLOP/s/GPU): 236.4 | learning rate: 1.980000E-05 | global batch size: 128 | lm loss: 1.872008E+00 | load_balancing_loss: 1.008882E+00 | loss scale: 1.0 | grad norm: 1.681 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:04:54] iteration 100/ 100 | consumed samples: 12800 | elapsed time per iteration (ms): 5540.0 | throughput per GPU (TFLOP/s/GPU): 235.7 | learning rate: 2.000000E-05 | global batch size: 128 | lm loss: 1.824753E+00 | load_balancing_loss: 1.009905E+00 | loss scale: 1.0 | grad norm: 1.625 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[after training is done] datetime: 2024-04-06 03:04:54
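To compare runs without eyeballing every line, the iteration logs above can be summarized with a small script. This is only a sketch: the regular expression is written against the log format shown in this thread, the two sample lines are copied from it, and the helper name `summarize` is made up for illustration. The tokens-per-second figure assumes the global batch size of 128 and sequence length of 2048 reported for this run.

```python
import re
from statistics import mean

# Matches the two fields of interest in an iteration log line as printed in this thread.
LINE_RE = re.compile(
    r"elapsed time per iteration \(ms\): (?P<ms>[\d.]+) \| "
    r"throughput per GPU \(TFLOP/s/GPU\): (?P<tflops>[\d.]+)"
)


def summarize(lines):
    """Return (mean ms per iteration, mean TFLOP/s/GPU) over the matching lines."""
    matches = [m for m in (LINE_RE.search(line) for line in lines) if m]
    ms = [float(m["ms"]) for m in matches]
    tflops = [float(m["tflops"]) for m in matches]
    return mean(ms), mean(tflops)


# Two lines copied from the log above (trailing fields trimmed).
sample = [
    "iteration 99/ 100 | consumed samples: 12672 | elapsed time per iteration (ms): 5523.5 | "
    "throughput per GPU (TFLOP/s/GPU): 236.4 | learning rate: 1.980000E-05 |",
    "iteration 100/ 100 | consumed samples: 12800 | elapsed time per iteration (ms): 5540.0 | "
    "throughput per GPU (TFLOP/s/GPU): 235.7 | learning rate: 2.000000E-05 |",
]

avg_ms, avg_tflops = summarize(sample)
tokens_per_iter = 128 * 2048  # global batch size x sequence length of this run
tokens_per_second = tokens_per_iter / (avg_ms / 1000.0)
print(f"{avg_ms:.2f} ms/iter, {avg_tflops:.2f} TFLOP/s/GPU, {tokens_per_second:.0f} tokens/s")
```

Feeding the full log file through `summarize` (for example, `summarize(open(path))` with your own path) gives the average over all logged iterations rather than just these two.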
Thank you for reporting this issue. 130 TFLOPS is indeed too low for the H100. I quickly reviewed your script and have some suggestions:
- Update the code to the latest main branch and upgrade grouped_gemm to v1.0.
- Use the alltoall dispatcher: --moe-token-dispatcher-type alltoall.
- Use EP8TP2 (expert parallel size 8, tensor parallel size 2).
- Train for a while (at least 400 steps) before checking performance, or load a pretrained checkpoint. In the early stage the router weights are not yet sufficiently trained, which leads to an imbalanced token distribution.
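The last point can be illustrated with a toy simulation. It is not Megatron's router: it simply assumes each token picks two distinct experts at random with fixed probabilities, and that a step is held up by the busiest expert. The function name and the skewed probabilities are made up for illustration.

```python
import random


def max_over_mean_load(probs, num_tokens=4096, top_k=2, seed=0):
    """Route each token to top_k distinct experts drawn with the given weights.

    Returns (load of the busiest expert) / (average load per expert).
    A value of 1.0 means perfectly balanced routing.
    """
    rng = random.Random(seed)
    experts = list(range(len(probs)))
    load = [0] * len(probs)
    for _ in range(num_tokens):
        chosen = set()
        # Rejection sampling until top_k distinct experts have been picked.
        while len(chosen) < top_k:
            chosen.add(rng.choices(experts, weights=probs)[0])
        for expert in chosen:
            load[expert] += 1
    return max(load) / (sum(load) / len(load))


# Hypothetical routing distributions over 8 experts.
balanced = max_over_mean_load([1 / 8] * 8)
skewed = max_over_mean_load([0.4, 0.2, 0.1, 0.1, 0.05, 0.05, 0.05, 0.05])
print(f"balanced router: busiest expert has {balanced:.2f}x the average load")
print(f"skewed router:   busiest expert has {skewed:.2f}x the average load")
```

Under these assumptions, the more skewed the routing, the more tokens the busiest expert has to process while the others sit idle, which is one plausible reason throughput measured on a freshly initialized router looks worse than on a trained one.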
If expert_parallel_size == num_moe_experts, then num_local_experts is 1 and GroupedMLP is equivalent to SequentialMLP, is that right? Also, as far as I know, the communication overhead of PP is lower than that of TP and EP as long as the bubble time is not too large. Does MoE support PP, and would that make it more efficient?
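For the first part of the question, the arithmetic is simply the number of experts divided evenly across the expert-parallel ranks. The sketch below uses a made-up helper and only shows that division; it says nothing about how GroupedMLP and SequentialMLP behave internally.

```python
def num_local_experts(num_moe_experts: int, expert_parallel_size: int) -> int:
    """Experts hosted on each expert-parallel rank, assuming an even split."""
    if num_moe_experts % expert_parallel_size != 0:
        raise ValueError("num_moe_experts must be divisible by expert_parallel_size")
    return num_moe_experts // expert_parallel_size


# Mixtral 8x7B has 8 experts per MoE layer.
for ep in (1, 2, 4, 8):
    print(f"EP={ep}: {num_local_experts(8, ep)} local expert(s) per rank")
```

With EP=8 each rank holds a single expert, so there is only one expert GEMM per rank left to group.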
Hi, thanks for the suggestions. I retested the throughput following them. To be more specific:
- Updated Megatron-LM to the latest commit (ba77325)
- Updated grouped_gemm to v1.0.0 (fanshiqing/grouped_gemm@7a7f018)
- Set --moe-token-dispatcher-type alltoall
- Switched to EP=8 & TP=2
- Used the pre-trained weights from Mistral AI (converted from the HF checkpoint)
The throughput has indeed increased significantly, reaching around 230 TFLOP/s. However, for H100, it's still pretty low, isn't it? May I ask what a more reasonable throughput would be in theory?
Here are the logs:
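For reference, the parallel layout implied by this configuration can be derived from the world size and the model-parallel sizes. The sketch below uses made-up helper names and the usual relation that data-parallel size is the world size divided by the product of the tensor, pipeline and context parallel sizes; the results agree with the data-parallel sizes and the micro-batch count printed in the logs in this thread.

```python
def data_parallel_size(world_size: int, tp: int, pp: int = 1, cp: int = 1) -> int:
    """Number of data-parallel replicas left after tensor/pipeline/context parallelism."""
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be divisible by tp * pp * cp")
    return world_size // model_parallel


def num_micro_batches(global_batch_size: int, micro_batch_size: int, dp: int) -> int:
    """Micro-batches each replica processes per iteration (gradient accumulation steps)."""
    per_step = micro_batch_size * dp
    if global_batch_size % per_step != 0:
        raise ValueError("global_batch_size must be divisible by micro_batch_size * dp")
    return global_batch_size // per_step


# Values taken from this thread: 16 GPUs, global batch 128, micro batch 1.
dp_tp4 = data_parallel_size(world_size=16, tp=4)  # original run
dp_tp2 = data_parallel_size(world_size=16, tp=2)  # retest with EP=8, TP=2
print(f"TP=4: DP={dp_tp4}, micro-batches={num_micro_batches(128, 1, dp_tp4)}")
print(f"TP=2: DP={dp_tp2}, micro-batches={num_micro_batches(128, 1, dp_tp2)}")
```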
Apologies for the delayed reply. 230 TFLOPS is below our expectations. Currently, we can exceed 330 TFLOPS on the H100, and potentially go higher by switching to EP8TP1 with recomputation.
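To put these numbers in perspective, they can be expressed as a fraction of the hardware peak. The peak value below is an assumption, not something stated in this thread: roughly 989 TFLOP/s is the dense BF16 figure from the H100 SXM datasheet, and other H100 variants have lower peaks.

```python
# Assumption: dense BF16 peak of an H100 SXM GPU, in TFLOP/s (datasheet figure, no sparsity).
H100_BF16_DENSE_PEAK_TFLOPS = 989.0


def utilization(achieved_tflops: float, peak_tflops: float = H100_BF16_DENSE_PEAK_TFLOPS) -> float:
    """Fraction of the assumed peak that the reported throughput corresponds to."""
    return achieved_tflops / peak_tflops


# Throughput figures quoted in this thread (TFLOP/s per GPU).
for label, tflops in [
    ("original run", 130.0),
    ("EP=8, TP=2 retest", 230.0),
    ("maintainers' reference", 330.0),
]:
    print(f"{label}: {tflops:.0f} TFLOP/s/GPU is {utilization(tflops):.1%} of the assumed peak")
```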
Does that mean you can achieve over 330 TFLOPS in the same or similar software environment and settings?
Should I then suspect hardware-related issues, such as network speeds between nodes?
Hi @ShinoharaHare, our env is:
- DGX H100, 64 GPUs.
- PyTorch 24.03 image.
I double-checked your scripts and suggest the following modifications:
- Seq Len: 2048 -> 4096
- Enable DP overlap: --overlap-grad-reduce --overlap-param-gather
Let's see how performance changes after these changes ^ ^.
Hi XLZed, MCore MoE does support PP, but for the Mixtral 8x7B model, we prefer EP and TP.
Does grouped_gemm support variable token lengths to local experts on the same rank?
Yes, we support variable lengths for inputs from each local expert.
@yanring
Enabling --overlap-grad-reduce and --overlap-param-gather results in a CUDA error: uncorrectable ECC error encountered, which seems to be essentially caused by OOM.
I've tried setting the sequence length to 4096 before, but doing so results in a CUDA OOM directly.
I've also tried adding --recompute-activations in both scenarios, but I still get OOM.
Here are the logs from the retest above (EP=8, TP=2):
using world size: 16, data-parallel size: 8, context-parallel size: 1 tensor-model-parallel size: 2, pipeline-model-parallel size: 1
WARNING: overriding default arguments for tokenizer_type:GPT2BPETokenizer with tokenizer_type:Llama2Tokenizer
WARNING: Setting args.overlap_p2p_comm to False since non-interleaved schedule does not support overlapping p2p communication
accumulate and all-reduce gradients in fp32 for bfloat16 data type.
using torch.bfloat16 for parameters ...
------------------------ arguments ------------------------
accumulate_allreduce_grads_in_fp32 .............. True
adam_beta1 ...................................... 0.9
adam_beta2 ...................................... 0.999
adam_eps ........................................ 1e-08
add_bias_linear ................................. False
add_position_embedding .......................... True
add_qkv_bias .................................... False
adlr_autoresume ................................. False
adlr_autoresume_interval ........................ 1000
apply_layernorm_1p .............................. False
apply_query_key_layer_scaling ................... False
apply_residual_connection_post_layernorm ........ False
apply_rope_fusion ............................... True
async_tensor_model_parallel_allreduce ........... False
attention_dropout ............................... 0.0
attention_softmax_in_fp32 ....................... False
auto_detect_ckpt_format ......................... False
barrier_with_L1_time ............................ True
bert_binary_head ................................ True
bert_embedder_type .............................. megatron
bert_load ....................................... None
bf16 ............................................ True
bias_dropout_fusion ............................. True
bias_gelu_fusion ................................ False
bias_swiglu_fusion .............................. True
biencoder_projection_dim ........................ 0
biencoder_shared_query_context_model ............ False
block_data_path ................................. None
check_for_nan_in_loss_and_grad .................. True
ckpt_fully_parallel_save ........................ False
ckpt_step ....................................... None
classes_fraction ................................ 1.0
clip_grad ....................................... 1.0
clone_scatter_output_in_embedding ............... True
consumed_train_samples .......................... 0
consumed_valid_samples .......................... 0
context_parallel_size ........................... 1
create_attention_mask_in_dataloader ............. True
data_cache_path ................................. None
data_parallel_random_init ....................... False
data_parallel_size .............................. 8
data_path ....................................... ['custom/data/wudao/wudao_mistralbpe_content_document']
data_per_class_fraction ......................... 1.0
data_sharding ................................... True
dataloader_type ................................. single
decoder_num_layers .............................. None
decoder_seq_length .............................. None
decoupled_lr .................................... None
decoupled_min_lr ................................ None
delay_grad_reduce ............................... True
delay_param_gather .............................. False
dino_bottleneck_size ............................ 256
dino_freeze_last_layer .......................... 1
dino_head_hidden_size ........................... 2048
dino_local_crops_number ......................... 10
dino_local_img_size ............................. 96
dino_norm_last_layer ............................ False
dino_teacher_temp ............................... 0.07
dino_warmup_teacher_temp ........................ 0.04
dino_warmup_teacher_temp_epochs ................. 30
dist_ckpt_format ................................ torch_dist
distribute_saved_activations .................... False
distributed_backend ............................. nccl
distributed_timeout_minutes ..................... 10
embedding_path .................................. None
empty_unused_memory_level ....................... 0
enable_one_logger ............................... False
encoder_num_layers .............................. 32
encoder_seq_length .............................. 2048
end_weight_decay ................................ 0.1
eod_mask_loss ................................... False
eval_interval ................................... 1000
eval_iters ...................................... 1
evidence_data_path .............................. None
exit_duration_in_mins ........................... None
exit_interval ................................... None
exit_on_missing_checkpoint ...................... False
exit_signal_handler ............................. False
expert_model_parallel_size ...................... 8
ffn_hidden_size ................................. 14336
finetune ........................................ False
fp16 ............................................ False
fp16_lm_cross_entropy ........................... False
fp32_residual_connection ........................ False
fp8 ............................................. None
fp8_amax_compute_algo ........................... most_recent
fp8_amax_history_len ............................ 1
fp8_interval .................................... 1
fp8_margin ...................................... 0
fp8_wgrad ....................................... True
global_batch_size ............................... 128
gradient_accumulation_fusion .................... True
group_query_attention ........................... True
head_lr_mult .................................... 1.0
hidden_dropout .................................. 0.0
hidden_size ..................................... 4096
hysteresis ...................................... 2
ict_head_size ................................... None
ict_load ........................................ None
img_h ........................................... 224
img_w ........................................... 224
indexer_batch_size .............................. 128
indexer_log_interval ............................ 1000
inference_batch_times_seqlen_threshold .......... 512
init_method_std ................................. 0.02
init_method_xavier_uniform ...................... False
initial_loss_scale .............................. 4294967296
iter_per_epoch .................................. 1250
kv_channels ..................................... 128
lazy_mpu_init ................................... None
load ............................................ custom/ckpt/mixtral-8x7b-tp2-ep8-mgg
local_rank ...................................... None
log_batch_size_to_tensorboard ................... False
log_interval .................................... 1
log_learning_rate_to_tensorboard ................ True
log_loss_scale_to_tensorboard ................... True
log_memory_to_tensorboard ....................... False
log_num_zeros_in_grad ........................... False
log_params_norm ................................. False
log_progress .................................... True
log_throughput .................................. True
log_timers_to_tensorboard ....................... False
log_validation_ppl_to_tensorboard ............... False
log_world_size_to_tensorboard ................... False
loss_scale ...................................... None
loss_scale_window ............................... 1000
lr .............................................. 0.0001
lr_decay_iters .................................. 320000
lr_decay_samples ................................ None
lr_decay_style .................................. cosine
lr_warmup_fraction .............................. None
lr_warmup_init .................................. 0.0
lr_warmup_iters ................................. 500
lr_warmup_samples ............................... 0
make_vocab_size_divisible_by .................... 128
manual_gc ....................................... False
manual_gc_eval .................................. True
manual_gc_interval .............................. 0
mask_factor ..................................... 1.0
mask_prob ....................................... 0.15
mask_type ....................................... random
masked_softmax_fusion ........................... False
max_position_embeddings ......................... 32768
max_tokens_to_oom ............................... 12000
merge_file ...................................... None
micro_batch_size ................................ 1
min_loss_scale .................................. 1.0
min_lr .......................................... 1e-05
mmap_bin_files .................................. True
mock_data ....................................... False
moe_aux_loss_coeff .............................. 0.01
moe_grouped_gemm ................................ True
moe_input_jitter_eps ............................ None
moe_per_layer_logging ........................... False
moe_router_load_balancing_type .................. aux_loss
moe_router_topk ................................. 2
moe_token_dispatcher_type ....................... alltoall
moe_token_dropping .............................. False
moe_z_loss_coeff ................................ None
nccl_communicator_config_path ................... None
no_load_optim ................................... True
no_load_rng ..................................... True
no_persist_layer_norm ........................... False
no_save_optim ................................... None
no_save_rng ..................................... None
norm_epsilon .................................... 1e-05
normalization ................................... RMSNorm
num_attention_heads ............................. 32
num_channels .................................... 3
num_classes ..................................... 1000
num_experts ..................................... 8
num_layers ...................................... 32
num_layers_per_virtual_pipeline_stage ........... None
num_query_groups ................................ 8
num_workers ..................................... 2
one_logger_entity ............................... hwinf_dcm
one_logger_project .............................. e2e-tracking
one_logger_run_name ............................. None
onnx_safe ....................................... None
openai_gelu ..................................... False
optimizer ....................................... adam
output_bert_embeddings .......................... False
overlap_grad_reduce ............................. False
overlap_p2p_comm ................................ False
overlap_param_gather ............................ False
override_opt_param_scheduler .................... False
params_dtype .................................... torch.bfloat16
patch_dim ....................................... 16
perform_initialization .......................... True
pipeline_model_parallel_size .................... 1
pipeline_model_parallel_split_rank .............. None
position_embedding_type ......................... rope
pretrained_checkpoint ........................... None
profile ......................................... True
profile_ranks ................................... [0]
profile_step_end ................................ 12
profile_step_start .............................. 10
qk_layernorm .................................... False
query_in_block_prob ............................. 0.1
rampup_batch_size ............................... None
rank ............................................ 0
recompute_granularity ........................... None
recompute_method ................................ None
recompute_num_layers ............................ None
reset_attention_mask ............................ False
reset_position_ids .............................. False
retriever_report_topk_accuracies ................ []
retriever_score_scaling ......................... False
retriever_seq_length ............................ 256
retro_add_retriever ............................. False
retro_attention_gate ............................ 1
retro_cyclic_train_iters ........................ None
retro_encoder_attention_dropout ................. 0.1
retro_encoder_hidden_dropout .................... 0.1
retro_encoder_layers ............................ 2
retro_num_neighbors ............................. 2
retro_num_retrieved_chunks ...................... 2
retro_project_dir ............................... None
retro_verify_neighbor_count ..................... True
rotary_interleaved .............................. False
rotary_percent .................................. 1.0
rotary_seq_len_interpolation_factor ............. None
sample_rate ..................................... 1.0
save ............................................ custom/ckpt/mixtral-8x7b-tp2-ep8-mgg
save_interval ................................... 1000
scatter_gather_tensors_in_pipeline .............. True
seed ............................................ 1234
seq_length ...................................... 2048
sequence_parallel ............................... True
sgd_momentum .................................... 0.9
short_seq_prob .................................. 0.1
skip_train ...................................... False
spec ............................................ None
split ........................................... 99990,8,2
squared_relu .................................... False
standalone_embedding_stage ...................... False
start_weight_decay .............................. 0.1
swiglu .......................................... True
swin_backbone_type .............................. tiny
tensor_model_parallel_size ...................... 2
tensorboard_dir ................................. custom/ckpt/mixtral-8x7b-tp2-ep8-mgg/tensorboard
tensorboard_log_interval ........................ 1
tensorboard_queue_size .......................... 1000
test_data_path .................................. None
test_mode ....................................... False
timing_log_level ................................ 0
timing_log_option ............................... minmax
titles_data_path ................................ None
tokenizer_model ................................. custom/ckpt/mixtral-8x7b/tokenizer.model
tokenizer_type .................................. Llama2Tokenizer
tp_comm_bulk_dgrad .............................. True
tp_comm_bulk_wgrad .............................. True
tp_comm_overlap ................................. False
tp_comm_overlap_ag .............................. True
tp_comm_overlap_cfg ............................. None
tp_comm_overlap_rs .............................. True
tp_comm_split_ag ................................ True
tp_comm_split_rs ................................ True
train_data_path ................................. None
train_iters ..................................... 100
train_samples ................................... None
transformer_impl ................................ transformer_engine
transformer_pipeline_model_parallel_size ........ 1
untie_embeddings_and_output_weights ............. True
use_checkpoint_args ............................. False
use_checkpoint_opt_param_scheduler .............. False
use_cpu_initialization .......................... None
use_dist_ckpt ................................... False
use_distributed_optimizer ....................... True
use_flash_attn .................................. True
use_mcore_models ................................ True
use_one_sent_docs ............................... False
use_ring_exchange_p2p ........................... False
use_rotary_position_embeddings .................. False
valid_data_path ................................. None
variable_seq_lengths ............................ False
virtual_pipeline_model_parallel_size ............ None
vision_backbone_type ............................ vit
vision_pretraining .............................. False
vision_pretraining_type ......................... classify
vocab_extra_ids ................................. 0
vocab_file ...................................... None
vocab_size ...................................... None
wandb_exp_name ..................................
wandb_project ...................................
wandb_save_dir ..................................
weight_decay .................................... 0.1
weight_decay_incr_style ......................... constant
world_size ...................................... 16
yaml_cfg ........................................ None
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 16
> building Llama2Tokenizer tokenizer ...
> padded vocab (size: 32000) with 0 dummy tokens (new size: 32000)
> initializing torch distributed ...
make: Entering directory '.../Megatron-LM/megatron/core/datasets'
make: Nothing to be done for 'default'.
make: Leaving directory '.../Megatron-LM/megatron/core/datasets'
> initialized tensor model parallel with size 2
> initialized pipeline model parallel with size 1
> setting random seeds to 1234 ...
> compiling dataset index builder ...
>>> done with dataset index builder. Compilation time: 0.104 seconds
WARNING: constraints for invoking optimized fused softmax kernel are not met. We default back to unfused kernel invocations.
> compiling and loading fused kernels ...
>>> done with compiling and loading fused kernels.
Compilation time: 7.866 seconds [rank1]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator()) [rank8]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator()) [rank2]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator()) [rank9]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator()) [rank10]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator()) [rank0]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator()) [rank3]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator()) [rank11]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator()) [rank4]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator()) [rank12]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator()) [rank5]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator()) [rank13]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator()) [rank6]:[W init.cpp:767] Warning: nvfuser is no longer 
supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator()) [rank7]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator()) [rank14]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator()) [rank15]:[W init.cpp:767] Warning: nvfuser is no longer supported in torch script, use _jit_set_nvfuser_enabled is deprecated and a no-op (function operator()) time to initialize megatron (seconds): 14.235 [after megatron is initialized] datetime: 2024-04-06 02:54:57 building GPT model ... > number of parameters on (tensor, pipeline) model parallel rank (1, 0): 3622047744 > number of parameters on (tensor, pipeline) model parallel rank (0, 0): 3622047744 INFO:megatron.core.distributed.param_and_grad_buffer:Number of buckets for gradient all-reduce / reduce-scatter: 1 INFO:megatron.core.distributed.param_and_grad_buffer:Params for bucket 1 (803475456 elements): INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.29.self_attention.linear_proj.weight INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.23.self_attention.linear_proj.weight INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.10.self_attention.linear_qkv.layer_norm_weight INFO:megatron.core.distributed.param_and_grad_buffer: module.embedding.word_embeddings.weight INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.final_layernorm.weight INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.28.pre_mlp_layernorm.weight INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.27.self_attention.linear_qkv.layer_norm_weight INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.21.self_attention.linear_qkv.layer_norm_weight 
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.18.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.2.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.2.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.14.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.5.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.30.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.28.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.14.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.8.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.3.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.31.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.25.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.24.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.19.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.17.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.12.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.3.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.2.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.output_layer.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.17.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.8.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.31.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.25.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.22.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.19.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.16.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.15.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.11.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.10.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.0.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.27.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.21.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.15.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.10.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.2.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.27.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.22.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.21.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.18.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.0.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.30.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.20.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.13.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.4.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.30.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.24.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.7.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.29.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.28.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.22.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.16.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.13.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.12.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.6.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.4.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.0.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.31.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.16.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.15.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.12.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.9.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.8.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.6.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.6.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.18.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.26.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.25.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.12.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.4.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.1.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.1.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.19.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.18.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.12.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.9.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.5.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.4.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.3.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.1.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.28.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.26.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.23.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.20.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.7.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.29.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.23.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.19.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.11.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.7.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.5.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.31.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.19.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.14.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.11.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.8.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.3.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.31.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.29.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.25.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.25.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.24.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.22.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.17.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.16.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.14.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.28.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.22.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.11.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.9.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.26.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.24.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.20.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.17.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.11.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.10.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.27.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.21.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.20.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.15.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.13.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.13.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.0.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.24.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.23.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.13.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.10.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.5.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.4.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.0.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.30.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.29.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.27.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.21.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.15.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.6.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.16.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.6.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.30.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.14.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.9.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.3.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.1.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.26.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.9.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.17.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.26.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.20.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.8.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.7.self_attention.linear_qkv.layer_norm_weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.5.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.2.self_attention.linear_proj.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.23.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.18.pre_mlp_layernorm.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.7.mlp.router.weight
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.1.self_attention.linear_qkv.weight
INFO:megatron.core.distributed.param_and_grad_buffer:Number of buckets for gradient all-reduce / reduce-scatter: 1
INFO:megatron.core.distributed.param_and_grad_buffer:Params for bucket 1 (2818572288 elements):
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.26.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.13.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.12.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.3.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.20.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.11.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.30.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.24.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.18.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.17.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.12.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.4.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.1.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.29.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.23.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.16.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.6.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.5.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.27.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.21.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.15.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.9.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.26.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.20.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.14.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.8.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.1.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.31.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.30.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.25.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.19.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.18.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.0.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.29.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.23.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.22.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.7.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.5.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.0.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.28.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.15.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.3.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.10.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.27.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.21.mlp.experts.weight1
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.14.mlp.experts.weight2
INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.13.mlp.experts.weight1 INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.8.mlp.experts.weight2 INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.2.mlp.experts.weight1 INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.31.mlp.experts.weight2 INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.25.mlp.experts.weight2 INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.19.mlp.experts.weight2 INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.24.mlp.experts.weight1 INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.22.mlp.experts.weight2 INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.17.mlp.experts.weight1 INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.11.mlp.experts.weight2 INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.7.mlp.experts.weight2 INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.4.mlp.experts.weight1 INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.28.mlp.experts.weight2 INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.16.mlp.experts.weight1 INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.6.mlp.experts.weight1 INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.2.mlp.experts.weight2 INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.10.mlp.experts.weight2 INFO:megatron.core.distributed.param_and_grad_buffer: module.decoder.layers.9.mlp.experts.weight1 INFO:megatron.core.optimizer:Setting up optimizer with OptimizerConfig(optimizer='adam', lr=0.0001, min_lr=1e-05, decoupled_lr=None, decoupled_min_lr=None, weight_decay=0.1, fp16=False, bf16=True, 
params_dtype=torch.bfloat16, loss_scale=None, initial_loss_scale=4294967296, min_loss_scale=1.0, loss_scale_window=1000, hysteresis=2, adam_beta1=0.9, adam_beta2=0.999, adam_eps=1e-08, sgd_momentum=0.9, use_distributed_optimizer=True, overlap_grad_reduce=False, overlap_param_gather=False, clip_grad=1.0, log_num_zeros_in_grad=False, barrier_with_L1_time=True, timers=<megatron.core.timers.Timers object at 0x2b366837d3f0>) > learning rate decay style: cosine loading release checkpoint from custom/ckpt/mixtral-8x7b-tp2-ep8-mgg could not find arguments in the checkpoint ... checkpoint version 0 succesfully fixed query-key-values ordering for checkpoint version 0 successfully loaded checkpoint from custom/ckpt/mixtral-8x7b-tp2-ep8-mgg [ t 0, p 0 ] at iteration 0 > setting tensorboard ... (min, max) time across ranks (ms): load-checkpoint ................................: (8126.15, 8126.65) [after model, optimizer, and learning rate scheduler are built] datetime: 2024-04-06 02:55:06 > building train, validation, and test datasets ... > datasets target sizes (minimum size): train: 12800 validation: 128 test: 128 INFO:megatron.core.datasets.blended_megatron_dataset_config:mock = False INFO:megatron.core.datasets.blended_megatron_dataset_config:Let split_matrix = [(0, 0.9999), (0.9999, 0.99998), (0.99998, 1.0)] > building train, validation, and test datasets for GPT ... 
INFO:megatron.core.datasets.indexed_dataset:Load the _IndexReader from custom/data/wudao/wudao_mistralbpe_content_document.idx
INFO:megatron.core.datasets.indexed_dataset: Extract the sequence lengths
INFO:megatron.core.datasets.indexed_dataset: Extract the sequence pointers
INFO:megatron.core.datasets.indexed_dataset: Extract the document indices
INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 59132211
INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 59132211
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset train indices
INFO:megatron.core.datasets.gpt_dataset: Load the document index from cc3235b81bd7fd0fa07cabe05d15043d-GPTDataset-document_index.npy
INFO:megatron.core.datasets.gpt_dataset: Load the sample index from cc3235b81bd7fd0fa07cabe05d15043d-GPTDataset-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset: Load the shuffle index from cc3235b81bd7fd0fa07cabe05d15043d-GPTDataset-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 40201537
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset valid indices
INFO:megatron.core.datasets.gpt_dataset: Load the document index from a625518736b8143e22f4f34c6682183e-GPTDataset-document_index.npy
INFO:megatron.core.datasets.gpt_dataset: Load the sample index from a625518736b8143e22f4f34c6682183e-GPTDataset-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset: Load the shuffle index from a625518736b8143e22f4f34c6682183e-GPTDataset-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 6204
INFO:megatron.core.datasets.gpt_dataset:Load the GPTDataset test indices
INFO:megatron.core.datasets.gpt_dataset: Load the document index from 052434ed70ae721ed70b2219cf2deb88-GPTDataset-document_index.npy
INFO:megatron.core.datasets.gpt_dataset: Load the sample index from 052434ed70ae721ed70b2219cf2deb88-GPTDataset-sample_index.npy
INFO:megatron.core.datasets.gpt_dataset: Load the shuffle index from 052434ed70ae721ed70b2219cf2deb88-GPTDataset-shuffle_index.npy
INFO:megatron.core.datasets.gpt_dataset:> total number of samples: 2332
> finished creating GPT datasets ...
[after dataloaders are built] datetime: 2024-04-06 02:55:07
done with setup ...
(min, max) time across ranks (ms):
    model-and-optimizer-setup ......................: (8592.94, 8605.02)
    train/valid/test-data-iterators-setup ..........: (569.02, 865.21)
training ...
[before the start of training step] datetime: 2024-04-06 02:55:07
Number of parameters in transformer layers in billions: 46.44
Number of parameters in embedding layers in billions: 0.26
Total number of parameters in billions: 46.70
Number of parameters in most loaded shard in billions: 23.3510
Theoretical memory footprints: weight and optimizer=167019.40 MB
[Rank 0] (after 1 iterations) memory (MB) | allocated: 54250.97802734375 | max allocated: 54250.98583984375 | reserved: 61470.0 | max reserved: 61470.0
[2024-04-06 02:55:39] iteration 1/ 100 | consumed samples: 128 | elapsed time per iteration (ms): 32269.4 | throughput per GPU (TFLOP/s/GPU): 40.5 | learning rate: 2.000000E-07 | global batch size: 128 | lm loss: 1.985617E+00 | load_balancing_loss: 1.089786E+00 | loss scale: 1.0 | grad norm: 6.396 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[Rank 1] (after 1 iterations) memory (MB) | allocated: 54250.97802734375 | max allocated: 54250.98583984375 | reserved: 61480.0 | max reserved: 61480.0
[2024-04-06 02:55:45] iteration 2/ 100 | consumed samples: 256 | elapsed time per iteration (ms): 5630.1 | throughput per GPU (TFLOP/s/GPU): 231.9 | learning rate: 4.000000E-07 | global batch size: 128 | lm loss: 2.021530E+00 | load_balancing_loss: 1.087362E+00 | loss scale: 1.0 | grad norm: 6.895 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:55:50] iteration 3/ 100 | consumed samples: 384 | elapsed time per iteration (ms): 5410.6 | throughput per GPU
(TFLOP/s/GPU): 241.4 | learning rate: 6.000000E-07 | global batch size: 128 | lm loss: 2.003316E+00 | load_balancing_loss: 1.085377E+00 | loss scale: 1.0 | grad norm: 6.603 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:55:55] iteration 4/ 100 | consumed samples: 512 | elapsed time per iteration (ms): 5364.1 | throughput per GPU (TFLOP/s/GPU): 243.5 | learning rate: 8.000000E-07 | global batch size: 128 | lm loss: 2.009657E+00 | load_balancing_loss: 1.091695E+00 | loss scale: 1.0 | grad norm: 6.619 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:56:01] iteration 5/ 100 | consumed samples: 640 | elapsed time per iteration (ms): 5496.7 | throughput per GPU (TFLOP/s/GPU): 237.6 | learning rate: 1.000000E-06 | global batch size: 128 | lm loss: 2.002326E+00 | load_balancing_loss: 1.091539E+00 | loss scale: 1.0 | grad norm: 6.612 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:56:06] iteration 6/ 100 | consumed samples: 768 | elapsed time per iteration (ms): 5364.8 | throughput per GPU (TFLOP/s/GPU): 243.4 | learning rate: 1.200000E-06 | global batch size: 128 | lm loss: 1.933151E+00 | load_balancing_loss: 1.086472E+00 | loss scale: 1.0 | grad norm: 5.765 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:56:12] iteration 7/ 100 | consumed samples: 896 | elapsed time per iteration (ms): 5682.7 | throughput per GPU (TFLOP/s/GPU): 229.8 | learning rate: 1.400000E-06 | global batch size: 128 | lm loss: 2.016085E+00 | load_balancing_loss: 1.085193E+00 | loss scale: 1.0 | grad norm: 5.821 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:56:17] iteration 8/ 100 | consumed samples: 1024 | elapsed time per iteration (ms): 5408.6 | throughput per GPU (TFLOP/s/GPU): 241.4 | learning rate: 1.600000E-06 | global batch size: 128 | lm loss: 1.965713E+00 | load_balancing_loss: 1.080933E+00 | loss scale: 1.0 | grad norm: 4.774 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:56:23] iteration 9/ 100 | consumed samples: 1152 | elapsed time per iteration (ms): 5590.1 | throughput per GPU (TFLOP/s/GPU): 233.6 | learning rate: 1.800000E-06 | global batch size: 128 | lm loss: 1.919308E+00 | load_balancing_loss: 1.089582E+00 | loss scale: 1.0 | grad norm: 4.267 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:56:28] iteration 10/ 100 | consumed samples: 1280 | elapsed time per iteration (ms): 5443.7 | throughput per GPU (TFLOP/s/GPU): 239.9 | learning rate: 2.000000E-06 | global batch size: 128 | lm loss: 1.978377E+00 | load_balancing_loss: 1.089948E+00 | loss scale: 1.0 | grad norm: 4.069 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:56:34] iteration 11/ 100 | consumed samples: 1408 | elapsed time per iteration (ms): 5984.1 | throughput per GPU (TFLOP/s/GPU): 218.2 | learning rate: 2.200000E-06 | global batch size: 128 | lm loss: 1.889895E+00 | load_balancing_loss: 1.083618E+00 | loss scale: 1.0 | grad norm: 3.361 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:56:40] iteration 12/ 100 | consumed samples: 1536 | elapsed time per iteration (ms): 5821.8 | throughput per GPU (TFLOP/s/GPU): 224.3 | learning rate: 2.400000E-06 | global batch size: 128 | lm loss: 1.932808E+00 | load_balancing_loss: 1.085315E+00 | loss scale: 1.0 | grad norm: 3.336 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:56:46] iteration 13/ 100 | consumed samples: 1664 | elapsed time per iteration (ms): 5962.2 | throughput per GPU (TFLOP/s/GPU): 219.0 | learning rate: 2.600000E-06 | global batch size: 128 | lm loss: 1.911683E+00 | load_balancing_loss: 1.079515E+00 | loss scale: 1.0 | grad norm: 3.183 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:56:52] iteration 14/ 100 | consumed samples: 1792 | elapsed time per iteration (ms): 5927.4 | throughput per GPU (TFLOP/s/GPU): 220.3 | learning rate: 2.800000E-06 | global batch size: 128 | lm loss: 1.913695E+00 | load_balancing_loss: 1.076165E+00 | loss scale: 1.0 | grad norm: 2.994 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:56:58] iteration 15/ 100 | consumed samples: 1920 | elapsed time per iteration (ms): 5926.4 | throughput per GPU (TFLOP/s/GPU): 220.4 | learning rate: 3.000000E-06 | global batch size: 128 | lm loss: 1.957101E+00 | load_balancing_loss: 1.069903E+00 | loss scale: 1.0 | grad norm: 2.853 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:57:04] iteration 16/ 100 | consumed samples: 2048 | elapsed time per iteration (ms): 5912.7 | throughput per GPU (TFLOP/s/GPU): 220.9 | learning rate: 3.200000E-06 | global batch size: 128 | lm loss: 1.915763E+00 | load_balancing_loss: 1.065748E+00 | loss scale: 1.0 | grad norm: 2.778 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:57:10] iteration 17/ 100 | consumed samples: 2176 | elapsed time per iteration (ms): 5706.3 | throughput per GPU (TFLOP/s/GPU): 228.9 | learning rate: 3.400000E-06 | global batch size: 128 | lm loss: 1.918353E+00 | load_balancing_loss: 1.064678E+00 | loss scale: 1.0 | grad norm: 2.911 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:57:15] iteration 18/ 100 | consumed samples: 2304 | elapsed time per iteration (ms): 5732.8 | throughput per GPU (TFLOP/s/GPU): 227.8 | learning rate: 3.600000E-06 | global batch size: 128 | lm loss: 1.861051E+00 | load_balancing_loss: 1.058054E+00 | loss scale: 1.0 | grad norm: 2.449 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:57:21] iteration 19/ 100 | consumed samples: 2432 | elapsed time per iteration (ms): 5684.9 | throughput per GPU (TFLOP/s/GPU): 229.7 | learning rate: 3.800000E-06 | global batch size: 128 | lm loss: 1.934895E+00 | load_balancing_loss: 1.049081E+00 | loss scale: 1.0 | grad norm: 2.447 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:57:27] iteration 20/ 100 | consumed samples: 2560 | elapsed time per iteration (ms): 5770.6 | throughput per GPU (TFLOP/s/GPU): 226.3 | learning rate: 4.000000E-06 | global batch size: 128 | lm loss: 1.932632E+00 | load_balancing_loss: 1.052491E+00 | loss scale: 1.0 | grad norm: 2.456 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:57:32] iteration 21/ 100 | consumed samples: 2688 | elapsed time per iteration (ms): 5541.8 | throughput per GPU (TFLOP/s/GPU): 235.6 | learning rate: 4.200000E-06 | global batch size: 128 | lm loss: 1.904877E+00 | load_balancing_loss: 1.047207E+00 | loss scale: 1.0 | grad norm: 2.213 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:57:38] iteration 22/ 100 | consumed samples: 2816 | elapsed time per iteration (ms): 5576.7 | throughput per GPU (TFLOP/s/GPU): 234.2 | learning rate: 4.400000E-06 | global batch size: 128 | lm loss: 1.872380E+00 | load_balancing_loss: 1.039512E+00 | loss scale: 1.0 | grad norm: 2.116 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:57:44] iteration 23/ 100 | consumed samples: 2944 | elapsed time per iteration (ms): 5807.4 | throughput per GPU (TFLOP/s/GPU): 224.9 | learning rate: 4.600000E-06 | global batch size: 128 | lm loss: 1.835408E+00 | load_balancing_loss: 1.042104E+00 | loss scale: 1.0 | grad norm: 2.034 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:57:50] iteration 24/ 100 | consumed samples:
3072 | elapsed time per iteration (ms): 5727.3 | throughput per GPU (TFLOP/s/GPU): 228.0 | learning rate: 4.800000E-06 | global batch size: 128 | lm loss: 1.898657E+00 | load_balancing_loss: 1.029742E+00 | loss scale: 1.0 | grad norm: 1.982 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:57:55] iteration 25/ 100 | consumed samples: 3200 | elapsed time per iteration (ms): 5498.4 | throughput per GPU (TFLOP/s/GPU): 237.5 | learning rate: 5.000000E-06 | global batch size: 128 | lm loss: 1.904866E+00 | load_balancing_loss: 1.034888E+00 | loss scale: 1.0 | grad norm: 1.872 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:58:01] iteration 26/ 100 | consumed samples: 3328 | elapsed time per iteration (ms): 5531.7 | throughput per GPU (TFLOP/s/GPU): 236.1 | learning rate: 5.200000E-06 | global batch size: 128 | lm loss: 1.889752E+00 | load_balancing_loss: 1.028931E+00 | loss scale: 1.0 | grad norm: 1.793 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:58:06] iteration 27/ 100 | consumed samples: 3456 | elapsed time per iteration (ms): 5678.3 | throughput per GPU (TFLOP/s/GPU): 230.0 | learning rate: 5.400000E-06 | global batch size: 128 | lm loss: 1.866109E+00 | load_balancing_loss: 1.031736E+00 | loss scale: 1.0 | grad norm: 1.773 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:58:12] iteration 28/ 100 | consumed samples: 3584 | elapsed time per iteration (ms): 5650.6 | throughput per GPU (TFLOP/s/GPU): 231.1 | learning rate: 5.600000E-06 | global batch size: 128 | lm loss: 1.914117E+00 | load_balancing_loss: 1.027364E+00 | loss scale: 1.0 | grad norm: 1.709 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:58:18] iteration 29/ 100 | consumed samples: 3712 | elapsed time per iteration (ms): 5912.1 | throughput per GPU (TFLOP/s/GPU): 220.9 | learning rate: 5.800000E-06 | global batch size: 128 | lm loss: 1.867856E+00 | load_balancing_loss: 1.023825E+00 | loss scale: 1.0 | grad norm: 1.769 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:58:23] iteration 30/ 100 | consumed samples: 3840 | elapsed time per iteration (ms): 5571.1 | throughput per GPU (TFLOP/s/GPU): 234.4 | learning rate: 6.000000E-06 | global batch size: 128 | lm loss: 1.924535E+00 | load_balancing_loss: 1.025294E+00 | loss scale: 1.0 | grad norm: 1.572 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:58:29] iteration 31/ 100 | consumed samples: 3968 | elapsed time per iteration (ms): 5718.9 | throughput per GPU (TFLOP/s/GPU): 228.3 | learning rate: 6.200000E-06 | global batch size: 128 | lm loss: 1.830754E+00 | load_balancing_loss: 1.028048E+00 | loss scale: 1.0 | grad norm: 1.555 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:58:35] iteration 32/ 100 | consumed samples: 4096 | elapsed time per iteration (ms): 5629.3 | throughput per GPU (TFLOP/s/GPU): 232.0 | learning rate: 6.400000E-06 | global batch size: 128 | lm loss: 1.848776E+00 | load_balancing_loss: 1.021549E+00 | loss scale: 1.0 | grad norm: 1.592 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:58:40] iteration 33/ 100 | consumed samples: 4224 | elapsed time per iteration (ms): 5600.4 | throughput per GPU (TFLOP/s/GPU): 233.2 | learning rate: 6.600000E-06 | global batch size: 128 | lm loss: 1.917658E+00 | load_balancing_loss: 1.032319E+00 | loss scale: 1.0 | grad norm: 1.519 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:58:46] iteration 34/ 100 | consumed samples: 4352 | elapsed time per iteration (ms): 5643.8 | throughput per GPU (TFLOP/s/GPU): 231.4 | learning rate: 6.800000E-06 | global batch size: 128 | lm loss: 1.844636E+00 | load_balancing_loss: 1.019185E+00 | loss scale: 1.0 | grad norm: 1.626 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:58:51] iteration 35/ 100 | consumed samples: 4480 | elapsed time per iteration (ms): 5367.8 | throughput per GPU (TFLOP/s/GPU): 243.3 | learning rate: 7.000000E-06 | global batch size: 128 | lm loss: 1.853418E+00 | load_balancing_loss: 1.020990E+00 | loss scale: 1.0 | grad norm: 1.760 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:58:57] iteration 36/ 100 | consumed samples: 4608 | elapsed time per iteration (ms): 5399.9 | throughput per GPU (TFLOP/s/GPU): 241.8 | learning rate: 7.200000E-06 | global batch size: 128 | lm loss: 1.842918E+00 | load_balancing_loss: 1.023077E+00 | loss scale: 1.0 | grad norm: 1.409 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:59:02] iteration 37/ 100 | consumed samples: 4736 | elapsed time per iteration (ms): 5515.8 | throughput per GPU (TFLOP/s/GPU): 236.8 | learning rate: 7.400000E-06 | global batch size: 128 | lm loss: 1.862270E+00 | load_balancing_loss: 1.023782E+00 | loss scale: 1.0 | grad norm: 1.718 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:59:08] iteration 38/ 100 | consumed samples: 4864 | elapsed time per iteration (ms): 5477.8 | throughput per GPU (TFLOP/s/GPU): 238.4 | learning rate: 7.600000E-06 | global batch size: 128 | lm loss: 1.862543E+00 | load_balancing_loss: 1.019304E+00 | loss scale: 1.0 | grad norm: 1.722 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:59:13] iteration 39/ 100 | consumed samples: 4992 | elapsed time per iteration (ms): 5649.1 | throughput per GPU (TFLOP/s/GPU): 231.2 | learning rate: 7.800000E-06 | global batch size: 128 | lm loss: 1.863421E+00 | load_balancing_loss: 1.017805E+00 | loss scale: 1.0 | grad norm: 1.469 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:59:19] iteration 40/ 100 | consumed samples: 5120 | elapsed time per iteration (ms): 5810.4 | throughput per GPU (TFLOP/s/GPU): 224.8 | learning rate: 8.000000E-06 | global batch size: 128 | lm loss: 1.879655E+00 | load_balancing_loss: 1.017568E+00 | loss scale: 1.0 | grad norm: 1.633 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:59:25] iteration 41/ 100 | consumed samples: 5248 | elapsed time per iteration (ms): 5462.9 | throughput per GPU (TFLOP/s/GPU): 239.1 | learning rate: 8.200000E-06 | global batch size: 128 | lm loss: 1.812076E+00 | load_balancing_loss: 1.020508E+00 | loss scale: 1.0 | grad norm: 1.419 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:59:30] iteration 42/ 100 | consumed samples: 5376 | elapsed time per iteration (ms): 5452.3 | throughput per GPU (TFLOP/s/GPU): 239.5 | learning rate: 8.400000E-06 | global batch size: 128 | lm loss: 1.824542E+00 | load_balancing_loss: 1.017472E+00 | loss scale: 1.0 | grad norm: 1.400 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:59:36] iteration 43/ 100 | consumed samples: 5504 | elapsed time per iteration (ms): 5444.9 | throughput per GPU (TFLOP/s/GPU): 239.8 | learning rate: 8.600000E-06 | global batch size: 128 | lm loss: 1.825991E+00 | load_balancing_loss: 1.019746E+00 | loss scale: 1.0 | grad norm: 1.426 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:59:41] iteration 44/ 100 | consumed samples: 5632 | elapsed time per iteration (ms): 5533.8 | throughput per GPU (TFLOP/s/GPU): 236.0 | learning rate: 8.800000E-06 | global batch size: 128 | lm loss: 1.875063E+00 | load_balancing_loss: 1.020033E+00 | loss scale: 1.0 | grad norm: 1.327 | num zeros: 0.0 | number of skipped iterations: 0 | number of
nan iterations: 0 |
[2024-04-06 02:59:47] iteration 45/ 100 | consumed samples: 5760 | elapsed time per iteration (ms): 5718.6 | throughput per GPU (TFLOP/s/GPU): 228.4 | learning rate: 9.000000E-06 | global batch size: 128 | lm loss: 1.834162E+00 | load_balancing_loss: 1.018004E+00 | loss scale: 1.0 | grad norm: 1.611 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:59:52] iteration 46/ 100 | consumed samples: 5888 | elapsed time per iteration (ms): 5567.2 | throughput per GPU (TFLOP/s/GPU): 234.6 | learning rate: 9.200000E-06 | global batch size: 128 | lm loss: 1.883577E+00 | load_balancing_loss: 1.016062E+00 | loss scale: 1.0 | grad norm: 1.439 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 02:59:58] iteration 47/ 100 | consumed samples: 6016 | elapsed time per iteration (ms): 5692.2 | throughput per GPU (TFLOP/s/GPU): 229.4 | learning rate: 9.400000E-06 | global batch size: 128 | lm loss: 1.836727E+00 | load_balancing_loss: 1.019520E+00 | loss scale: 1.0 | grad norm: 1.372 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:00:04] iteration 48/ 100 | consumed samples: 6144 | elapsed time per iteration (ms): 5872.4 | throughput per GPU (TFLOP/s/GPU): 222.4 | learning rate: 9.600000E-06 | global batch size: 128 | lm loss: 1.855191E+00 | load_balancing_loss: 1.017754E+00 | loss scale: 1.0 | grad norm: 1.508 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:00:09] iteration 49/ 100 | consumed samples: 6272 | elapsed time per iteration (ms): 5528.7 | throughput per GPU (TFLOP/s/GPU): 236.2 | learning rate: 9.800000E-06 | global batch size: 128 | lm loss: 1.806294E+00 | load_balancing_loss: 1.017504E+00 | loss scale: 1.0 | grad norm: 1.529 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:00:15] iteration 50/ 100 | consumed samples: 6400 | elapsed time per iteration (ms): 5531.5 | throughput per GPU (TFLOP/s/GPU): 236.1 | learning rate: 1.000000E-05 | global batch size: 128 | lm loss: 1.887587E+00 | load_balancing_loss: 1.016094E+00 | loss scale: 1.0 | grad norm: 1.439 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:00:21] iteration 51/ 100 | consumed samples: 6528 | elapsed time per iteration (ms): 5501.3 | throughput per GPU (TFLOP/s/GPU): 237.4 | learning rate: 1.020000E-05 | global batch size: 128 | lm loss: 1.834414E+00 | load_balancing_loss: 1.015084E+00 | loss scale: 1.0 | grad norm: 1.599 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:00:26] iteration 52/ 100 | consumed samples: 6656 | elapsed time per iteration (ms): 5520.9 | throughput per GPU (TFLOP/s/GPU): 236.5 | learning rate: 1.040000E-05 | global batch size: 128 | lm loss: 1.847078E+00 | load_balancing_loss: 1.015950E+00 | loss scale: 1.0 | grad norm: 1.486 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:00:32] iteration 53/ 100 | consumed samples: 6784 | elapsed time per iteration (ms): 5711.6 | throughput per GPU (TFLOP/s/GPU): 228.6 | learning rate: 1.060000E-05 | global batch size: 128 | lm loss: 1.862840E+00 | load_balancing_loss: 1.016317E+00 | loss scale: 1.0 | grad norm: 1.522 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:00:37] iteration 54/ 100 | consumed samples: 6912 | elapsed time per iteration (ms): 5689.4 | throughput per GPU (TFLOP/s/GPU): 229.5 | learning rate: 1.080000E-05 | global batch size: 128 | lm loss: 1.897956E+00 | load_balancing_loss: 1.017408E+00 | loss scale: 1.0 | grad norm: 1.383 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:00:43] iteration 55/ 100 | consumed samples: 7040 | elapsed time per iteration (ms): 5763.8 | throughput per GPU (TFLOP/s/GPU): 226.6 | learning rate: 1.100000E-05 | global batch size: 128 | lm loss: 1.863309E+00 | load_balancing_loss: 1.014457E+00 | loss scale: 1.0 | grad norm: 1.534 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:00:49] iteration 56/ 100 | consumed samples: 7168 | elapsed time per iteration (ms): 5742.1 | throughput per GPU (TFLOP/s/GPU): 227.4 | learning rate: 1.120000E-05 | global batch size: 128 | lm loss: 1.899538E+00 | load_balancing_loss: 1.018558E+00 | loss scale: 1.0 | grad norm: 1.470 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:00:54] iteration 57/ 100 | consumed samples: 7296 | elapsed time per iteration (ms): 5450.5 | throughput per GPU (TFLOP/s/GPU): 239.6 | learning rate: 1.140000E-05 | global batch size: 128 | lm loss: 1.864605E+00 | load_balancing_loss: 1.015150E+00 | loss scale: 1.0 | grad norm: 1.244 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:01:00] iteration 58/ 100 | consumed samples: 7424 | elapsed time per iteration (ms): 5538.9 | throughput per GPU (TFLOP/s/GPU): 235.8 | learning rate: 1.160000E-05 | global batch size: 128 | lm loss: 1.812579E+00 | load_balancing_loss: 1.020851E+00 | loss scale: 1.0 | grad norm: 1.610 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:01:05] iteration 59/ 100 | consumed samples: 7552 | elapsed time per iteration (ms): 5410.9 | throughput per GPU (TFLOP/s/GPU): 241.3 | learning rate: 1.180000E-05 | global batch size: 128 | lm loss: 1.848337E+00 | load_balancing_loss: 1.013638E+00 | loss scale: 1.0 | grad norm: 1.351 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:01:11] iteration 60/ 100 | consumed samples: 7680 | elapsed time per iteration (ms): 5603.1 | throughput per GPU (TFLOP/s/GPU): 233.1 | learning rate: 1.200000E-05 | global batch size: 128 | lm loss: 1.801180E+00 | load_balancing_loss: 1.019084E+00 | loss scale: 1.0 | grad norm: 1.549 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:01:16] iteration 61/ 100 | consumed samples: 7808 | elapsed time per iteration (ms): 5495.5 | throughput per GPU (TFLOP/s/GPU): 237.6 | learning rate: 1.220000E-05 | global batch size: 128 | lm loss: 1.813972E+00 | load_balancing_loss: 1.014779E+00 | loss scale: 1.0 | grad norm: 1.427 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:01:22] iteration 62/ 100 | consumed samples: 7936 | elapsed time per iteration (ms): 5753.1 | throughput per GPU (TFLOP/s/GPU): 227.0 | learning rate: 1.240000E-05 | global batch size: 128 | lm loss: 1.808689E+00 | load_balancing_loss: 1.022012E+00 | loss scale: 1.0 | grad norm: 1.398 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:01:28] iteration 63/ 100 | consumed samples: 8064 | elapsed time per iteration (ms): 5650.1 | throughput per GPU (TFLOP/s/GPU): 231.1 | learning rate: 1.260000E-05 | global batch size: 128 | lm loss: 1.781526E+00 | load_balancing_loss: 1.013716E+00 | loss scale: 1.0 | grad norm: 1.494 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:01:33] iteration 64/ 100 | consumed samples: 8192 | elapsed time per iteration (ms): 5539.7 | throughput per GPU (TFLOP/s/GPU): 235.7 | learning rate: 1.280000E-05 | global batch size: 128 | lm loss: 1.871476E+00 | load_balancing_loss: 1.019044E+00 | loss scale: 1.0 | grad norm: 1.369 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:01:39] iteration 65/ 100 | consumed samples: 8320 | elapsed time per iteration (ms): 5493.9 | throughput per GPU (TFLOP/s/GPU): 237.7 | learning rate: 1.300000E-05 | global batch size: 128 | lm loss: 1.846450E+00 | load_balancing_loss: 1.017387E+00 | loss scale: 1.0 |
grad norm: 1.308 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:01:44] iteration 66/ 100 | consumed samples: 8448 | elapsed time per iteration (ms): 5590.8 | throughput per GPU (TFLOP/s/GPU): 233.6 | learning rate: 1.320000E-05 | global batch size: 128 | lm loss: 1.873755E+00 | load_balancing_loss: 1.014257E+00 | loss scale: 1.0 | grad norm: 1.411 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:01:50] iteration 67/ 100 | consumed samples: 8576 | elapsed time per iteration (ms): 5710.3 | throughput per GPU (TFLOP/s/GPU): 228.7 | learning rate: 1.340000E-05 | global batch size: 128 | lm loss: 1.765591E+00 | load_balancing_loss: 1.016482E+00 | loss scale: 1.0 | grad norm: 1.414 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:01:56] iteration 68/ 100 | consumed samples: 8704 | elapsed time per iteration (ms): 5734.5 | throughput per GPU (TFLOP/s/GPU): 227.7 | learning rate: 1.360000E-05 | global batch size: 128 | lm loss: 1.839895E+00 | load_balancing_loss: 1.012786E+00 | loss scale: 1.0 | grad norm: 1.371 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:02:01] iteration 69/ 100 | consumed samples: 8832 | elapsed time per iteration (ms): 5478.6 | throughput per GPU (TFLOP/s/GPU): 238.4 | learning rate: 1.380000E-05 | global batch size: 128 | lm loss: 1.912256E+00 | load_balancing_loss: 1.013041E+00 | loss scale: 1.0 | grad norm: 1.485 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:02:07] iteration 70/ 100 | consumed samples: 8960 | elapsed time per iteration (ms): 5514.8 | throughput per GPU (TFLOP/s/GPU): 236.8 | learning rate: 1.400000E-05 | global batch size: 128 | lm loss: 1.873068E+00 | load_balancing_loss: 1.012509E+00 | loss scale: 1.0 | grad norm: 1.467 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:02:12] iteration 71/ 100 | consumed samples: 9088 | elapsed time per iteration (ms): 5361.6 | throughput per GPU (TFLOP/s/GPU): 243.6 | learning rate: 1.420000E-05 | global batch size: 128 | lm loss: 1.818812E+00 | load_balancing_loss: 1.013377E+00 | loss scale: 1.0 | grad norm: 1.300 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:02:18] iteration 72/ 100 | consumed samples: 9216 | elapsed time per iteration (ms): 5470.7 | throughput per GPU (TFLOP/s/GPU): 238.7 | learning rate: 1.440000E-05 | global batch size: 128 | lm loss: 1.820313E+00 | load_balancing_loss: 1.019612E+00 | loss scale: 1.0 | grad norm: 1.305 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:02:24] iteration 73/ 100 | consumed samples: 9344 | elapsed time per iteration (ms): 5829.9 | throughput per GPU (TFLOP/s/GPU): 224.0 | learning rate: 1.460000E-05 | global batch size: 128 | lm loss: 1.798953E+00 | load_balancing_loss: 1.010977E+00 | loss scale: 1.0 | grad norm: 1.539 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:02:29] iteration 74/ 100 | consumed samples: 9472 | elapsed time per iteration (ms): 5702.4 | throughput per GPU (TFLOP/s/GPU): 229.0 | learning rate: 1.480000E-05 | global batch size: 128 | lm loss: 1.774078E+00 | load_balancing_loss: 1.012441E+00 | loss scale: 1.0 | grad norm: 1.471 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:02:35] iteration 75/ 100 | consumed samples: 9600 | elapsed time per iteration (ms): 5599.5 | throughput per GPU (TFLOP/s/GPU): 233.2 | learning rate: 1.500000E-05 | global batch size: 128 | lm loss: 1.838492E+00 | load_balancing_loss: 1.015038E+00 | loss scale: 1.0 | grad norm: 1.445 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:02:40] iteration 76/ 100 | consumed samples: 9728 | elapsed time per iteration (ms): 5588.2 | throughput per GPU (TFLOP/s/GPU): 233.7 | learning rate: 1.520000E-05 | global batch size: 128 | lm loss: 1.860703E+00 | load_balancing_loss: 1.012689E+00 | loss scale: 1.0 | grad norm: 1.500 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:02:46] iteration 77/ 100 | consumed samples: 9856 | elapsed time per iteration (ms): 5425.4 | throughput per GPU (TFLOP/s/GPU): 240.7 | learning rate: 1.540000E-05 | global batch size: 128 | lm loss: 1.827507E+00 | load_balancing_loss: 1.012502E+00 | loss scale: 1.0 | grad norm: 1.491 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:02:52] iteration 78/ 100 | consumed samples: 9984 | elapsed time per iteration (ms): 5652.9 | throughput per GPU (TFLOP/s/GPU): 231.0 | learning rate: 1.560000E-05 | global batch size: 128 | lm loss: 1.784492E+00 | load_balancing_loss: 1.013809E+00 | loss scale: 1.0 | grad norm: 1.407 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:02:57] iteration 79/ 100 | consumed samples: 10112 | elapsed time per iteration (ms): 5577.0 | throughput per GPU (TFLOP/s/GPU): 234.2 | learning rate: 1.580000E-05 | global batch size: 128 | lm loss: 1.858489E+00 | load_balancing_loss: 1.011662E+00 | loss scale: 1.0 | grad norm: 1.621 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:03:03] iteration 80/ 100 | consumed samples: 10240 | elapsed time per iteration (ms): 5712.8 | throughput per GPU (TFLOP/s/GPU): 228.6 | learning rate: 1.600000E-05 | global batch size: 128 | lm loss: 1.842588E+00 | load_balancing_loss: 1.011640E+00 | loss scale: 1.0 | grad norm: 1.631 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:03:09] iteration 81/ 100 | consumed samples: 10368 | elapsed time per iteration (ms): 5684.5 | throughput per GPU (TFLOP/s/GPU): 229.7 | learning rate: 1.620000E-05 | global batch size: 128 | lm loss: 1.818980E+00 | load_balancing_loss: 1.012697E+00 | loss scale: 1.0 | grad norm: 1.564 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:03:14] iteration 82/ 100 | consumed samples: 10496 | elapsed time per iteration (ms): 5592.0 | throughput per GPU (TFLOP/s/GPU): 233.5 | learning rate: 1.640000E-05 | global batch size: 128 | lm loss: 1.805010E+00 | load_balancing_loss: 1.012805E+00 | loss scale: 1.0 | grad norm: 1.545 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:03:20] iteration 83/ 100 | consumed samples: 10624 | elapsed time per iteration (ms): 5641.6 | throughput per GPU (TFLOP/s/GPU): 231.5 | learning rate: 1.660000E-05 | global batch size: 128 | lm loss: 1.812314E+00 | load_balancing_loss: 1.011967E+00 | loss scale: 1.0 | grad norm: 1.530 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:03:25] iteration 84/ 100 | consumed samples: 10752 | elapsed time per iteration (ms): 5563.7 | throughput per GPU (TFLOP/s/GPU): 234.7 | learning rate: 1.680000E-05 | global batch size: 128 | lm loss: 1.822110E+00 | load_balancing_loss: 1.009684E+00 | loss scale: 1.0 | grad norm: 1.799 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:03:31] iteration 85/ 100 | consumed samples: 10880 | elapsed time per iteration (ms): 5580.9 | throughput per GPU (TFLOP/s/GPU): 234.0 | learning rate: 1.700000E-05 | global batch size: 128 | lm loss: 1.831795E+00 | load_balancing_loss: 1.009344E+00 | loss scale: 1.0 | grad norm: 1.578 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:03:37] iteration 86/ 100 | consumed samples: 11008 | elapsed time per iteration (ms): 5695.8 | throughput per GPU (TFLOP/s/GPU): 229.3 | learning rate: 1.720000E-05 | global batch size: 128 | lm loss: 1.831625E+00 | load_balancing_loss: 1.011533E+00 | loss scale: 1.0 | grad norm: 1.515 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:03:42] iteration 87/ 100 | consumed samples: 11136 | elapsed time per iteration (ms): 5444.5 | throughput per GPU (TFLOP/s/GPU): 239.9 | learning rate: 1.740000E-05 | global batch size: 128 | lm loss: 1.814374E+00 | load_balancing_loss: 1.010052E+00 | loss scale: 1.0 | grad norm: 1.365 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:03:48] iteration 88/ 100 | consumed samples: 11264 | elapsed time per iteration (ms): 5462.7 | throughput per GPU (TFLOP/s/GPU): 239.1 | learning rate: 1.760000E-05 | global batch size: 128 | lm loss: 1.825778E+00 | load_balancing_loss: 1.010838E+00 | loss scale: 1.0 | grad norm: 1.506 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:03:53] iteration 89/ 100 | consumed samples: 11392 | elapsed time per iteration (ms): 5633.2 | throughput per GPU (TFLOP/s/GPU): 231.8 | learning rate: 1.780000E-05 | global batch size: 128 | lm loss: 1.818898E+00 | load_balancing_loss: 1.011014E+00 | loss scale: 1.0 | grad norm: 1.358 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:03:59] iteration 90/ 100 | consumed samples: 11520 | elapsed time per iteration (ms): 5567.8 | throughput per GPU (TFLOP/s/GPU): 234.5 | learning rate: 1.800000E-05 | global batch size: 128 | lm loss: 1.813602E+00 | load_balancing_loss: 1.022434E+00 | loss scale: 1.0 | grad norm: 1.590 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2024-04-06 03:04:04] iteration 91/ 100 | consumed samples: 11648 | elapsed time per iteration (ms): 5691.9 | throughput per GPU (TFLOP/s/GPU): 229.4 | learning rate: 1.820000E-05 | global batch size: 128 | lm loss: 1.797111E+00 | load_balancing_loss: 1.011964E+00 | loss
scale: 1.0 | grad norm: 1.436 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | [2024-04-06 03:04:10] iteration 92/ 100 | consumed samples: 11776 | elapsed time per iteration (ms): 5451.5 | throughput per GPU (TFLOP/s/GPU): 239.6 | learning rate: 1.840000E-05 | global batch size: 128 | lm loss: 1.809117E+00 | load_balancing_loss: 1.012038E+00 | loss scale: 1.0 | grad norm: 1.577 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | [2024-04-06 03:04:15] iteration 93/ 100 | consumed samples: 11904 | elapsed time per iteration (ms): 5599.2 | throughput per GPU (TFLOP/s/GPU): 233.2 | learning rate: 1.860000E-05 | global batch size: 128 | lm loss: 1.797812E+00 | load_balancing_loss: 1.011838E+00 | loss scale: 1.0 | grad norm: 1.553 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | [2024-04-06 03:04:21] iteration 94/ 100 | consumed samples: 12032 | elapsed time per iteration (ms): 5443.7 | throughput per GPU (TFLOP/s/GPU): 239.9 | learning rate: 1.880000E-05 | global batch size: 128 | lm loss: 1.865515E+00 | load_balancing_loss: 1.013109E+00 | loss scale: 1.0 | grad norm: 1.603 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | [2024-04-06 03:04:26] iteration 95/ 100 | consumed samples: 12160 | elapsed time per iteration (ms): 5540.0 | throughput per GPU (TFLOP/s/GPU): 235.7 | learning rate: 1.900000E-05 | global batch size: 128 | lm loss: 1.845348E+00 | load_balancing_loss: 1.012796E+00 | loss scale: 1.0 | grad norm: 1.599 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | [2024-04-06 03:04:32] iteration 96/ 100 | consumed samples: 12288 | elapsed time per iteration (ms): 5702.2 | throughput per GPU (TFLOP/s/GPU): 229.0 | learning rate: 1.920000E-05 | global batch size: 128 | lm loss: 1.843516E+00 | load_balancing_loss: 1.010116E+00 | loss scale: 1.0 | grad norm: 1.851 | num zeros: 0.0 | number of skipped 
iterations: 0 | number of nan iterations: 0 | [2024-04-06 03:04:38] iteration 97/ 100 | consumed samples: 12416 | elapsed time per iteration (ms): 5733.2 | throughput per GPU (TFLOP/s/GPU): 227.8 | learning rate: 1.940000E-05 | global batch size: 128 | lm loss: 1.876754E+00 | load_balancing_loss: 1.011542E+00 | loss scale: 1.0 | grad norm: 1.748 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | [2024-04-06 03:04:43] iteration 98/ 100 | consumed samples: 12544 | elapsed time per iteration (ms): 5556.4 | throughput per GPU (TFLOP/s/GPU): 235.0 | learning rate: 1.960000E-05 | global batch size: 128 | lm loss: 1.810738E+00 | load_balancing_loss: 1.010371E+00 | loss scale: 1.0 | grad norm: 1.472 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | [2024-04-06 03:04:49] iteration 99/ 100 | consumed samples: 12672 | elapsed time per iteration (ms): 5523.5 | throughput per GPU (TFLOP/s/GPU): 236.4 | learning rate: 1.980000E-05 | global batch size: 128 | lm loss: 1.872008E+00 | load_balancing_loss: 1.008882E+00 | loss scale: 1.0 | grad norm: 1.681 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | [2024-04-06 03:04:54] iteration 100/ 100 | consumed samples: 12800 | elapsed time per iteration (ms): 5540.0 | throughput per GPU (TFLOP/s/GPU): 235.7 | learning rate: 2.000000E-05 | global batch size: 128 | lm loss: 1.824753E+00 | load_balancing_loss: 1.009905E+00 | loss scale: 1.0 | grad norm: 1.625 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 | [after training is done] datetime: 2024-04-06 03:04:54
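For anyone comparing runs, the per-iteration numbers in a log like this one (224.0 to 243.6 TFLOP/s/GPU over iterations 71 to 100) can be averaged with a few lines of Python. This is only a minimal sketch: `summarize` is a hypothetical helper name, the regex assumes the field labels exactly as they appear in the log above, and the two sample entries are shortened copies of iterations 71 and 72.

```python
import re
from statistics import mean

# Matches the two fields of interest in one Megatron-style iteration entry.
# The field labels are copied from the log above; other versions may differ.
ITER_RE = re.compile(
    r"elapsed time per iteration \(ms\): (?P<ms>[\d.]+)"
    r".*?throughput per GPU \(TFLOP/s/GPU\): (?P<tflops>[\d.]+)"
)

def summarize(log_text):
    """Return (mean ms per iteration, mean TFLOP/s/GPU) over all matched entries."""
    rows = [(float(m["ms"]), float(m["tflops"])) for m in ITER_RE.finditer(log_text)]
    if not rows:
        raise ValueError("no iteration entries found")
    return mean(ms for ms, _ in rows), mean(t for _, t in rows)

# Shortened copies of iterations 71 and 72 from the log above.
sample = (
    "iteration 71/ 100 | consumed samples: 9088 | elapsed time per iteration (ms): 5361.6 | "
    "throughput per GPU (TFLOP/s/GPU): 243.6 | learning rate: 1.420000E-05 |\n"
    "iteration 72/ 100 | consumed samples: 9216 | elapsed time per iteration (ms): 5470.7 | "
    "throughput per GPU (TFLOP/s/GPU): 238.7 | learning rate: 1.440000E-05 |\n"
)
avg_ms, avg_tflops = summarize(sample)
print(f"{avg_ms:.1f} ms/iter, {avg_tflops:.1f} TFLOP/s/GPU")
```

The pattern does not cross line breaks, so it expects each entry's two fields to sit on the same line.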
Which modification brings the most speed improvement?
By the way, I ran into an error when converting Mixtral from transformers to Megatron with grouped-gemm enabled. Could you share your conversion script?
@ShinoharaHare Could you please share your checkpoint conversion script?
Which modification brings the most speed improvement? By the way, I ran into an error when converting Mixtral from transformers to Megatron with grouped-gemm enabled. Could you share your conversion script?
The most significant performance improvement comes from resuming from a trained checkpoint. If you do not have pretrained weights, you can train from scratch for about 500 steps first. We noticed that after several hundred steps, the token distribution across experts becomes quite balanced.
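To make "balanced token distribution" concrete, here is a minimal sketch of how one might quantify it. Everything in it is an assumption for illustration: `expert_load` and `imbalance` are hypothetical helpers (not Megatron-LM APIs), the input is a plain list of per-token top-k expert indices, and the two routing examples are made-up toy data.

```python
from collections import Counter

def expert_load(assignments, num_experts):
    """Count routed tokens per expert.

    `assignments` holds, for each token, the expert indices chosen by the
    router (top-k). This input format is an assumption for illustration.
    """
    counts = Counter(e for token in assignments for e in token)
    return [counts.get(e, 0) for e in range(num_experts)]

def imbalance(load):
    """Max load divided by mean load; 1.0 means perfectly balanced."""
    avg = sum(load) / len(load)
    return max(load) / avg

# Toy data: a skewed router sends most tokens to experts 0 and 1.
skewed = [(0, 1), (0, 1), (0, 1), (0, 2)]
# Toy data: tokens spread evenly over 4 experts.
balanced = [(0, 1), (2, 3), (0, 2), (1, 3)]

print(imbalance(expert_load(skewed, 4)))    # 2.0
print(imbalance(expert_load(balanced, 4)))  # 1.0
```

A ratio well above 1.0 means the busiest expert handles far more tokens than the average one.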
@yanring, @ShinoharaHare, can you please share a conversion script for Mixtral from HF weights?
Hi Vlad, we are working on the converter; it is already in the review process.
Hi, I'm in a similar situation to this issue, with some differences: we use 8 H800 GPUs, 64 experts, ep=8, tp=1, pp=1. I also ran into some training efficiency issues, but they are not my top priority.
What bothers me now is that after enabling ep=8 and grouped-gemm, my model structure changed.
When I merge the ep=8 model into an ep=1 model, the inference program loads it normally, which indicates that the merged shapes are correct.
But the inference results are incorrect. Will Megatron-LM provide a model conversion tool that can help me merge the ep=8 model into an ep=1 model?
Or could you provide some information on how to merge a grouped-gemm enabled model?
Hello @hwdef, thank you for the update. Currently, the weight format for each expert in GroupedGEMM is [input_size, output_size], which differs from the [output_size, input_size] format used by ParallelLinear in SequentialMLP. Did you transpose the weights during your conversion? @cb521 can help take a look if the issue persists.
By the way, we are also working on supporting distributed checkpointing with Grouped GEMM.
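The layout difference mentioned above ([input_size, output_size] per expert for GroupedGEMM versus [output_size, input_size] for SequentialMLP) amounts to a per-expert transpose. The sketch below only illustrates that step on toy nested lists; `transpose` and `grouped_to_sequential` are hypothetical helpers, and how experts are packed and named inside a real checkpoint is not covered here.

```python
def transpose(matrix):
    """Transpose a 2-D list-of-lists matrix."""
    return [list(col) for col in zip(*matrix)]

def grouped_to_sequential(expert_weights):
    """Convert each expert's weight from [input_size, output_size]
    to [output_size, input_size]."""
    return [transpose(w) for w in expert_weights]

# Two toy experts with input_size=2 and output_size=3 (made-up values).
grouped = [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [10, 11, 12]],
]
sequential = grouped_to_sequential(grouped)
print(sequential[0])  # [[1, 4], [2, 5], [3, 6]]
```

With real tensors the same step would typically be `weight.t().contiguous()` in PyTorch. If a merged model loads with the right shapes but produces wrong outputs, a missed or doubled transpose is one thing worth checking.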
Yes, we have considered the order of output_size and input_size
@yanring, @ShinoharaHare, can you please share a conversion script for Mixtral from HF weights?
Hi Vlad, we are working on the converter; it is already in the review process.
I’m excited about this. When do you plan to merge it into the main branch?