PKU-DAIR/Hetu-Galvatron

Error when training Galvatron in global mode: CUDA error: uncorrectable ECC error encountered


I set the environment variables as follows in train_dist.sh in the gpt_hf folder:

export NUM_NODES=1
export NUM_GPUS_PER_NODE=8
export MASTER_ADDR=localhost
export MASTER_PORT=2222
export NODE_RANK=0
export CUDA_DEVICE_MAX_CONNECTIONS=1
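
For reference, train_dist.sh consumes these variables via torch.distributed.launch; below is a minimal sketch of such a launch command (the trailing script arguments are elided and illustrative, not the exact contents of the script):

python -m torch.distributed.launch \
    --nnodes=$NUM_NODES \
    --nproc_per_node=$NUM_GPUS_PER_NODE \
    --node_rank=$NODE_RANK \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    train_dist.py  # ...model and training flags follow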

I used a cluster with 8 NVIDIA RTX A6000 GPUs; the CUDA version is release 11.8, V11.8.89.
I then encountered the following error:

FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Zarr-based strategies will not be registered because of missing packages
Zarr-based strategies will not be registered because of missing packages
Zarr-based strategies will not be registered because of missing packages
using world size: 8, data-parallel size: 8, context-parallel size: 1 tensor-model-parallel size: 1, pipeline-model-parallel size: 1 
setting global batch size to 8
WARNING: Setting args.overlap_p2p_comm to False since non-interleaved schedule does not support overlapping p2p communication
using torch.float32 for parameters ...
------------------------ arguments ------------------------
  accumulate_allreduce_grads_in_fp32 .............. False
  adam_beta1 ...................................... 0.9
  adam_beta2 ...................................... 0.999
  adam_eps ........................................ 1e-08
  adam_weight_decay ............................... 0.01
  add_bias_linear ................................. True
  add_position_embedding .......................... True
  adlr_autoresume ................................. False
  adlr_autoresume_interval ........................ 1000
  allow_tf32 ...................................... 1
  apply_layernorm_1p .............................. False
  apply_query_key_layer_scaling ................... False
  apply_residual_connection_post_layernorm ........ False
  async_tensor_model_parallel_allreduce ........... False
  attention_dropout ............................... 0.1
  attention_softmax_in_fp32 ....................... False
  barrier_with_L1_time ............................ True
  bert_binary_head ................................ True
  bert_embedder_type .............................. megatron
  bert_load ....................................... None
  bf16 ............................................ False
  bias_dropout_fusion ............................. True
  bias_gelu_fusion ................................ True
  biencoder_projection_dim ........................ 0
  biencoder_shared_query_context_model ............ False
  block_data_path ................................. None
  check_for_nan_in_loss_and_grad .................. True
  check_loss ...................................... 0
  chunks .......................................... 8
  classes_fraction ................................ 1.0
  clip_grad ....................................... 1.0
  clone_scatter_output_in_embedding ............... True
  consumed_train_samples .......................... 0
  consumed_valid_samples .......................... 0
  context_parallel_size ........................... 1
  data_cache_path ................................. None
  data_parallel_random_init ....................... False
  data_parallel_size .............................. 8
  data_path ....................................... None
  data_per_class_fraction ......................... 1.0
  data_sharding ................................... True
  dataloader_type ................................. single
  decoder_num_layers .............................. None
  decoder_seq_length .............................. None
  default_dp_type ................................. zero2
  delay_grad_reduce ............................... True
  delay_param_gather .............................. False
  dino_bottleneck_size ............................ 256
  dino_freeze_last_layer .......................... 1
  dino_head_hidden_size ........................... 2048
  dino_local_crops_number ......................... 10
  dino_local_img_size ............................. 96
  dino_norm_last_layer ............................ False
  dino_teacher_temp ............................... 0.07
  dino_warmup_teacher_temp ........................ 0.04
  dino_warmup_teacher_temp_epochs ................. 30
  distribute_saved_activations .................... False
  distributed_backend ............................. nccl
  distributed_timeout_minutes ..................... 10
  dropout_prob .................................... 0.1
  embed_sdp ....................................... 0
  embedding_path .................................. None
  empty_unused_memory_level ....................... 0
  encoder_num_layers .............................. 24
  encoder_seq_length .............................. None
  end_weight_decay ................................ 0.01
  eod_mask_loss ................................... False
  epochs .......................................... 10
  eval_interval ................................... 1000
  eval_iters ...................................... 100
  evidence_data_path .............................. None
  exit_after_profiling ............................ 1
  exit_duration_in_mins ........................... None
  exit_interval ................................... None
  exit_on_missing_checkpoint ...................... False
  exit_signal_handler ............................. False
  expert_model_parallel_size ...................... 1
  ffn_hidden_size ................................. 6400
  finetune ........................................ False
  fp16 ............................................ False
  fp16_lm_cross_entropy ........................... False
  fp32_residual_connection ........................ False
  fp8 ............................................. None
  fp8_amax_compute_algo ........................... most_recent
  fp8_amax_history_len ............................ 1
  fp8_interval .................................... 1
  fp8_margin ...................................... 0
  fp8_wgrad ....................................... True
  galvatron_config_path ........................... None
  global_batch_size ............................... 8
  global_checkpoint ............................... 0
  global_tp_consec ................................ 1
  global_tp_deg ................................... 2
  global_train_batch_size ......................... 16
  gradient_accumulation_fusion .................... False
  group_query_attention ........................... False
  head_lr_mult .................................... 1.0
  hidden_dropout .................................. 0.1
  hidden_size ..................................... 1600
  hysteresis ...................................... 2
  ict_head_size ................................... None
  ict_load ........................................ None
  img_h ........................................... 224
  img_w ........................................... 224
  indexer_batch_size .............................. 128
  indexer_log_interval ............................ 1000
  inference_batch_times_seqlen_threshold .......... 512
  init_method_std ................................. 0.02
  init_method_xavier_uniform ...................... False
  initial_loss_scale .............................. 4294967296
  initialize_on_meta .............................. 0
  iter_per_epoch .................................. 1250
  kv_channels ..................................... 50
  lazy_mpu_init ................................... None
  load ............................................ None
  load_params ..................................... 0
  local_rank ...................................... 0
  log_batch_size_to_tensorboard ................... False
  log_interval .................................... 100
  log_learning_rate_to_tensorboard ................ True
  log_loss_scale_to_tensorboard ................... True
  log_memory_to_tensorboard ....................... False
  log_num_zeros_in_grad ........................... False
  log_params_norm ................................. False
  log_throughput .................................. False
  log_timers_to_tensorboard ....................... False
  log_validation_ppl_to_tensorboard ............... False
  log_world_size_to_tensorboard ................... False
  loss_scale ...................................... None
  loss_scale_window ............................... 1000
  lr .............................................. 0.0001
  lr_decay_iters .................................. None
  lr_decay_samples ................................ None
  lr_decay_style .................................. linear
  lr_warmup_fraction .............................. None
  lr_warmup_init .................................. 0.0
  lr_warmup_iters ................................. 0
  lr_warmup_samples ............................... 0
  make_vocab_size_divisible_by .................... 128
  manual_gc ....................................... False
  manual_gc_eval .................................. True
  manual_gc_interval .............................. 0
  mask_factor ..................................... 1.0
  mask_prob ....................................... 0.15
  mask_type ....................................... random
  masked_softmax_fusion ........................... True
  max_position_embeddings ......................... 512
  max_predictions_per_seq ......................... 20
  max_tokens_to_oom ............................... 12000
  merge_file ...................................... None
  micro_batch_size ................................ 1
  min_loss_scale .................................. 1.0
  min_lr .......................................... 0.0
  mixed_precision ................................. bf16
  model_size ...................................... gpt-0.3b
  nccl_communicator_config_path ................... None
  no_load_optim ................................... None
  no_load_rng ..................................... None
  no_persist_layer_norm ........................... False
  no_save_optim ................................... None
  no_save_rng ..................................... None
  norm_epsilon .................................... 1e-05
  normalization ................................... LayerNorm
  num_attention_heads ............................. 32
  num_channels .................................... 3
  num_classes ..................................... 1000
  num_experts ..................................... None
  num_hidden_layers ............................... 12
  num_layers ...................................... 24
  num_layers_per_virtual_pipeline_stage ........... None
  num_query_groups ................................ 1
  num_workers ..................................... 2
  onnx_safe ....................................... None
  openai_gelu ..................................... False
  optimizer ....................................... adam
  output_bert_embeddings .......................... False
  overlap_grad_reduce ............................. False
  overlap_p2p_comm ................................ False
  overlap_param_gather ............................ False
  override_opt_param_scheduler .................... False
  params_dtype .................................... torch.float32
  patch_dim ....................................... 16
  perform_initialization .......................... True
  pipeline_model_parallel_size .................... 1
  pipeline_model_parallel_split_rank .............. None
  pipeline_type ................................... pipedream_flush
  position_embedding_type ......................... learned_absolute
  pp_deg .......................................... 2
  profile ......................................... 1
  profile_forward ................................. 0
  profile_ranks ................................... [0]
  profile_step_end ................................ 12
  profile_step_start .............................. 10
  profile_type .................................... allocated
  query_in_block_prob ............................. 0.1
  rampup_batch_size ............................... None
  rank ............................................ 0
  recompute_granularity ........................... None
  recompute_method ................................ None
  recompute_num_layers ............................ None
  reset_attention_mask ............................ False
  reset_position_ids .............................. False
  retriever_report_topk_accuracies ................ []
  retriever_score_scaling ......................... False
  retriever_seq_length ............................ 256
  retro_add_retriever ............................. False
  retro_cyclic_train_iters ........................ None
  retro_encoder_attention_dropout ................. 0.1
  retro_encoder_hidden_dropout .................... 0.1
  retro_encoder_layers ............................ 2
  retro_num_neighbors ............................. 2
  retro_num_retrieved_chunks ...................... 2
  retro_return_doc_ids ............................ False
  retro_verify_neighbor_count ..................... True
  retro_workdir ................................... None
  rotary_percent .................................. 1.0
  rotary_seq_len_interpolation_factor ............. None
  sample_rate ..................................... 1.0
  save ............................................ None
  save_interval ................................... None
  save_profiled_memory ............................ 0
  scatter_gather_tensors_in_pipeline .............. True
  sdp ............................................. 0
  seed ............................................ 1234
  seq_length ...................................... 2048
  sequence_parallel ............................... True
  set_layernum_manually ........................... 0
  set_model_config_manually ....................... 0
  sgd_momentum .................................... 0.9
  shape_order ..................................... SBH
  short_seq_prob .................................. 0.1
  skip_train ...................................... False
  spec ............................................ None
  split ........................................... 969, 30, 1
  squared_relu .................................... False
  standalone_embedding_stage ...................... False
  start_weight_decay .............................. 0.01
  swiglu .......................................... False
  swin_backbone_type .............................. tiny
  tensor_model_parallel_size ...................... 1
  tensorboard_dir ................................. None
  tensorboard_log_interval ........................ 1
  tensorboard_queue_size .......................... 1000
  test_data_path .................................. None
  timing_log_level ................................ 0
  timing_log_option ............................... minmax
  titles_data_path ................................ None
  tokenizer_model ................................. None
  tokenizer_type .................................. BertWordPieceLowerCase
  tp_comm_bulk_dgrad .............................. True
  tp_comm_bulk_wgrad .............................. True
  tp_comm_overlap ................................. False
  tp_comm_overlap_cfg ............................. None
  tp_comm_split_ag ................................ True
  tp_comm_split_rs ................................ True
  train_data_path ................................. None
  train_iters ..................................... None
  train_samples ................................... None
  transformer_impl ................................ local
  transformer_pipeline_model_parallel_size ........ 1
  untie_embeddings_and_output_weights ............. False
  use_checkpoint_args ............................. False
  use_checkpoint_opt_param_scheduler .............. False
  use_cpu_initialization .......................... True
  use_distributed_optimizer ....................... False
  use_flash_attn .................................. True
  use_mcore_models ................................ False
  use_one_sent_docs ............................... False
  use_ring_exchange_p2p ........................... False
  use_rotary_position_embeddings .................. False
  valid_data_path ................................. None
  variable_seq_lengths ............................ False
  virtual_pipeline_model_parallel_size ............ None
  vision_backbone_type ............................ vit
  vision_pretraining .............................. False
  vision_pretraining_type ......................... classify
  vocab_extra_ids ................................. 0
  vocab_file ...................................... None
  vocab_size ...................................... 50257
  vocab_tp ........................................ 4
  wandb_exp_name .................................. 
  wandb_project ................................... 
  wandb_save_dir .................................. 
  weight_decay .................................... 0.01
  weight_decay_incr_style ......................... constant
  world_size ...................................... 8
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 1
Zarr-based strategies will not be registered because of missing packages
Zarr-based strategies will not be registered because of missing packages
> initializing torch distributed ...
Zarr-based strategies will not be registered because of missing packages
Zarr-based strategies will not be registered because of missing packages
Zarr-based strategies will not be registered because of missing packages
> initialized tensor model parallel with size 1
> initialized pipeline model parallel with size 1
> setting random seeds to 1234 ...
GPT2Config {
  "activation_function": "gelu_new",
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "head_dim": 64,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_embd": 1024,
  "n_head": 16,
  "n_inner": null,
  "n_layer": 24,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "transformers_version": "4.41.2",
  "use_cache": false,
  "vocab_size": 50257
}

------------------------ arguments ------------------------
  hidden_size ..................................... 1024
  kv_channels ..................................... 64
  num_attention_heads ............................. 16
  num_hidden_layers ............................... 24
  padded_vocab_size ............................... 50304
  seq_length ...................................... 1024
  (all other arguments are identical to the first arguments dump above; the entries listed here changed after the GPT2Config was loaded)
-------------------- end of arguments ---------------------
======================== Galvatron Parallel Config =============================
Galvatron parallel config mode: [GLOBAL config mode]
[GLOBAL config mode] Loaded global hybrid parallel strategy:
   global_batch_size: 16, chunks: 8
   pp_deg: 2, tp_deg: 2, dp_deg: 2, tp_consecutive_flag: 1, checkpoint_flag: 0
   pipeline_type: pipedream_flush, default_dp_type: zero2, dtype: bf16
   pp_division:                  [12, 12]
   pp_ranks:                     [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
================================================================================
Creating Model...
Model Layer Types:
['embed', 'gpt_dec', 'gpt_dec', 'gpt_dec', 'gpt_dec', 'gpt_dec', 'gpt_dec', 'gpt_dec', 'gpt_dec', 'gpt_dec', 'gpt_dec', 'gpt_dec', 'gpt_dec', 'gpt_dec', 'gpt_dec', 'gpt_dec', 'gpt_dec', 'gpt_dec', 'gpt_dec', 'gpt_dec', 'gpt_dec', 'gpt_dec', 'gpt_dec', 'gpt_dec', 'gpt_dec', 'norm', 'cls']
   tp_sizes_whole:               [4, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, 4]
   tp_consec_whole:              [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
   dp_types_whole:               [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
   pp_ranks_whole:               [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
   checkpoint_flags_whole:       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
   dp_sizes_whole:               [1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1]
================================================================================
====================== Galvatron Communication Group ===========================
TP groups for rank 5 (all layers):
[4, 5, 6, 7] [4, 5] [4, 5] [4, 5] [4, 5] [4, 5] [4, 5] [4, 5] [4, 5] [4, 5] [4, 5] [4, 5] [4, 5] [4, 5] [4, 5] [4, 5] [4, 5] [4, 5] [4, 5] [4, 5] [4, 5] [4, 5] [4, 5] [4, 5] [4, 5] [4, 5, 6, 7] [4, 5, 6, 7] 
DP groups for rank 5 (all layers):
[5] [5, 7] [5, 7] [5, 7] [5, 7] [5, 7] [5, 7] [5, 7] [5, 7] [5, 7] [5, 7] [5, 7] [5, 7] [5, 7] [5, 7] [5, 7] [5, 7] [5, 7] [5, 7] [5, 7] [5, 7] [5, 7] [5, 7] [5, 7] [5, 7] [5] [5] 
Split groups for rank 5:
None [4, 5, 6, 7] None None None None None None None None None None None None None None None None None None None None None None None [4, 5] None 
AllGather groups for rank 5:
None [4, 5] None None None None None None None None None None None None None None None None None None None None None None None [4, 5, 6, 7] None 
Fused split groups for rank 5:
None [5, 7] None None None None None None None None None None None None None None None None None None None None None None None None None 
Fused allgather groups for rank 5:
None None None None None None None None None None None None None None None None None None None None None None None None None [5, 7] None 
================================================================================
Traceback (most recent call last):
  File "train_dist.py", line 109, in <module>
    train(args)
  File "train_dist.py", line 55, in train
    model = construct_hybrid_parallel_model(
  File "/home/wyr/Hetu-Galvatron/galvatron/models/gpt_hf/GPTModel_hybrid_parallel.py", line 12, in construct_hybrid_parallel_model
    hp_model = construct_hybrid_parallel_model_api(
  File "/home/wyr/Hetu-Galvatron/galvatron/core/hybrid_parallel_model.py", line 138, in construct_hybrid_parallel_model_api
    hp_model.wrap_pipeline_modules_data_parallel(
  File "/home/wyr/Hetu-Galvatron/galvatron/core/pipeline/pipeline.py", line 127, in wrap_pipeline_modules_data_parallel
    self.model_cur_stage = wrap_modules_data_parallel(
  File "/home/wyr/Hetu-Galvatron/galvatron/core/parallel.py", line 185, in wrap_modules_data_parallel
    module_list[i] = wrap_data_parallel(module_list[i], dp_types[i], dp_groups[i], module_type=module_types[i], pp_device = pp_device, mixed_precision=mixed_precision, pp_on=pp_on, wrap_block_name=wrap_block_name)
  File "/home/wyr/Hetu-Galvatron/galvatron/core/parallel.py", line 20, in wrap_data_parallel
    return wrap_module_fsdp_manually(module, pp_device, module_type, dp_group, fsdp_type=fsdp_type_dict[dp_type], mixed_precision=mixed_precision, pp_on=pp_on, wrap_block_name=wrap_block_name)
  File "/home/wyr/Hetu-Galvatron/galvatron/core/parallel.py", line 54, in wrap_module_fsdp_manually
    module = apply_fsdp(module, fsdp_args, wrap_block_name)
  File "/home/wyr/Hetu-Galvatron/galvatron/core/parallel.py", line 98, in apply_fsdp
    _recursive_wrap(
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 370, in _recursive_wrap
    wrapped_child, num_wrapped_params = _recursive_wrap(
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 370, in _recursive_wrap
    wrapped_child, num_wrapped_params = _recursive_wrap(
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 370, in _recursive_wrap
    wrapped_child, num_wrapped_params = _recursive_wrap(
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 388, in _recursive_wrap
    return _wrap(module, wrapper_cls, **kwargs), nonwrapped_numel
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 317, in _wrap
    return wrapper_cls(module, **kwargs)
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 408, in __init__
    _init_param_handle_from_module(
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/fsdp/_init_utils.py", line 415, in _init_param_handle_from_module
    _move_module_to_device(
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/fsdp/_init_utils.py", line 802, in _move_module_to_device
    module = module.to(device_from_device_id)
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1145, in to
    return self._apply(convert)
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/nn/modules/module.py", line 820, in _apply
    param_applied = fn(param)
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: uncorrectable ECC error encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/fsdp/_init_utils.py:295: UserWarning: FSDP is switching to use `NO_SHARD` instead of ShardingStrategy.SHARD_GRAD_OP since the world size is 1.
  warnings.warn(
/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/fsdp/_init_utils.py:295: UserWarning: FSDP is switching to use `NO_SHARD` instead of ShardingStrategy.SHARD_GRAD_OP since the world size is 1.
  warnings.warn(
/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/fsdp/_init_utils.py:295: UserWarning: FSDP is switching to use `NO_SHARD` instead of ShardingStrategy.SHARD_GRAD_OP since the world size is 1.
  warnings.warn(
/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/fsdp/_init_utils.py:295: UserWarning: FSDP is switching to use `NO_SHARD` instead of ShardingStrategy.SHARD_GRAD_OP since the world size is 1.
  warnings.warn(
/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/fsdp/_init_utils.py:295: UserWarning: FSDP is switching to use `NO_SHARD` instead of ShardingStrategy.SHARD_GRAD_OP since the world size is 1.
  warnings.warn(
/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/fsdp/_init_utils.py:295: UserWarning: FSDP is switching to use `NO_SHARD` instead of ShardingStrategy.SHARD_GRAD_OP since the world size is 1.
  warnings.warn(
Creating Dataset...
/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/fsdp/_init_utils.py:295: UserWarning: FSDP is switching to use `NO_SHARD` instead of ShardingStrategy.SHARD_GRAD_OP since the world size is 1.
  warnings.warn(
After creating model [Allocated]
        Max memory: 298.26 MB   Current memory : 248.26 MB

Before Forward [Allocated]
        Max memory: 248.39 MB   Current memory : 248.39 MB
After creating model [Allocated]
        Max memory: 319.77 MB   Current memory : 252.25 MB
Start training...

Before Forward [Allocated]
        Max memory: 252.38 MB   Current memory : 252.38 MB
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2943546 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2943553 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2943554 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2943555 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2943556 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2943557 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2943562 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 6 (pid: 2943559) of binary: /home/wyr/anaconda3/envs/galvatron/bin/python3
Traceback (most recent call last):
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/wyr/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 

Could you please help me resolve this issue, or suggest some possible solutions? Thank you for your help and support!

Hello, we've checked the same configuration on 8 NVIDIA A100 GPUs and it works fine. Could you please pull our latest code and run again? Feel free to report back! Thank you!
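
A minimal update sketch, assuming Galvatron was installed in editable mode from a local clone (your install steps may differ):

cd Hetu-Galvatron
git pull
pip install -e .   # re-install in case dependencies or packaging changed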

Thank you for your comment. The issue I encountered was likely caused by a hardware failure. After carefully reviewing the code, I was able to run the project successfully.
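
For anyone who hits the same error: an uncorrectable ECC error usually points to faulty GPU memory rather than a software bug. A minimal health-check sketch using standard nvidia-smi queries (the excluded GPU index below is illustrative, taken from the failing local_rank 6 in the log above):

# Show detailed ECC error counters for every GPU
nvidia-smi -q -d ECC

# Or query just the uncorrected error counts per GPU
nvidia-smi --query-gpu=index,name,ecc.errors.uncorrected.volatile.total --format=csv

# Workaround: hide a GPU that reports errors (e.g. index 6) before relaunching,
# and adjust NUM_GPUS_PER_NODE accordingly
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,7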