epfLLM/Megatron-LLM

Getting started "shard" model not working

Closed this issue · 9 comments

First of all, thank you for creating this project! It looks very exciting and interesting due to its close Hugging Face integration.
I am very curious and wanted to give it a try by following the Getting Started guide in the documentation, but I ran into an error during the "Model Sharding" step, resulting in a Bus error (core dumped).

I am running on a single node with 8x A100 80GB GPUs and 1TB of memory. I followed the exact steps in the guide and used the provided container.

Below is the full error stack in case it's helpful. It includes quite a lot of odd C++ warnings at the beginning. I installed the package with

cd Megatron-LLM
pip install -r requirements.txt
cd megatron/data/
make
cd ../../

in the container.

Error Stack
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu: In instantiation of ‘void HostLayerNormGradient(const V*, const U*, const U*, at::Tensor*, int, int, const V*, const V*, double, T*, V*, V*) [with T = float; U = float; V = c10::Half]’:
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:800:95:   required from here
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:138: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
  737 |       cuComputePartGradGammaBeta<<<blocks2, threads2, nshared2, stream>>>(
      |                                                                                                                                          ^ 
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
  245 |   T * data() const {
      | ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:210: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
  737 |       cuComputePartGradGammaBeta<<<blocks2, threads2, nshared2, stream>>>(
      |                                                                                                                                                                                                                  ^ 
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
  245 |   T * data() const {
      | ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:247: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
  737 |       cuComputePartGradGammaBeta<<<blocks2, threads2, nshared2, stream>>>(
      |                                                                                                                                                                                                                                                       ^ 
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
  245 |   T * data() const {
      | ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:750:137: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
  750 |       cuComputeGradGammaBeta<<<blocks3, threads3, nshared3, stream>>>(
      |                                                                                                                                         ^ 
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
  245 |   T * data() const {
      | ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:750:174: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
  750 |       cuComputeGradGammaBeta<<<blocks3, threads3, nshared3, stream>>>(
      |                                                                                                                                                                              ^ 
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
  245 |   T * data() const {
      | ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:768:129: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
  768 |     cuComputeGradInput<<<blocks1, threads1, nshared, stream>>>(
      |                                                                                                                                 ^ 
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
  245 |   T * data() const {
      | ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu: In instantiation of ‘void HostLayerNormGradient(const V*, const U*, const U*, at::Tensor*, int, int, const V*, const V*, double, T*, V*, V*) [with T = float; U = float; V = c10::BFloat16]’:
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:800:103:   required from here
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:138: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
  737 |       cuComputePartGradGammaBeta<<<blocks2, threads2, nshared2, stream>>>(
      |                                                                                                                                          ^ 
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
  245 |   T * data() const {
      | ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:210: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
  737 |       cuComputePartGradGammaBeta<<<blocks2, threads2, nshared2, stream>>>(
      |                                                                                                                                                                                                                  ^ 
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
  245 |   T * data() const {
      | ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:247: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
  737 |       cuComputePartGradGammaBeta<<<blocks2, threads2, nshared2, stream>>>(
      |                                                                                                                                                                                                                                                       ^ 
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
  245 |   T * data() const {
      | ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:750:137: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
  750 |       cuComputeGradGammaBeta<<<blocks3, threads3, nshared3, stream>>>(
      |                                                                                                                                         ^ 
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
  245 |   T * data() const {
      | ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:750:174: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
  750 |       cuComputeGradGammaBeta<<<blocks3, threads3, nshared3, stream>>>(
      |                                                                                                                                                                              ^ 
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
  245 |   T * data() const {
      | ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:768:129: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
  768 |     cuComputeGradInput<<<blocks1, threads1, nshared, stream>>>(
      |                                                                                                                                 ^ 
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
  245 |   T * data() const {
      | ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu: In instantiation of ‘void HostLayerNormGradient(const V*, const U*, const U*, at::Tensor*, int, int, const V*, const V*, double, T*, V*, V*) [with T = c10::Half; U = float; V = c10::Half]’:
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:800:127:   required from here
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:138: warning: ‘T* at::Tensor::data() const [with T = c10::Half]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
  737 |       cuComputePartGradGammaBeta<<<blocks2, threads2, nshared2, stream>>>(
      |                                                                                                                                          ^ 
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
  245 |   T * data() const {
      | ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:210: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
  737 |       cuComputePartGradGammaBeta<<<blocks2, threads2, nshared2, stream>>>(
      |                                                                                                                                                                                                                  ^ 
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
  245 |   T * data() const {
      | ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:247: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
  737 |       cuComputePartGradGammaBeta<<<blocks2, threads2, nshared2, stream>>>(
      |                                                                                                                                                                                                                                                       ^ 
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
  245 |   T * data() const {
      | ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:750:137: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
  750 |       cuComputeGradGammaBeta<<<blocks3, threads3, nshared3, stream>>>(
      |                                                                                                                                         ^ 
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
  245 |   T * data() const {
      | ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:750:174: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
  750 |       cuComputeGradGammaBeta<<<blocks3, threads3, nshared3, stream>>>(
      |                                                                                                                                                                              ^ 
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
  245 |   T * data() const {
      | ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:768:129: warning: ‘T* at::Tensor::data() const [with T = c10::Half]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
  768 |     cuComputeGradInput<<<blocks1, threads1, nshared, stream>>>(
      |                                                                                                                                 ^ 
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
  245 |   T * data() const {
      | ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu: In instantiation of ‘void HostLayerNormGradient(const V*, const U*, const U*, at::Tensor*, int, int, const V*, const V*, double, T*, V*, V*) [with T = c10::BFloat16; U = float; V = c10::BFloat16]’:
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:800:138:   required from here
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:138: warning: ‘T* at::Tensor::data() const [with T = c10::BFloat16]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
  737 |       cuComputePartGradGammaBeta<<<blocks2, threads2, nshared2, stream>>>(
      |                                                                                                                                          ^ 
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
  245 |   T * data() const {
      | ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:210: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
  737 |       cuComputePartGradGammaBeta<<<blocks2, threads2, nshared2, stream>>>(
      |                                                                                                                                                                                                                  ^ 
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
  245 |   T * data() const {
      | ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:247: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
  737 |       cuComputePartGradGammaBeta<<<blocks2, threads2, nshared2, stream>>>(
      |                                                                                                                                                                                                                                                       ^ 
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
  245 |   T * data() const {
      | ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:750:137: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
  750 |       cuComputeGradGammaBeta<<<blocks3, threads3, nshared3, stream>>>(
      |                                                                                                                                         ^ 
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
  245 |   T * data() const {
      | ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:750:174: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
  750 |       cuComputeGradGammaBeta<<<blocks3, threads3, nshared3, stream>>>(
      |                                                                                                                                                                              ^ 
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
  245 |   T * data() const {
      | ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:768:129: warning: ‘T* at::Tensor::data() const [with T = c10::BFloat16]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
  768 |     cuComputeGradInput<<<blocks1, threads1, nshared, stream>>>(
      |                                                                                                                                 ^ 
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
  245 |   T * data() const {
      | ^ ~~
[3/3] c++ layer_norm_cuda.o layer_norm_cuda_kernel.cuda.o -shared -L/usr/local/lib/python3.10/dist-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o fused_mix_prec_layer_norm_cuda.so
Loading extension module fused_mix_prec_layer_norm_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /epfllm/Megatron-LLM/megatron/fused_kernels/build/build.ninja...
Building extension module fused_dense_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] c++ -MMD -MF fused_weight_gradient_dense.o.d -DTORCH_EXTENSION_NAME=fused_dense_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=1 -fPIC -std=c++17 -O3 -c /epfllm/Megatron-LLM/megatron/fused_kernels/fused_weight_gradient_dense.cpp -o fused_weight_gradient_dense.o 
[2/3] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=fused_dense_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -gencode arch=compute_70,code=sm_70 --use_fast_math -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -gencode arch=compute_80,code=sm_80 -std=c++17 -c /epfllm/Megatron-LLM/megatron/fused_kernels/fused_weight_gradient_dense.cu -o fused_weight_gradient_dense.cuda.o 
[3/3] c++ fused_weight_gradient_dense.o fused_weight_gradient_dense.cuda.o -shared -L/usr/local/lib/python3.10/dist-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o fused_dense_cuda.so
Loading extension module fused_dense_cuda...
Building model ...
/epfllm/Megatron-LLM/megatron/model/llama_model.py:38: UserWarning: Llama is not intended to use dropout
  warnings.warn( "Llama is not intended to use dropout")
/epfllm/Megatron-LLM/megatron/model/llama_model.py:40: UserWarning: Llama is not intended to use dropout
  warnings.warn( "Llama is not intended to use dropout")
 loading release checkpoint from ./model
 checkpoint version 3.0
  successfully loaded checkpoint from ./model at iteration 0
using world size: 4, data-parallel-size: 1, tensor-model-parallel size: 4, pipeline-model-parallel size: 1 
setting global batch size to 1
accumulate and all-reduce gradients in fp32 for bfloat16 data type.
using torch.bfloat16 for parameters ...
------------------------ arguments ------------------------
  accumulate_allreduce_grads_in_fp32 .............. True
  adam_beta1 ...................................... 0.9
  adam_beta2 ...................................... 0.999
  adam_eps ........................................ 1e-08
  adlr_autoresume ................................. False
  adlr_autoresume_interval ........................ 1000
  apply_query_key_layer_scaling ................... True
  apply_residual_connection_post_layernorm ........ False
  async_tensor_model_parallel_allreduce ........... True
  attention_dropout ............................... 0.1
  attention_softmax_in_fp32 ....................... False
  barrier_with_L1_time ............................ True
  bert_load ....................................... None
  bf16 ............................................ True
  bias_dropout_fusion ............................. False
  bias_gelu_fusion ................................ False
  biencoder_projection_dim ........................ 0
  biencoder_shared_query_context_model ............ False
  block_data_path ................................. None
  classes_fraction ................................ 1.0
  clip_grad ....................................... 1.0
  consumed_train_samples .......................... 0
  consumed_valid_samples .......................... 0
  data_impl ....................................... infer
  data_parallel_random_init ....................... False
  data_parallel_size .............................. 1
  data_path ....................................... None
  data_per_class_fraction ......................... 1.0
  data_sharding ................................... True
  dataloader_type ................................. single
  DDP_impl ........................................ local
  decoder_num_layers .............................. None
  decoder_seq_length .............................. None
  dino_bottleneck_size ............................ 256
  dino_freeze_last_layer .......................... 1
  dino_head_hidden_size ........................... 2048
  dino_local_crops_number ......................... 10
  dino_local_img_size ............................. 96
  dino_norm_last_layer ............................ False
  dino_teacher_temp ............................... 0.07
  dino_warmup_teacher_temp ........................ 0.04
  dino_warmup_teacher_temp_epochs ................. 30
  distribute_saved_activations .................... False
  distributed_backend ............................. nccl
  embedding_path .................................. None
  empty_unused_memory_level ....................... 0
  encoder_num_layers .............................. 32
  encoder_seq_length .............................. 4096
  end_weight_decay ................................ 0.01
  eod_mask_loss ................................... False
  eval_interval ................................... 1000
  eval_iters ...................................... 100
  evidence_data_path .............................. None
  exit_duration_in_mins ........................... None
  exit_interval ................................... None
  exit_signal_handler ............................. False
  ffn_hidden_size ................................. 11008
  finetune ........................................ False
  fp16 ............................................ False
  fp16_lm_cross_entropy ........................... False
  fp32_residual_connection ........................ False
  fp8_amax_compute_algo ........................... most_recent
  fp8_amax_history_len ............................ 1
  fp8_e4m3 ........................................ False
  fp8_hybrid ...................................... False
  fp8_interval .................................... 1
  fp8_margin ...................................... 0
  fp8_wgrad ....................................... True
  global_batch_size ............................... 1
  glu_activation .................................. swiglu
  gradient_accumulation_fusion .................... True
  head_lr_mult .................................... 1.0
  hidden_dropout .................................. 0.1
  hidden_size ..................................... 4096
  hysteresis ...................................... 2
  ict_head_size ................................... None
  ict_load ........................................ None
  img_h ........................................... 224
  img_w ........................................... 224
  indexer_batch_size .............................. 128
  indexer_log_interval ............................ 1000
  inference_batch_times_seqlen_threshold .......... 512
  init_method_std ................................. 0.02
  init_method_xavier_uniform ...................... False
  initial_loss_scale .............................. 4294967296
  iter_per_epoch .................................. 1250
  kv_channels ..................................... 128
  layernorm_epsilon ............................... 1e-05
  lima_dropout .................................... False
  load ............................................ None
  local_rank ...................................... None
  log_batch_size_to_tensorboard ................... False
  log_interval .................................... 100
  log_memory_to_tensorboard ....................... False
  log_num_zeros_in_grad ........................... False
  log_params_norm ................................. False
  log_timers_to_tensorboard ....................... False
  log_validation_ppl_to_tensorboard ............... False
  log_world_size_to_tensorboard ................... False
  loss_scale ...................................... None
  loss_scale_window ............................... 1000
  lr .............................................. None
  lr_decay_iters .................................. None
  lr_decay_samples ................................ None
  lr_decay_style .................................. linear
  lr_warmup_fraction .............................. None
  lr_warmup_iters ................................. 0
  lr_warmup_samples ............................... 0
  make_vocab_size_divisible_by .................... 128
  mask_prob ....................................... 0.15
  masked_softmax_fusion ........................... False
  max_position_embeddings ......................... 4096
  max_tokens_to_oom ............................... 12000
  merge_file ...................................... None
  metrics ......................................... []
  micro_batch_size ................................ 1
  min_loss_scale .................................. 1.0
  min_lr .......................................... 0.0
  mmap_warmup ..................................... False
  new_tokens ...................................... True
  no_load_optim ................................... True
  no_load_rng ..................................... True
  no_persist_layer_norm ........................... False
  no_save_optim ................................... True
  no_save_rng ..................................... True
  num_attention_heads ............................. 32
  num_attention_heads_kv .......................... 32
  num_channels .................................... 3
  num_classes ..................................... 1000
  num_layers ...................................... 32
  num_layers_per_virtual_pipeline_stage ........... None
  num_workers ..................................... 2
  onnx_safe ....................................... None
  optimizer ....................................... adam
  override_opt_param_scheduler .................... False
  parallel_attn ................................... False
  parallel_layernorm .............................. False
  params_dtype .................................... torch.bfloat16
  patch_dim ....................................... 16
  perform_initialization .......................... False
  pipeline_model_parallel_size .................... 1
  pipeline_model_parallel_split_rank .............. None
  position_embedding_type ......................... PositionEmbeddingType.rotary
  query_in_block_prob ............................. 0.1
  rampup_batch_size ............................... None
  rank ............................................ 0
  recompute_granularity ........................... None
  recompute_method ................................ None
  recompute_num_layers ............................ 1
  reset_attention_mask ............................ False
  reset_position_ids .............................. False
  retriever_report_topk_accuracies ................ []
  retriever_score_scaling ......................... False
  retriever_seq_length ............................ 256
  rope_scaling_factor ............................. 1.0
  rope_theta ...................................... 10000.0
  sample_rate ..................................... 1.0
  save ............................................ ./model_sharded
  save_interval ................................... 1
  scalar_loss_mask ................................ 0.0
  scatter_gather_tensors_in_pipeline .............. True
  seed ............................................ 1234
  seq_length ...................................... 4096
  sequence_parallel ............................... False
  sgd_momentum .................................... 0.9
  short_seq_prob .................................. 0.1
  skip_iters ...................................... []
  split ........................................... 969, 30, 1
  standalone_embedding_stage ...................... False
  start_weight_decay .............................. 0.01
  tensor_model_parallel_size ...................... 4
  tensorboard_dir ................................. None
  tensorboard_log_interval ........................ 1
  tensorboard_queue_size .......................... 1000
  test_data_path .................................. None
  tie_embed_logits ................................ False
  timing_log_level ................................ 0
  timing_log_option ............................... minmax
  titles_data_path ................................ None
  tokenizer_model ................................. None
  tokenizer_type .................................. SentencePieceTokenizer
  train_data_path ................................. None
  train_iters ..................................... None
  train_samples ................................... None
  transformer_impl ................................ local
  transformer_pipeline_model_parallel_size ........ 1
  use_bias ........................................ False
  use_checkpoint_args ............................. False
  use_checkpoint_opt_param_scheduler .............. False
  use_contiguous_buffers_in_local_ddp ............. True
  use_cpu_initialization .......................... True
  use_distributed_optimizer ....................... False
  use_flash_attn .................................. False
  use_one_sent_docs ............................... False
  use_post_ln ..................................... False
  use_ring_exchange_p2p ........................... False
  use_rms_norm .................................... True
  valid_data_path ................................. None
  variable_seq_lengths ............................ False
  virtual_pipeline_model_parallel_size ............ None
  vocab_extra_ids ................................. 0
  vocab_extra_ids_list ............................ None
  vocab_file ...................................... None
  wandb_api_key ................................... None
  wandb_entity .................................... meditron
  wandb_id ........................................ None
  wandb_logger .................................... False
  wandb_project ................................... None
  wandb_resume .................................... False
  weight_decay .................................... 0.01
  weight_decay_incr_style ......................... constant
  world_size ...................................... 4
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 1
Setting consumed_train_samples to 0 and consumed_valid_samples to 0
sending embeddings
sending lm_head
Detected CUDA files, patching ldflags
Emitting ninja build file /epfllm/Megatron-LLM/megatron/fused_kernels/build/build.ninja...
sending transformer layer 0
Building extension module fused_mix_prec_layer_norm_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_mix_prec_layer_norm_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /epfllm/Megatron-LLM/megatron/fused_kernels/build/build.ninja...
Building extension module fused_dense_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_dense_cuda...
Bus error (core dumped)

Thank you for your interest in our project.

The Apex compilation warnings are expected; I have seen these since the beginning.

The warnings.warn("Llama is not intended to use dropout") warnings are also fine. We should probably turn these off.

I can replicate your problem when following the docs as written (also using a single node with 8x A100 80GB).

When I invoke docker with the additional arguments

--shm-size=128gb \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
--memory 480G

however, it runs as expected. Please try something like this.
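
For reference, a full invocation might look something like the sketch below (the image name and host mount path are placeholders, substitute your own):

# Sketch only: <megatron-llm-image> and the host path are placeholders.
docker run --gpus all -it --rm \
    --shm-size=128gb \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    --memory 480G \
    -v /path/to/Megatron-LLM:/epfllm/Megatron-LLM \
    <megatron-llm-image> bash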

@AleHD please can you add this, or at least a mention that nontrivial memory is needed to shard the weights, to the "Getting Started" section? Thanks!

Thank you @kylematoba, that solved it for me. I managed to shard the model, but ran into a different issue during training.

Traceback (most recent call last):
  File "/epfllm/./Megatron-LLM/finetune.py", line 249, in <module>
    pretrain(args, data_provider, model_provider,  ModelType.encoder_or_decoder,
  File "/epfllm/Megatron-LLM/megatron/training.py", line 138, in pretrain
    iteration = _train(args,
  File "/epfllm/Megatron-LLM/megatron/training.py", line 678, in _train
    train_step(forward_step_func,
  File "/epfllm/Megatron-LLM/megatron/training.py", line 411, in train_step
    losses_reduced = forward_backward_func(
  File "/epfllm/Megatron-LLM/megatron/schedules.py", line 234, in forward_backward_no_pipelining
    output_tensor = forward_step(forward_step_func, data_iterator,
  File "/epfllm/Megatron-LLM/megatron/schedules.py", line 117, in forward_step
    output_tensor, loss_func = forward_step_func(data_iterator, model)
  File "/epfllm/./Megatron-LLM/finetune.py", line 213, in forward_step
    output_tensor = model(tokens, position_ids, attention_mask,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/epfllm/Megatron-LLM/megatron/model/distributed.py", line 58, in forward
    return self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/epfllm/Megatron-LLM/megatron/model/module.py", line 186, in forward
    outputs = self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/epfllm/Megatron-LLM/megatron/model/gpt_model.py", line 87, in forward
    lm_output = self.language_model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/epfllm/Megatron-LLM/megatron/model/language_model.py", line 512, in forward
    encoder_output = self.encoder(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/epfllm/Megatron-LLM/megatron/model/transformer.py", line 1239, in forward
    hidden_states = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/epfllm/Megatron-LLM/megatron/model/transformer.py", line 802, in forward
    mlp_output, mlp_bias = self.mlp(layernorm_output)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/epfllm/Megatron-LLM/megatron/model/transformer.py", line 131, in forward
    bias_gelu_impl(intermediate_parallel, bias_parallel)
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/epfllm/Megatron-LLM/megatron/model/fused_bias_gelu.py", line 35, in forward
    return bias_gelu(bias, input)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "/epfllm/Megatron-LLM/megatron/model/fused_bias_gelu.py", line 16, in fallback_function
@torch.jit.script
def bias_gelu(bias, y):
    x = bias + y
        ~~~~~~~~ <--- HERE
    return  x * 0.5 * (1.0 + torch.tanh(0.79788456 * x * (1 + 0.044715 * x * x)))
RuntimeError: Expected a proper Tensor but got None (or an undefined Tensor in C++) for argument #0 'self'

Hi, I'm guessing that it's an OOM that's obfuscated by the JIT-ing. In cases like this I can usually recommend commenting out the @torch.jit.script decorator to get a more helpful stack trace.
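
For example, as a temporary local edit in megatron/model/fused_bias_gelu.py (the function quoted in your traceback), something like:

# Temporarily disable TorchScript compilation so the underlying Python
# error surfaces with a readable stack trace; restore the decorator afterwards.
# @torch.jit.script
def bias_gelu(bias, y):
    x = bias + y
    return x * 0.5 * (1.0 + torch.tanh(0.79788456 * x * (1 + 0.044715 * x * x)))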

As far as I can see, you've not reported what sort of model you are trying to train. Did you look at https://epfllm.github.io/Megatron-LLM/guide/faq.html#what-are-the-basic-hardware-requirements? Only the smallest models can fit into 8x A100 80GB.

Let me try commenting out the scripting.

I am following the getting started guide, so it's Llama 2 7B, and I have 8x A100 80GB GPUs.

This is my command:

LOG_ARGS="--log_interval 1 --save_interval 100 --eval_interval 50"
TRAIN_ARGS="--train_iters 500 --lr_decay_style cosine --lr_warmup_iters 50 --lr 3e-4 --min_lr 1e-6"
DISTRIBUTED_ARGS="--nproc_per_node 8 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 8000"
torchrun $DISTRIBUTED_ARGS ${MEGATRON_PATH}/finetune.py \
	--tensor_model_parallel_size 4 \
	--pipeline_model_parallel_size 1 \
	--load ${MODEL_PATH}_sharded \
	--save ${MODEL_PATH}_sharded \
	--tensorboard_dir ${MODEL_PATH}_sharded \
	--data_path ${DATASET_PATH}/megatron_text_document \
	--model_name llama2 \
	--tokenizer_type SentencePieceTokenizer \
	--vocab_file=${MODEL_PATH}/tokenizer.model \
	--bf16 \
	--use_flash_attn \
	--micro_batch_size 5 \
	--global_batch_size 1000 \
	--sequence_parallel \
	--recompute_granularity selective \
	--use_checkpoint_args \
	$COMMON_ARGS $LOG_ARGS $TRAIN_ARGS $LLAMA_ARGS

The error is not really more helpful...

TypeError: unsupported operand type(s) for +: 'NoneType' and 'Tensor'
    encoder_output = self.encoder(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/epfllm/Megatron-LLM/megatron/model/transformer.py", line 1239, in forward
    hidden_states = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/epfllm/Megatron-LLM/megatron/model/transformer.py", line 802, in forward
    mlp_output, mlp_bias = self.mlp(layernorm_output)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/epfllm/Megatron-LLM/megatron/model/transformer.py", line 131, in forward
    bias_gelu_impl(intermediate_parallel, bias_parallel)
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/epfllm/Megatron-LLM/megatron/model/fused_bias_gelu.py", line 35, in forward
    return bias_gelu(bias, input)
  File "/epfllm/Megatron-LLM/megatron/model/fused_bias_gelu.py", line 16, in bias_gelu
    x = bias + y
TypeError: unsupported operand type(s) for +: 'NoneType' and 'Tensor'

Should the getting started guide (https://epfllm.github.io/Megatron-LLM/guide/getting_started.html) work end-to-end?

Hi, thanks for that.

I'm pretty sure the problem is something that we overlooked early on: runs without --no_bias_gelu_fusion don't work. Could you please add that argument (as is done in the docs) and let me know how you get on?
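
Relative to the command you posted above, that just means adding one line, e.g.:

torchrun $DISTRIBUTED_ARGS ${MEGATRON_PATH}/finetune.py \
	--no_bias_gelu_fusion \
	--tensor_model_parallel_size 4 \
	... (rest of the arguments unchanged)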

I'll make sure this bug gets investigated in any case.

Adding --no_bias_gelu_fusion solved the issue and it is training. Thank you for your help! I will play with it more and hopefully publish a blog post on it!

Thanks @philschmid. I'll close this and we'll fix the bug I mention above shortly.