Getting started "shard" model not working
First of all, thank you for creating this project! It looks very exciting and interesting due to its close Hugging Face integration.
I was very curious and wanted to give it a try by following the Getting Started guide in the documentation, but I ran into an error during the "Model Sharding" step, resulting in a Bus error (core dumped).
I am running on a single node with 8x A100 80GB and 1 TB of memory. I followed the exact same steps in the guide and used the container.
Below is the full error stack in case it's helpful. It includes quite a lot of odd C/C++ warnings at the beginning. I installed the package with
cd Megatron-LLM
pip install -r requirements.txt
cd megatron/data/
make
cd ../../
in the container.
Error Stack
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu: In instantiation of ‘void HostLayerNormGradient(const V*, const U*, const U*, at::Tensor*, int, int, const V*, const V*, double, T*, V*, V*) [with T = float; U = float; V = c10::Half]’:
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:800:95: required from here
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:138: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
737 | cuComputePartGradGammaBeta<<<blocks2, threads2, nshared2, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:210: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
737 | cuComputePartGradGammaBeta<<<blocks2, threads2, nshared2, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:247: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
737 | cuComputePartGradGammaBeta<<<blocks2, threads2, nshared2, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:750:137: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
750 | cuComputeGradGammaBeta<<<blocks3, threads3, nshared3, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:750:174: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
750 | cuComputeGradGammaBeta<<<blocks3, threads3, nshared3, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:768:129: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
768 | cuComputeGradInput<<<blocks1, threads1, nshared, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu: In instantiation of ‘void HostLayerNormGradient(const V*, const U*, const U*, at::Tensor*, int, int, const V*, const V*, double, T*, V*, V*) [with T = float; U = float; V = c10::BFloat16]’:
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:800:103: required from here
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:138: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
737 | cuComputePartGradGammaBeta<<<blocks2, threads2, nshared2, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:210: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
737 | cuComputePartGradGammaBeta<<<blocks2, threads2, nshared2, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:247: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
737 | cuComputePartGradGammaBeta<<<blocks2, threads2, nshared2, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:750:137: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
750 | cuComputeGradGammaBeta<<<blocks3, threads3, nshared3, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:750:174: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
750 | cuComputeGradGammaBeta<<<blocks3, threads3, nshared3, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:768:129: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
768 | cuComputeGradInput<<<blocks1, threads1, nshared, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu: In instantiation of ‘void HostLayerNormGradient(const V*, const U*, const U*, at::Tensor*, int, int, const V*, const V*, double, T*, V*, V*) [with T = c10::Half; U = float; V = c10::Half]’:
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:800:127: required from here
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:138: warning: ‘T* at::Tensor::data() const [with T = c10::Half]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
737 | cuComputePartGradGammaBeta<<<blocks2, threads2, nshared2, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:210: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
737 | cuComputePartGradGammaBeta<<<blocks2, threads2, nshared2, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:247: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
737 | cuComputePartGradGammaBeta<<<blocks2, threads2, nshared2, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:750:137: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
750 | cuComputeGradGammaBeta<<<blocks3, threads3, nshared3, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:750:174: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
750 | cuComputeGradGammaBeta<<<blocks3, threads3, nshared3, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:768:129: warning: ‘T* at::Tensor::data() const [with T = c10::Half]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
768 | cuComputeGradInput<<<blocks1, threads1, nshared, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu: In instantiation of ‘void HostLayerNormGradient(const V*, const U*, const U*, at::Tensor*, int, int, const V*, const V*, double, T*, V*, V*) [with T = c10::BFloat16; U = float; V = c10::BFloat16]’:
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:800:138: required from here
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:138: warning: ‘T* at::Tensor::data() const [with T = c10::BFloat16]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
737 | cuComputePartGradGammaBeta<<<blocks2, threads2, nshared2, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:210: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
737 | cuComputePartGradGammaBeta<<<blocks2, threads2, nshared2, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:737:247: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
737 | cuComputePartGradGammaBeta<<<blocks2, threads2, nshared2, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:750:137: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
750 | cuComputeGradGammaBeta<<<blocks3, threads3, nshared3, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:750:174: warning: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
750 | cuComputeGradGammaBeta<<<blocks3, threads3, nshared3, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
/epfllm/Megatron-LLM/megatron/fused_kernels/layer_norm_cuda_kernel.cu:768:129: warning: ‘T* at::Tensor::data() const [with T = c10::BFloat16]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Wdeprecated-declarations]
768 | cuComputeGradInput<<<blocks1, threads1, nshared, stream>>>(
| ^
/usr/local/lib/python3.10/dist-packages/torch/include/ATen/core/TensorBody.h:245:1: note: declared here
245 | T * data() const {
| ^ ~~
[3/3] c++ layer_norm_cuda.o layer_norm_cuda_kernel.cuda.o -shared -L/usr/local/lib/python3.10/dist-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o fused_mix_prec_layer_norm_cuda.so
Loading extension module fused_mix_prec_layer_norm_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /epfllm/Megatron-LLM/megatron/fused_kernels/build/build.ninja...
Building extension module fused_dense_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] c++ -MMD -MF fused_weight_gradient_dense.o.d -DTORCH_EXTENSION_NAME=fused_dense_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=1 -fPIC -std=c++17 -O3 -c /epfllm/Megatron-LLM/megatron/fused_kernels/fused_weight_gradient_dense.cpp -o fused_weight_gradient_dense.o
[2/3] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_dense_cuda -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -gencode arch=compute_70,code=sm_70 --use_fast_math -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -gencode arch=compute_80,code=sm_80 -std=c++17 -c /epfllm/Megatron-LLM/megatron/fused_kernels/fused_weight_gradient_dense.cu -o fused_weight_gradient_dense.cuda.o
[3/3] c++ fused_weight_gradient_dense.o fused_weight_gradient_dense.cuda.o -shared -L/usr/local/lib/python3.10/dist-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o fused_dense_cuda.so
Loading extension module fused_dense_cuda...
Building model ...
/epfllm/Megatron-LLM/megatron/model/llama_model.py:38: UserWarning: Llama is not intended to use dropout
warnings.warn( "Llama is not intended to use dropout")
/epfllm/Megatron-LLM/megatron/model/llama_model.py:40: UserWarning: Llama is not intended to use dropout
warnings.warn( "Llama is not intended to use dropout")
loading release checkpoint from ./model
checkpoint version 3.0
successfully loaded checkpoint from ./model at iteration 0
using world size: 4, data-parallel-size: 1, tensor-model-parallel size: 4, pipeline-model-parallel size: 1
setting global batch size to 1
accumulate and all-reduce gradients in fp32 for bfloat16 data type.
using torch.bfloat16 for parameters ...
------------------------ arguments ------------------------
accumulate_allreduce_grads_in_fp32 .............. True
adam_beta1 ...................................... 0.9
adam_beta2 ...................................... 0.999
adam_eps ........................................ 1e-08
adlr_autoresume ................................. False
adlr_autoresume_interval ........................ 1000
apply_query_key_layer_scaling ................... True
apply_residual_connection_post_layernorm ........ False
async_tensor_model_parallel_allreduce ........... True
attention_dropout ............................... 0.1
attention_softmax_in_fp32 ....................... False
barrier_with_L1_time ............................ True
bert_load ....................................... None
bf16 ............................................ True
bias_dropout_fusion ............................. False
bias_gelu_fusion ................................ False
biencoder_projection_dim ........................ 0
biencoder_shared_query_context_model ............ False
block_data_path ................................. None
classes_fraction ................................ 1.0
clip_grad ....................................... 1.0
consumed_train_samples .......................... 0
consumed_valid_samples .......................... 0
data_impl ....................................... infer
data_parallel_random_init ....................... False
data_parallel_size .............................. 1
data_path ....................................... None
data_per_class_fraction ......................... 1.0
data_sharding ................................... True
dataloader_type ................................. single
DDP_impl ........................................ local
decoder_num_layers .............................. None
decoder_seq_length .............................. None
dino_bottleneck_size ............................ 256
dino_freeze_last_layer .......................... 1
dino_head_hidden_size ........................... 2048
dino_local_crops_number ......................... 10
dino_local_img_size ............................. 96
dino_norm_last_layer ............................ False
dino_teacher_temp ............................... 0.07
dino_warmup_teacher_temp ........................ 0.04
dino_warmup_teacher_temp_epochs ................. 30
distribute_saved_activations .................... False
distributed_backend ............................. nccl
embedding_path .................................. None
empty_unused_memory_level ....................... 0
encoder_num_layers .............................. 32
encoder_seq_length .............................. 4096
end_weight_decay ................................ 0.01
eod_mask_loss ................................... False
eval_interval ................................... 1000
eval_iters ...................................... 100
evidence_data_path .............................. None
exit_duration_in_mins ........................... None
exit_interval ................................... None
exit_signal_handler ............................. False
ffn_hidden_size ................................. 11008
finetune ........................................ False
fp16 ............................................ False
fp16_lm_cross_entropy ........................... False
fp32_residual_connection ........................ False
fp8_amax_compute_algo ........................... most_recent
fp8_amax_history_len ............................ 1
fp8_e4m3 ........................................ False
fp8_hybrid ...................................... False
fp8_interval .................................... 1
fp8_margin ...................................... 0
fp8_wgrad ....................................... True
global_batch_size ............................... 1
glu_activation .................................. swiglu
gradient_accumulation_fusion .................... True
head_lr_mult .................................... 1.0
hidden_dropout .................................. 0.1
hidden_size ..................................... 4096
hysteresis ...................................... 2
ict_head_size ................................... None
ict_load ........................................ None
img_h ........................................... 224
img_w ........................................... 224
indexer_batch_size .............................. 128
indexer_log_interval ............................ 1000
inference_batch_times_seqlen_threshold .......... 512
init_method_std ................................. 0.02
init_method_xavier_uniform ...................... False
initial_loss_scale .............................. 4294967296
iter_per_epoch .................................. 1250
kv_channels ..................................... 128
layernorm_epsilon ............................... 1e-05
lima_dropout .................................... False
load ............................................ None
local_rank ...................................... None
log_batch_size_to_tensorboard ................... False
log_interval .................................... 100
log_memory_to_tensorboard ....................... False
log_num_zeros_in_grad ........................... False
log_params_norm ................................. False
log_timers_to_tensorboard ....................... False
log_validation_ppl_to_tensorboard ............... False
log_world_size_to_tensorboard ................... False
loss_scale ...................................... None
loss_scale_window ............................... 1000
lr .............................................. None
lr_decay_iters .................................. None
lr_decay_samples ................................ None
lr_decay_style .................................. linear
lr_warmup_fraction .............................. None
lr_warmup_iters ................................. 0
lr_warmup_samples ............................... 0
make_vocab_size_divisible_by .................... 128
mask_prob ....................................... 0.15
masked_softmax_fusion ........................... False
max_position_embeddings ......................... 4096
max_tokens_to_oom ............................... 12000
merge_file ...................................... None
metrics ......................................... []
micro_batch_size ................................ 1
min_loss_scale .................................. 1.0
min_lr .......................................... 0.0
mmap_warmup ..................................... False
new_tokens ...................................... True
no_load_optim ................................... True
no_load_rng ..................................... True
no_persist_layer_norm ........................... False
no_save_optim ................................... True
no_save_rng ..................................... True
num_attention_heads ............................. 32
num_attention_heads_kv .......................... 32
num_channels .................................... 3
num_classes ..................................... 1000
num_layers ...................................... 32
num_layers_per_virtual_pipeline_stage ........... None
num_workers ..................................... 2
onnx_safe ....................................... None
optimizer ....................................... adam
override_opt_param_scheduler .................... False
parallel_attn ................................... False
parallel_layernorm .............................. False
params_dtype .................................... torch.bfloat16
patch_dim ....................................... 16
perform_initialization .......................... False
pipeline_model_parallel_size .................... 1
pipeline_model_parallel_split_rank .............. None
position_embedding_type ......................... PositionEmbeddingType.rotary
query_in_block_prob ............................. 0.1
rampup_batch_size ............................... None
rank ............................................ 0
recompute_granularity ........................... None
recompute_method ................................ None
recompute_num_layers ............................ 1
reset_attention_mask ............................ False
reset_position_ids .............................. False
retriever_report_topk_accuracies ................ []
retriever_score_scaling ......................... False
retriever_seq_length ............................ 256
rope_scaling_factor ............................. 1.0
rope_theta ...................................... 10000.0
sample_rate ..................................... 1.0
save ............................................ ./model_sharded
save_interval ................................... 1
scalar_loss_mask ................................ 0.0
scatter_gather_tensors_in_pipeline .............. True
seed ............................................ 1234
seq_length ...................................... 4096
sequence_parallel ............................... False
sgd_momentum .................................... 0.9
short_seq_prob .................................. 0.1
skip_iters ...................................... []
split ........................................... 969, 30, 1
standalone_embedding_stage ...................... False
start_weight_decay .............................. 0.01
tensor_model_parallel_size ...................... 4
tensorboard_dir ................................. None
tensorboard_log_interval ........................ 1
tensorboard_queue_size .......................... 1000
test_data_path .................................. None
tie_embed_logits ................................ False
timing_log_level ................................ 0
timing_log_option ............................... minmax
titles_data_path ................................ None
tokenizer_model ................................. None
tokenizer_type .................................. SentencePieceTokenizer
train_data_path ................................. None
train_iters ..................................... None
train_samples ................................... None
transformer_impl ................................ local
transformer_pipeline_model_parallel_size ........ 1
use_bias ........................................ False
use_checkpoint_args ............................. False
use_checkpoint_opt_param_scheduler .............. False
use_contiguous_buffers_in_local_ddp ............. True
use_cpu_initialization .......................... True
use_distributed_optimizer ....................... False
use_flash_attn .................................. False
use_one_sent_docs ............................... False
use_post_ln ..................................... False
use_ring_exchange_p2p ........................... False
use_rms_norm .................................... True
valid_data_path ................................. None
variable_seq_lengths ............................ False
virtual_pipeline_model_parallel_size ............ None
vocab_extra_ids ................................. 0
vocab_extra_ids_list ............................ None
vocab_file ...................................... None
wandb_api_key ................................... None
wandb_entity .................................... meditron
wandb_id ........................................ None
wandb_logger .................................... False
wandb_project ................................... None
wandb_resume .................................... False
weight_decay .................................... 0.01
weight_decay_incr_style ......................... constant
world_size ...................................... 4
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 1
Setting consumed_train_samples to 0 and consumed_valid_samples to 0
sending embeddings
sending lm_head
Detected CUDA files, patching ldflags
Emitting ninja build file /epfllm/Megatron-LLM/megatron/fused_kernels/build/build.ninja...
sending transformer layer 0
Building extension module fused_mix_prec_layer_norm_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_mix_prec_layer_norm_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /epfllm/Megatron-LLM/megatron/fused_kernels/build/build.ninja...
Building extension module fused_dense_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_dense_cuda...
Bus error (core dumped)
Thank you for your interest in our project.
The Apex compilation warnings are expected; I have seen these since the beginning.
The warnings.warn("Llama is not intended to use dropout") warnings are also fine. We should probably turn these off.
I can replicate your problem when following the docs as written (also using a single node with 8x A100 80GB).
When I invoke docker with the additional arguments
--shm-size=128gb \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
--memory 480G
however, it runs as expected. Please try something like this.
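For concreteness, a full invocation might look roughly like the sketch below. The image name and mount paths are placeholders (not the exact values from the docs); the four memory-related flags are the ones that mattered here:
# Sketch only: adjust the image name and mount paths to your own setup.
docker run --gpus all -it --rm \
    --shm-size=128gb \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    --memory 480G \
    -v /path/to/Megatron-LLM:/epfllm/Megatron-LLM \
    -v /path/to/checkpoints:/epfllm/checkpoints \
    <your-megatron-llm-image>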
@AleHD please can you add this, or at least a mention that nontrivial memory is needed to shard the weights, to the "Getting Started" section? Thanks!
Thank you @kylematoba, that solved it for me. I managed to shard the model but ran into a different issue during training.
Traceback (most recent call last):
File "/epfllm/./Megatron-LLM/finetune.py", line 249, in <module>
pretrain(args, data_provider, model_provider, ModelType.encoder_or_decoder,
File "/epfllm/Megatron-LLM/megatron/training.py", line 138, in pretrain
iteration = _train(args,
File "/epfllm/Megatron-LLM/megatron/training.py", line 678, in _train
train_step(forward_step_func,
File "/epfllm/Megatron-LLM/megatron/training.py", line 411, in train_step
losses_reduced = forward_backward_func(
File "/epfllm/Megatron-LLM/megatron/schedules.py", line 234, in forward_backward_no_pipelining
output_tensor = forward_step(forward_step_func, data_iterator,
File "/epfllm/Megatron-LLM/megatron/schedules.py", line 117, in forward_step
output_tensor, loss_func = forward_step_func(data_iterator, model)
File "/epfllm/./Megatron-LLM/finetune.py", line 213, in forward_step
output_tensor = model(tokens, position_ids, attention_mask,
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/epfllm/Megatron-LLM/megatron/model/distributed.py", line 58, in forward
return self.module(*inputs, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/epfllm/Megatron-LLM/megatron/model/module.py", line 186, in forward
outputs = self.module(*inputs, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/epfllm/Megatron-LLM/megatron/model/gpt_model.py", line 87, in forward
lm_output = self.language_model(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/epfllm/Megatron-LLM/megatron/model/language_model.py", line 512, in forward
encoder_output = self.encoder(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/epfllm/Megatron-LLM/megatron/model/transformer.py", line 1239, in forward
hidden_states = layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/epfllm/Megatron-LLM/megatron/model/transformer.py", line 802, in forward
mlp_output, mlp_bias = self.mlp(layernorm_output)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/epfllm/Megatron-LLM/megatron/model/transformer.py", line 131, in forward
bias_gelu_impl(intermediate_parallel, bias_parallel)
File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/epfllm/Megatron-LLM/megatron/model/fused_bias_gelu.py", line 35, in forward
return bias_gelu(bias, input)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
File "/epfllm/Megatron-LLM/megatron/model/fused_bias_gelu.py", line 16, in fallback_function
@torch.jit.script
def bias_gelu(bias, y):
x = bias + y
~~~~~~~~ <--- HERE
return x * 0.5 * (1.0 + torch.tanh(0.79788456 * x * (1 + 0.044715 * x * x)))
RuntimeError: Expected a proper Tensor but got None (or an undefined Tensor in C++) for argument #0 'self'
Hi, I'm guessing that it's an OOM that's obfuscated by the JIT-ing. In cases like this I usually recommend commenting out the @torch.jit.script decorator to get a more helpful stack trace.
As far as I can see, you've not reported what sort of model you are trying to train. Did you look at https://epfllm.github.io/Megatron-LLM/guide/faq.html#what-are-the-basic-hardware-requirements? Only the smallest models can fit on 8x A100 80GB.
Let me try commenting out the scripting.
I am following the Getting Started guide, so it's Llama 2 7B, and I have 8x A100 80GB.
That's my command:
LOG_ARGS="--log_interval 1 --save_interval 100 --eval_interval 50"
TRAIN_ARGS="--train_iters 500 --lr_decay_style cosine --lr_warmup_iters 50 --lr 3e-4 --min_lr 1e-6"
DISTRIBUTED_ARGS="--nproc_per_node 8 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 8000"
torchrun $DISTRIBUTED_ARGS ${MEGATRON_PATH}/finetune.py \
--tensor_model_parallel_size 4 \
--pipeline_model_parallel_size 1 \
--load ${MODEL_PATH}_sharded \
--save ${MODEL_PATH}_sharded \
--tensorboard_dir ${MODEL_PATH}_sharded \
--data_path ${DATASET_PATH}/megatron_text_document \
--model_name llama2 \
--tokenizer_type SentencePieceTokenizer \
--vocab_file=${MODEL_PATH}/tokenizer.model \
--bf16 \
--use_flash_attn \
--micro_batch_size 5 \
--global_batch_size 1000 \
--sequence_parallel \
--recompute_granularity selective \
--use_checkpoint_args \
$COMMON_ARGS $LOG_ARGS $TRAIN_ARGS $LLAMA_ARGS
The error is not really more helpful...
TypeError: unsupported operand type(s) for +: 'NoneType' and 'Tensor'
encoder_output = self.encoder(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/epfllm/Megatron-LLM/megatron/model/transformer.py", line 1239, in forward
hidden_states = layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/epfllm/Megatron-LLM/megatron/model/transformer.py", line 802, in forward
mlp_output, mlp_bias = self.mlp(layernorm_output)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/epfllm/Megatron-LLM/megatron/model/transformer.py", line 131, in forward
bias_gelu_impl(intermediate_parallel, bias_parallel)
File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/epfllm/Megatron-LLM/megatron/model/fused_bias_gelu.py", line 35, in forward
return bias_gelu(bias, input)
File "/epfllm/Megatron-LLM/megatron/model/fused_bias_gelu.py", line 16, in bias_gelu
x = bias + y
TypeError: unsupported operand type(s) for +: 'NoneType' and 'Tensor'
Should the Getting Started guide (https://epfllm.github.io/Megatron-LLM/guide/getting_started.html) work end-to-end?
Hi, thanks for that.
I'm pretty sure the problem is something that we overlooked early on: runs without --no_bias_gelu_fusion don't work. Please can you add that argument (as is done in the docs), and let me know how you get on?
I'll make sure this bug gets investigated in any case.
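For reference, a minimal sketch of that change against the command posted above; the flags omitted here stay exactly as before:
# Sketch: same torchrun invocation as above, with bias-GeLU fusion disabled.
# Only a few of the flags are repeated here; keep the rest of the arguments unchanged.
torchrun $DISTRIBUTED_ARGS ${MEGATRON_PATH}/finetune.py \
    --no_bias_gelu_fusion \
    --tensor_model_parallel_size 4 \
    --pipeline_model_parallel_size 1 \
    $COMMON_ARGS $LOG_ARGS $TRAIN_ARGS $LLAMA_ARGS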
Adding --no_bias_gelu_fusion solved the issue and it is training. Thank you for your help! I will play with it more and hopefully publish a blog post on it!
Thanks @philschmid. I'll close this and we'll fix the bug I mention above shortly.