llama2-7B AssertionError: padded_vocab_size value from checkpoint (32000) is not equal to the input argument value (32256)
13416157913 opened this issue · 1 comment
Hello, when I run fine-tuning of llama2-7B I hit the following error:
Traceback (most recent call last):
File "/home/dengkaibiao/Megatron-LLM/finetune.py", line 261, in
pretrain(args, data_provider, model_provider, ModelType.encoder_or_decoder,
File "/home/dengkaibiao/Megatron-LLM/megatron/training.py", line 108, in pretrain
model, optimizer, opt_param_scheduler = _setup_model_and_optimizer(
File "/home/dengkaibiao/Megatron-LLM/megatron/training.py", line 371, in _setup_model_and_optimizer
args.iteration = load_checkpoint(model, optimizer, opt_param_scheduler)
File "/home/dengkaibiao/Megatron-LLM/megatron/checkpointing.py", line 603, in load_checkpoint
check_checkpoint_args(checkpoint_args)
File "/home/dengkaibiao/Megatron-LLM/megatron/checkpointing.py", line 57, in check_checkpoint_args
_compare('padded_vocab_size')
File "/home/dengkaibiao/Megatron-LLM/megatron/checkpointing.py", line 49, in _compare
assert checkpoint_value == args_value, error_message
AssertionError: padded_vocab_size value from checkpoint (32000) is not equal to the input argument value (32256).
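In case it helps, this is how I can dump the args stored in the sharded checkpoint to see which padded_vocab_size it was saved with. This is only a sketch: the path pattern (<save_dir>/*/mp_rank_00/model_optim_rng.pt) and the 'args' key are what I would expect from the standard Megatron checkpoint layout, so they may need adjusting.

# Debugging sketch (paths assumed, adjust to your checkpoint directory):
python -c "
import glob, torch
path = sorted(glob.glob('/home/dengkaibiao/Megatron-LLM-sharded-weights-7B-TP2/*/mp_rank_00/model_optim_rng.pt'))[0]
args = torch.load(path, map_location='cpu')['args']
print('padded_vocab_size:', args.padded_vocab_size)
print('make_vocab_size_divisible_by:', args.make_vocab_size_divisible_by)
print('tensor_model_parallel_size:', args.tensor_model_parallel_size)
"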
================================================================================
This is my script:
export CUDA_DEVICE_MAX_CONNECTIONS=1
LOG_ARGS="--log_interval 1 --save_interval 10 --eval_interval 10"
TRAIN_ARGS="--train_iters 10 --lr_decay_style cosine --lr_warmup_iters 5 --lr 3e-4 --min_lr 1e-6"
DISTRIBUTED_ARGS="--nproc_per_node 2 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 8000"
COMMON_ARGS="--num_layers 32 --num_attention_heads 32 --seq_length 4096 --max_position_embeddings 4096 --ffn_hidden_size 11008
--hidden_dropout 0.0 --position_embedding_type rotary --no_bias_gelu_fusion
--no_bias_dropout_fusion --use_checkpoint_args
--attention_dropout 0.0 --adam_beta1 0.9 --adam_beta2 0.95 --adam_eps 1e-5
--layernorm_epsilon 1e-6
--weight_decay 0.1 --sequence_parallel --recompute_activations --recompute_granularity selective
--log_timers_to_tensorboard
--rope_scaling_factor 1.0"
#--vocab_file=/home/dengkaibiao/Llama-2-7b-hf/tokenizer.model
export CUDA_VISIBLE_DEVICES=1,2
torchrun $DISTRIBUTED_ARGS finetune.py \
    --tensor_model_parallel_size 2 \
    --pipeline_model_parallel_size 1 \
    --load /home/dengkaibiao/Megatron-LLM-sharded-weights-7B-TP2 \
    --save /home/dengkaibiao/Megatron-LLM-sharded-weights-7B-TP2 \
    --tensorboard_dir /home/dengkaibiao/Megatron-LLM-sharded-weights-7B-TP2/tensorboard/ \
    --data_path /home/dengkaibiao/Megatron-LLM/corpus_indexed/china_text_document \
    --split 100,0,0 \
    --model_name llama2 \
    --tokenizer_type SentencePieceTokenizer \
    --vocab_file=/home/dengkaibiao/Llama-2-7b-hf/tokenizer.model \
    --make_vocab_size_divisible_by 1 \
    --bf16 \
    --global_batch_size 128 \
    --micro_batch_size 1 \
    --use_flash_attn \
    $COMMON_ARGS $LOG_ARGS $TRAIN_ARGS
The padded_vocab_size might have been modified when sharding the weights. Did you specify --true_vocab_size? What command did you use to shard the weights?
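If the shards were produced with the repo's weight-sharding tool, one fix is to re-shard while pinning the true vocabulary size and then fine-tune with matching vocab settings. The command below is only a sketch from memory of the Megatron-LLM conversion utility; the script name (tools/checkpoint_util.py), the flag names, and --true_vocab_size 32000 should all be double-checked against the repo's getting-started guide.

# Hypothetical re-sharding command; verify the script and flags against the repo docs.
python tools/checkpoint_util.py \
    --target_tensor_parallel_size 2 \
    --target_pipeline_parallel_size 1 \
    --load_dir /path/to/unsharded/llama2-7b-megatron-weights \
    --save_dir /home/dengkaibiao/Megatron-LLM-sharded-weights-7B-TP2 \
    --model_type llama2 \
    --true_vocab_size 32000 \
    --bf16

If the vocab size pinned at sharding time matches what finetune.py recomputes from the tokenizer and --make_vocab_size_divisible_by, the padded_vocab_size check should pass.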