epfLLM/Megatron-LLM

llama2-7B AssertionError: padded_vocab_size value from checkpoint (32000) is not equal to the input argument value (32256) #81

yushengsu-thu opened this issue · 2 comments

I followed the guide here and used the same arguments:
https://epfllm.github.io/Megatron-LLM/guide/getting_started.html

When I run training with:

LOG_ARGS="--log_interval 1 --save_interval 100 --eval_interval 50"
TRAIN_ARGS="--train_iters 6500 --lr_decay_style cosine --lr_warmup_iters 650 --lr 2e-5 --min_lr 2e-6"
DISTRIBUTED_ARGS="--nproc_per_node $NUMBER_OF_GPUS_for_EACH_NODE --nnodes $NUMBER_OF_NODES --node_rank $NODE_ID --master_addr localhost --master_port 6000"
torchrun $DISTRIBUTED_ARGS ../finetune.py \
    --tensor_model_parallel_size 2 \
    --pipeline_model_parallel_size 1 \
    --load $LLM_LOAD_DIR \
    --save $LLM_SAVE_DIR \
    --tensorboard_dir $TENSORBOARD_DIR \
    --data_path $DATA_DIR \
    --model_name llama2 \
    --tokenizer_type SentencePieceTokenizer \
    --vocab_file $VOCAB_PREFIX/tokenizer.model \
    --bf16 \
    --use_flash_attn \
    --micro_batch_size 8 \
    --global_batch_size 64 \
    --sequence_parallel \
    --recompute_granularity selective \
    --use_checkpoint_args \
    --data_type instruction \
    --variable_seq_lengths \
    --vocab_extra_ids_list "<|im_start|>,<|im_end|>" \
    $COMMON_ARGS $LOG_ARGS $TRAIN_ARGS $LLAMA_ARGS

I encountered the following problem:
"llama2-7B AssertionError: padded_vocab_size value from checkpoint (32000) is not equal to the input argument value (32256) #81"

When I sharded the model, I used --true_vocab_size 32000 as described in the tutorial link above (I also tried removing it), but I still encounter the same error.

VOCAB_SIZE=32000
python3 ../tools/checkpoint_util.py \
    --target_tensor_parallel_size 2 \
    --target_pipeline_parallel_size 1 \
    --load_dir $LLM_LOAD_DIR \
    --save_dir $LLM_SAVE_SHARDED_DIR \
    --model_type llama2 \
    --true_vocab_size $VOCAB_SIZE \
    --bf16
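
As an aside, to double-check which padded_vocab_size the sharded checkpoint actually carries, something along these lines should work. The path layout (release/mp_rank_00/model_optim_rng.pt) and the 'args' entry are my assumption based on the usual Megatron checkpoint convention; adjust to whatever checkpoint_util.py actually wrote:

python3 -c "import torch; print(torch.load('$LLM_SAVE_SHARDED_DIR/release/mp_rank_00/model_optim_rng.pt', map_location='cpu')['args'].padded_vocab_size)"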

I have the same problem

add --no_new_tokens to your args
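
For anyone else landing here: the flag goes next to the other finetune.py arguments, for example (only the relevant lines shown, everything else as in the command above). If I understand it correctly, it stops the run from registering the extra ids as brand-new tokens, so the padded vocab size stays at the 32000 stored in the checkpoint.

# same invocation as above; remaining arguments omitted here for brevity
torchrun $DISTRIBUTED_ARGS ../finetune.py \
    --vocab_extra_ids_list "<|im_start|>,<|im_end|>" \
    --no_new_tokens \
    $COMMON_ARGS $LOG_ARGS $TRAIN_ARGS $LLAMA_ARGS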