ymcui/Chinese-LLaMA-Alpaca-2

The model's performance is poor when using the merged tokenizer.

adam-mhd94 opened this issue · 5 comments

Check before submitting issues

  • Make sure to pull the latest code, as some issues and bugs have been fixed.
  • I have read the Wiki and FAQ sections AND searched for similar issues, and did not find a similar problem or solution
  • Third-party plugin issues (e.g., llama.cpp, LangChain, text-generation-webui): we recommend checking the corresponding project for solutions

Type of Issue

Model training and fine-tuning

Base Model

Chinese-LLaMA-2 (7B/13B)

Operating System

Linux

Describe your issue in detail

I intend to fine-tune the LLaMA 7B model on non-Chinese data. Training the model on a large dataset with the original LLaMA tokenizer yields good results. However, when I use a tokenizer extended for my language, the loss increases significantly and the model performs very poorly; for example, it keeps repeating a single word or character.

GPUs: 6× T4 (16 GB each)
I am training the model in multi-GPU mode.
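As a sanity check on the merged tokenizer: a common failure mode when swapping tokenizers is a mismatch between the tokenizer's vocabulary size and the model's embedding matrix, so new token IDs either crash the forward pass or map to untrained rows. Below is a minimal standalone sketch of that check in Python, assuming a Hugging Face LLaMA checkpoint and the merged tokenizer at hypothetical local paths (the project's training script is expected to handle the resize itself; this is only a verification aid):

# Hypothetical paths; substitute the actual base model and merged tokenizer.
from transformers import LlamaForCausalLM, LlamaTokenizer

base_model_path = "path/to/llama-2-7b"
merged_tokenizer_path = "path/to/merged_tokenizer"

tokenizer = LlamaTokenizer.from_pretrained(merged_tokenizer_path)
model = LlamaForCausalLM.from_pretrained(base_model_path)

print("tokenizer vocab size:", len(tokenizer))
print("model embedding rows:", model.get_input_embeddings().weight.shape[0])

# If the sizes differ, the embeddings must be resized; the newly added rows are
# randomly initialized, which is why embed_tokens and lm_head are kept trainable
# via modules_to_save in the LoRA settings below.
model.resize_token_embeddings(len(tokenizer))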

Read the wiki (https://github.com/ymcui/Chinese-LLaMA-Alpaca-2/wiki/pt_scripts_zh) carefully before running the script

lr=2e-4
lora_rank=64
lora_alpha=128
lora_trainable="q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj"
modules_to_save="embed_tokens,lm_head"
lora_dropout=0.05

per_device_train_batch_size=1
gradient_accumulation_steps=1
block_size=32

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 torchrun --nnodes 1 --nproc_per_node 6 --master_port 5896 run_clm_pt_with_peft.py \
    --deepspeed ${deepspeed_config_file} \
    --model_name_or_path ${pretrained_model} \
    --tokenizer_name_or_path ${pretrained_model} \
    --dataset_dir ${dataset_dir} \
    --data_cache_dir ${data_cache} \
    --validation_split_percentage 0.001 \
    --per_device_train_batch_size ${per_device_train_batch_size} \
    --do_train \
    --seed $RANDOM \
    --num_train_epochs 1 \
    --lr_scheduler_type cosine \
    --learning_rate ${lr} \
    --warmup_ratio 0.05 \
    --weight_decay 0.01 \
    --logging_strategy steps \
    --logging_steps 10 \
    --save_strategy steps \
    --save_total_limit 2 \
    --save_steps 200 \
    --gradient_accumulation_steps ${gradient_accumulation_steps} \
    --preprocessing_num_workers 16 \
    --block_size ${block_size} \
    --output_dir ${output_dir} \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --lora_rank ${lora_rank} \
    --lora_alpha ${lora_alpha} \
    --trainable ${lora_trainable} \
    --lora_dropout ${lora_dropout} \
    --modules_to_save ${modules_to_save} \
    --torch_dtype float32 \
    --load_in_kbits 8 \
    --save_safetensors False \
    --gradient_checkpointing \
    --ddp_find_unused_parameters False
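For reference, with the values hard-coded above (per_device_train_batch_size=1, gradient_accumulation_steps=1, 6 GPUs, block_size=32), each optimizer step sees only a handful of short sequences. A quick back-of-the-envelope computation in Python, using only numbers taken from the script:

# Effective batch per optimizer step with the settings from the script above.
per_device_train_batch_size = 1
gradient_accumulation_steps = 1
num_gpus = 6
block_size = 32

sequences_per_step = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
tokens_per_step = sequences_per_step * block_size
print(f"{sequences_per_step} sequences per optimizer step")  # 6
print(f"{tokens_per_step} tokens per optimizer step")        # 192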

Dependencies (must be provided for code-related issues)

accelerate==0.27.2
aiofiles==23.2.1
aiohttp==3.9.3
aiosignal==1.3.1
altair==5.2.0
anyio==4.3.0
appdirs==1.4.4
async-timeout==4.0.3
attrs==23.2.0
bitsandbytes==0.41.1
certifi==2024.2.2
charset-normalizer==3.3.2
click==8.1.7
contourpy==1.2.0
cycler==0.12.1
datasets==2.14.5
deepspeed==0.11.0
dill==0.3.7
docker-pycreds==0.4.0
exceptiongroup==1.2.0
fastapi==0.109.2
ffmpy==0.3.2
filelock==3.13.1
fire==0.5.0
fonttools==4.49.0
frozenlist==1.4.1
fsspec==2023.6.0
gitdb==4.0.11
GitPython==3.1.42
gradio==3.50.2
gradio_client==0.6.1
h11==0.14.0
hjson==3.1.0
httpcore==1.0.3
httpx==0.26.0
huggingface-hub==0.17.3
idna==3.6
importlib-resources==6.1.1
Jinja2==3.1.2
joblib==1.3.2
jsonschema==4.21.1
jsonschema-specifications==2023.12.1
kiwisolver==1.4.5
MarkupSafe==2.1.3
matplotlib==3.8.3
mpmath==1.3.0
multidict==6.0.5
multiprocess==0.70.15
networkx==3.2.1
ninja==1.11.1.1
numpy==1.26.4
nvidia-cublas-cu11==11.11.3.6
nvidia-cuda-cupti-cu11==11.8.87
nvidia-cuda-nvrtc-cu11==11.8.89
nvidia-cuda-runtime-cu11==11.8.89
nvidia-cudnn-cu11==8.7.0.84
nvidia-cufft-cu11==10.9.0.58
nvidia-curand-cu11==10.3.0.86
nvidia-cusolver-cu11==11.4.1.48
nvidia-cusparse-cu11==11.7.5.86
nvidia-nccl-cu11==2.19.3
nvidia-nvtx-cu11==11.8.86
orjson==3.9.14
packaging==23.2
pandas==2.2.0
pathtools==0.1.2
peft==0.3.0
pillow==10.2.0
protobuf==4.25.3
psutil==5.9.8
py-cpuinfo==9.0.0
pyarrow==15.0.0
pydantic==1.10.14
pydub==0.25.1
pyparsing==3.1.1
python-dateutil==2.8.2
python-multipart==0.0.9
pytz==2024.1
PyYAML==6.0.1
referencing==0.33.0
regex==2023.12.25
requests==2.31.0
rpds-py==0.18.0
safetensors==0.4.2
scikit-learn==1.4.1.post1
scipy==1.11.1
semantic-version==2.10.0
sentencepiece==0.1.99
sentry-sdk==1.40.5
setproctitle==1.3.3
six==1.16.0
smmap==5.0.1
sniffio==1.3.0
starlette==0.36.3
sympy==1.12
termcolor==2.4.0
threadpoolctl==3.3.0
tokenizers==0.14.1
toolz==0.12.1
torch==2.2.0+cu118
torchaudio==2.2.0+cu118
torchvision==0.17.0+cu118
tqdm==4.66.2
transformers==4.34.0
triton==2.2.0
typing_extensions==4.9.0
tzdata==2024.1
urllib3==2.2.1
uvicorn==0.27.1
wandb==0.15.12
websockets==11.0.3
xxhash==3.4.1
yarl==1.9.4

Execution logs or screenshots

The model's output continuously repeats a single word and is completely meaningless. Do you know where the problem might be coming from?

This may be a case of underfitting.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.

This may be a case of underfitting.

Thank you. Due to the 16 GB of memory on each GPU, I cannot increase the batch size. Could the issue possibly be due to a very small batch size?
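If per-GPU memory is the hard limit, gradient accumulation is one way to grow the effective batch without growing the micro-batch: the per-device batch stays at 1 and gradients are summed over several micro-batches before each optimizer step. The sketch below only illustrates how the effective batch scales with gradient_accumulation_steps; the specific values are illustrative, not a tuned recommendation:

# Scaling the effective batch via gradient accumulation; per-GPU activation
# memory stays roughly the same because each forward/backward pass still
# processes a micro-batch of 1. Accumulation values below are illustrative.
per_device_train_batch_size = 1
num_gpus = 6
block_size = 32  # as in the script above

for gradient_accumulation_steps in (1, 8, 32):
    sequences = per_device_train_batch_size * num_gpus * gradient_accumulation_steps
    tokens = sequences * block_size
    print(f"gradient_accumulation_steps={gradient_accumulation_steps:>2}: "
          f"{sequences} sequences / {tokens} tokens per update")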

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.

Closing the issue since no updates have been observed. Feel free to re-open if you need any further assistance.