MLE-LLaMA: Multi-Language Enhanced LLaMA

This project aims to make LLaMA understand Chinese and generate fluent Chinese. Our motivation is that LLaMA has already learned to express itself well in English, and a small amount of alignment prompting can help it pick up Chinese.

Token vocabulary support for multiple languages. We found that the LLaMA tokenizer natively supports Chinese.
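
One likely reason the tokenizer can handle Chinese without a vocabulary extension is byte fallback: the LLaMA paper describes decomposing unknown UTF-8 characters into bytes. The sketch below illustrates that idea only. It works at the character level with a made-up vocabulary (`TOY_VOCAB` and both function names are hypothetical), whereas the real tokenizer is a SentencePiece BPE model with subword pieces.

```python
def encode_with_byte_fallback(text, vocab):
    """Encode character by character; characters missing from
    the vocabulary are split into UTF-8 byte tokens."""
    tokens = []
    for ch in text:
        if ch in vocab:
            tokens.append(ch)
        else:
            # Byte tokens use the SentencePiece-style "<0xNN>" spelling.
            tokens.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return tokens


def decode_with_byte_fallback(tokens):
    """Invert the encoding above by reassembling the byte stream."""
    out = bytearray()
    for tok in tokens:
        if len(tok) == 6 and tok.startswith("<0x") and tok.endswith(">"):
            out.append(int(tok[3:5], 16))
        else:
            out.extend(tok.encode("utf-8"))
    return out.decode("utf-8")


# Hypothetical toy vocabulary: two Chinese characters are "known".
TOY_VOCAB = {"你", "好"}

tokens = encode_with_byte_fallback("你好吗", TOY_VOCAB)
print(tokens)  # ['你', '好', '<0xE5>', '<0x90>', '<0x97>']
print(decode_with_byte_fallback(tokens))  # 你好吗
```

The trade-off is sequence length: a character that falls back to bytes costs three tokens instead of one, so Chinese text can consume the context window faster than English text.
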
Fine-tuning scripts for LLaMA.

(1) Download the original checkpoints from Hugging Face and place them under the path ckpt.

(2) train.py: the original fine-tuning script. It must be run on an 80 GB A100, and additional techniques still need to be employed.

(3) train_lora.py: LoRA fine-tuning using PEFT.
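
To give a rough sense of why LoRA is so much lighter than full fine-tuning, the sketch below counts trainable parameters. It assumes the LLaMA-7B shape (hidden size 4096, 32 layers) and, as assumptions not taken from this repository, a rank of 8 applied to the query and value projections only. The settings actually used by train_lora.py may differ.

```python
def lora_trainable_params(d_in, d_out, rank, num_modules):
    """Each adapted weight W (d_out x d_in) gets two small matrices:
    A (rank x d_in) and B (d_out x rank). Only A and B are trained."""
    return num_modules * rank * (d_in + d_out)


# Assumed LLaMA-7B dimensions.
HIDDEN_SIZE = 4096
NUM_LAYERS = 32

# Hypothetical LoRA settings: rank 8 on q_proj and v_proj in every layer.
RANK = 8
MODULES_PER_LAYER = 2

adapted_modules = NUM_LAYERS * MODULES_PER_LAYER
lora_params = lora_trainable_params(HIDDEN_SIZE, HIDDEN_SIZE, RANK, adapted_modules)
frozen_params = adapted_modules * HIDDEN_SIZE * HIDDEN_SIZE

print(lora_params)                           # 4194304
print(f"{lora_params / frozen_params:.2%}")  # 0.39%
```

Under these assumptions only about 4.2 million parameters receive gradients and optimizer state, which is the main reason a LoRA run fits on far smaller hardware than full fine-tuning.
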

Argument	Values
`batch size`	128 * 8
`epochs`	3
`cut length`	256
`learning rate`	2e-5
`speed`	1.02s / it
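
The values above can be turned into a back-of-the-envelope estimate of training time. This is only a sketch under two assumptions the README does not confirm: that one iteration processes the full 128 * 8 = 1024 examples, and that the dataset size (`num_examples`, set here to a hypothetical 500,000) is known.

```python
import math

# Values taken from the table above.
EXAMPLES_PER_ITERATION = 128 * 8   # assumed to be consumed in one iteration
EPOCHS = 3
SECONDS_PER_ITERATION = 1.02


def estimate_training(num_examples):
    """Return (total iterations, wall-clock hours) for a dataset size."""
    steps_per_epoch = math.ceil(num_examples / EXAMPLES_PER_ITERATION)
    total_steps = steps_per_epoch * EPOCHS
    hours = total_steps * SECONDS_PER_ITERATION / 3600
    return total_steps, hours


# Hypothetical dataset size, for illustration only.
total_steps, hours = estimate_training(500_000)
print(total_steps)       # 1467
print(f"{hours:.2f} h")  # 0.42 h
```

If 128 * 8 instead means a per-device batch with gradient accumulation and the speed is reported per micro-batch, the estimate grows by the accumulation factor.
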
Fine-grained English-Chinese alignment dataset. We collected high-quality English-Chinese pairs, which can be downloaded from Google Drive.

We also found that BELLE provides checkpoints and a Chinese dataset; we strongly recommend referring to it.
Instruction tuning. We use the Chinese Alpaca dataset and GuanacoDataset for instruction tuning.
Open-source checkpoints, Gradio scripts, and example cases. We found that the LLaMA model tends to generate long sentences.
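
For the instruction-tuning item above, a training example is usually rendered into a single prompt before tokenization. The sketch below is modeled on the Stanford Alpaca template [2]; whether this repository uses exactly this wording is an assumption. The function names and sample record are hypothetical, and the cut length of 256 is assumed to mean a maximum number of tokens.

```python
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def build_prompt(example):
    """Render one Alpaca-style record (instruction, optional input)."""
    if example.get("input"):
        return PROMPT_WITH_INPUT.format(
            instruction=example["instruction"], input=example["input"]
        )
    return PROMPT_NO_INPUT.format(instruction=example["instruction"])


def truncate(token_ids, cut_length=256):
    """Keep at most cut_length token ids (assumed meaning of 'cut length')."""
    return token_ids[:cut_length]


# Hypothetical record in the Alpaca JSON layout.
example = {
    "instruction": "把下面的句子翻译成英文。",
    "input": "今天天气很好。",
    "output": "The weather is nice today.",
}
prompt = build_prompt(example)
print(prompt)
```

During training the `output` field is appended after the response marker, and the combined token sequence is cut to the configured length.
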

References

[1] https://github.com/facebookresearch/llama

[2] https://github.com/tatsu-lab/stanford_alpaca

[3] https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling

[4] https://github.com/tloen/alpaca-lora

[5] https://github.com/LianjiaTech/BELLE
