Important: Installation is mandatory.
git clone https://github.com/JL-er/RWKV-PEFT.git
cd RWKV-PEFT
pip install -r requirements.txt
Tip: If you are using a cloud server (such as Vast or AutoDL), you can start the Streamlit service by referring to the help documentation on the cloud provider's official website.
streamlit run web/home.py
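On a remote machine you typically also need to bind the service to a reachable address and port before you can open it in a browser; a minimal sketch (the port number is illustrative, and your provider's port-forwarding rules still apply):
streamlit run web/home.py --server.address 0.0.0.0 --server.port 8501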
- infctx
- fla --fla
- State tuning
- Quant (QPissa, QLora) --quant int8/nf4
- Bone
- Pissa
- Lora
- dataload (get, pad, only)
The table below shows the memory requirements for training the following models on a 4090 24GB GPU with 64GB of CPU RAM (--strategy deepspeed_stage_1 --ctx_len 1024 --micro_bsz 1 --lora_r 64).
| Model | Full Finetuning | Lora/Pissa | QLora/QPissa | State tuning |
|---|---|---|---|---|
| RWKV6-1.6B | OOM GPU | 7.4GB GPU | 5.6GB GPU | 6.4GB GPU |
| RWKV6-3B | OOM GPU | 12.1GB GPU | 8.2GB GPU | 9.4GB GPU |
| RWKV6-7B | OOM GPU | 23.7GB GPU (bsz 8 OOM) | 14.9GB GPU (bsz 8 needs 19.5GB) | 18.1GB GPU |
The next table shows how quantization affects memory usage, under the following settings:
- strategy deepspeed_stage_1
- ctx_len 1024
- micro_bsz 1
- 4090 24GB
| Model | bf16 | int8 | nf4/fp4/4bit |
|---|---|---|---|
| RWKV6-1.6B | 6.1GB GPU | 4.7GB GPU | 4.1GB GPU |
| RWKV6-3B | 9.1GB GPU | 6.5GB GPU | 5.2GB GPU |
| RWKV6-7B | 17.8GB GPU | 11.9GB GPU | 8.5GB GPU |
| RWKV6-14B | xxGB GPU | xxGB GPU | xxGB GPU |
Install the necessary dependencies (as in the installation step above):
pip install -r requirements.txt
Refer to the examples in the scripts folder, modify the paths and required parameters (for detailed data preparation, please refer to the official RWKV tutorial), and then run:
sh scripts/run_lora.sh
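A common workflow is to copy one of the example scripts and edit it; the copied file name below is just an example:
cp scripts/run_lora.sh my_run.sh   # copy an example script
# edit my_run.sh: set your model and data paths, then adjust the flags described below
sh my_run.sh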
- peft
The --peft parameter selects one of several fine-tuning methods. Refer to the detailed descriptions below, choose the method you need, and then configure the corresponding config.
--peft bone --bone_config $bone_config
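For example, switching to LoRA would look like the line below; the --lora_config flag name is assumed here by analogy with --bone_config, so check the example scripts for the exact flag name and JSON keys:
--peft lora --lora_config $lora_config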
- train_parts
Gives more freedom in choosing which parts to train, such as "emb", "head", "time", and "ln". If you want to train only the k and v parts, simply set this to []. For regular fine-tuning users it is not recommended to add this parameter; if it is omitted, ["time", "ln"] is included by default, and these parts account for only a small proportion of the total parameters.
--train_parts ["time", "ln"]
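For example, to drop these extra trainable parts entirely (the k/v-only case mentioned above), pass an empty list:
--train_parts []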
- Quant
When using PEFT or state tuning, you can use quantized weights to reduce memory usage.
--quant int8/nf4
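For example, the QPissa column in the memory table above corresponds roughly to combining Pissa with quantization (the Pissa config flag is omitted here; pass it as described under peft):
--peft pissa --quant nf4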
- infctx
The RWKV series has a unique training mode (infinite-length training) that can be combined with any fine-tuning method to prevent the memory blow-up caused by excessively long training data. ctx_len is the length you want to train on (set it according to the length of the training data), while chunk_ctx should be adjusted to fit memory; chunk_ctx is obtained by slicing ctx_len, so make sure chunk_ctx is smaller than ctx_len. Add script parameters as follows:
--train_type infctx --chunk_ctx 512 --ctx_len 2048
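With the values above, each 2048-token sample is processed as 2048 / 512 = 4 chunks, so peak memory is governed mainly by chunk_ctx rather than ctx_len. For longer data you can raise ctx_len while keeping chunk_ctx small, for example (values illustrative):
--train_type infctx --ctx_len 8192 --chunk_ctx 512   # chunk_ctx must stay smaller than ctx_len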
- State tuning
A fine-tuning method unique to RWKV, with very low training overhead.
--train_type "state" --fla
- dataload
Supports different data sampling methods:
- get (default, from RWKV-LM): random sampling that treats all data as a single sequence and randomly slices it according to ctx_len, for easier parallel processing.
- pad: samples from the beginning of each piece of data and pads at the end. For example, if ctx_len is 1024 and the current sample is only 1000 tokens long, the first 24 tokens of the next piece of data are appended as padding, for easier parallel processing.
- only: samples from the beginning of each piece of data and only supports bsz=1. ctx_len is set equal to the maximum sampling length; if the current sample is longer than ctx_len, the excess is truncated.
--dataload pad
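For example, the only mode requires a micro batch size of 1, as noted above:
--dataload only --micro_bsz 1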
- loss_mask
Supports masking the question part in QA tasks (so the model cannot simply memorize answers from questions, which improves generalization) as well as the padding at the end of samples.
--loss_mask qa/pad
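For example, for a question-answer dataset, masking the question part looks like:
--loss_mask qa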
- strategy
DeepSpeed memory allocation strategies: prefer strategy 1; when the model is large or during full fine-tuning, use strategy 2 or 3, and if memory still overflows, use offload. Strategy 3 supports model parallelism (a single model split across multiple GPUs). Available strategies: deepspeed_stage_1, deepspeed_stage_2, deepspeed_stage_2_offload, deepspeed_stage_3, deepspeed_stage_3_offload.
--strategy deepspeed_stage_1
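For example, following the guidance above, a large model or full fine-tune that still overflows with stage 2 can fall back to offloading:
--strategy deepspeed_stage_3_offload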
- ctx_len
Training sample length; adjust it according to the length of your data. As ctx_len increases, memory usage also increases.
- micro_bsz
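micro_bsz is the per-GPU micro batch size; like ctx_len, increasing it increases memory usage. Putting the documented flags together, here is a hedged sketch of the flag section of a training script, using only flags described above (model and data paths plus any other required arguments are omitted; take those from the complete examples under scripts/):
# sketch only: required path arguments are omitted, see scripts/ for runnable examples
--peft bone --bone_config $bone_config \
--strategy deepspeed_stage_1 --ctx_len 1024 --micro_bsz 1 \
--dataload pad --loss_mask qa \
--quant int8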
If you find this repo useful, please consider citing our work:
@misc{kang2024boneblockaffinetransformation,
title={Bone: Block Affine Transformation as Parameter Efficient Fine-tuning Methods for Large Language Models},
author={Jiale Kang},
year={2024},
eprint={2409.15371},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.15371},
}