- torch
- transformers
- spacy
- LASER
- vecalign
- peft
- bitsandbytes
Part of the code is based on alpaca-lora.
Due to license issues, we only provide the URLs of the Chinese books. The aligned sentence mappings between the English and Chinese books will be uploaded so that you can reconstruct the Chinese data yourself; a sketch of how such a mapping might be applied follows.
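A minimal sketch of applying a sentence mapping, assuming the alignments are released as per-book JSON lists of (English index, Chinese index) pairs; the file names and schema below are hypothetical, not the repo's exact format:

```python
import json

# Hypothetical file name and schema: adapt to the released alignment format.
with open("en_zh_alignment.json", encoding="utf-8") as f:
    alignment = json.load(f)  # e.g., {"book_0001": [[en_idx, zh_idx], ...]}

# One sentence per line; the Chinese book is downloaded from the provided URL.
en_sents = open("en_book.txt", encoding="utf-8").read().splitlines()
zh_sents = open("zh_book.txt", encoding="utf-8").read().splitlines()

# Recover the aligned English-Chinese sentence pairs.
pairs = [(en_sents[i], zh_sents[j]) for i, j in alignment["book_0001"]]
```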
pending...
train (8 x A100 GPUs of 40G):
python -m torch.distributed.launch --nproc_per_node 8 --nnodes=1 --node_rank=0 finetune.py \
--micro_batch_size 4 \
--batch_size 128 \
--output_dir 'personet_model_save/ddp_8gpus (your own path)' \
--lora_r 8 \
--lora_alpha 16 \
--lora_target_modules '[q_proj,k_proj,v_proj,o_proj]' \
--num_epochs 5 \
--learning_rate 1e-4 \
--warmup_steps 170 \
--cutoff_len 1000 \
--eval_steps 340
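With these flags, the effective batch size is 128. Assuming finetune.py derives gradient accumulation from them the way alpaca-lora does (an assumption about the internals), the arithmetic works out as:

```python
# Assumption: finetune.py computes gradient accumulation as alpaca-lora does.
micro_batch_size = 4   # --micro_batch_size (per-GPU batch)
batch_size = 128       # --batch_size (effective batch across all GPUs)
world_size = 8         # --nproc_per_node
gradient_accumulation_steps = batch_size // (micro_batch_size * world_size)
assert gradient_accumulation_steps == 4
```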
generate on dev/test data (dev: full_dev_data.json; test: full_test_data.json; 1 A100 GPU of 40G):
python generate_new.py \
--load_8bit \
--lora_r 8 \
--lora_alpha 16 \
--lora_target_modules '[q_proj,k_proj,v_proj,o_proj]' \
--lora_weights '(your own path)' \
--eval_data_path 'full_test_data.json'
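If you want to use the finetuned adapter programmatically instead of through generate_new.py, here is a minimal inference sketch with transformers + peft; the base-model path and the prompt are assumptions you should replace with whatever you trained on:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = "path/to/base-llama"               # assumption: the base checkpoint used for finetuning
lora_weights = "personet_model_save/ddp_8gpus"  # the --output_dir from training

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    load_in_8bit=True,        # mirrors --load_8bit; requires bitsandbytes
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, lora_weights)  # attach the LoRA adapter
model.eval()

prompt = "..."  # build the same prompt format used during finetuning
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```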
If you want a longer history for an instance, you can obtain it by matching the given context against the original book texts from PG19 or Project Gutenberg, as sketched below.
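A minimal sketch of this matching, assuming each instance stores its snippet under a "context" field (the field name and the book file path are assumptions):

```python
import json

book = open("pg19_book.txt", encoding="utf-8").read()  # hypothetical path to the original book text
instance = json.load(open("full_test_data.json", encoding="utf-8"))[0]
context = instance["context"]  # assumption: field holding the given context snippet

# Match on a prefix of the context to tolerate minor formatting differences,
# then extend the window backwards to collect extra history.
start = book.find(context[:200])
if start != -1:
    longer_history = book[max(0, start - 5000): start + len(context)]
```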
For any questions, feel free to email us or open an issue, and we will get back to you as soon as possible. We hope this repo is useful for your research.