/persona-dialogue


Introduction

This repository contains Python scripts that extract dialogue from text files into a dataset, and then fine-tune a language model to mimic the way a specific character or person talks.

How to use

This repository only includes the code for dataset preparation and fine-tuning, not the code for inference (generation) or a user interface. Download all files into the same folder as the interface you are currently using; I strongly recommend text-generation-webui. If you use another interface, you may need to install additional libraries and dependencies. Windows users may need to install the Windows Subsystem for Linux.

Tested on Pygmalion; it should work for GPT-J and all models derived from it. It may also be useful for models such as GALACTICA, OPT, and LLaMA, but further testing is required.

Prepare the dialogue dataset

There is no limit on the number of samples. I tested with around 80 and 380 samples (meaning the original text document has 80 or 380 lines belonging to the selected character), but models fine-tuned on larger datasets usually perform better. Dialogue should be contained in a text file named dataset.txt, in the following format:

Here is a sample dialogue from Alan Turing's COMPUTING MACHINERY AND INTELLIGENCE. The text file can also contain some non-dialogue lines like this if you think it helpful to the content of the dialogue, but the dialogue lines should always start with the character's name.
Alice:In the first line of your sonnet which reads ‘Shall I compare thee to a summer's day’, would not ‘a spring day’ do as well or better?
Bob:It wouldn’t scan.
Alice:How about ‘a winter's day’? That would scan all right.
Bob:Yes, but nobody wants to be compared to a winter's day.
Alice:Would you say Mr. Pickwick reminded you of Christmas?
Bob:In a way.

Alice:(You can put some actions or ideas in parentheses, asterisks are also fine, but I prefer parentheses:))Yet Christmas is a winter's day, and I do not think Mr. Pickwick would mind the comparison.
Bob:I don’t think you’re serious. By a winter's day one means a typical winter's day, rather than a special one like Christmas.

dataset.txt needs to be saved in the same directory as prepare_dataset.py.
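Conceptually, the extraction performed by prepare_dataset.py works like this sketch (the function and key names here are assumptions for illustration, not the actual script):

```python
def extract_samples(lines, name, rounds=5):
    """Collect each line spoken by `name`, together with the
    preceding `rounds` lines of the transcript as context."""
    samples = []
    prefix = name + ":"
    for i, line in enumerate(lines):
        if line.startswith(prefix):
            context = "\n".join(lines[max(0, i - rounds):i])
            response = line[len(prefix):].strip()
            samples.append({"context": context, "response": response})
    return samples

lines = [
    "Alice:Shall I compare thee to a summer's day?",
    "Bob:It wouldn't scan.",
    "Alice:How about a winter's day?",
    "Bob:Nobody wants to be compared to a winter's day.",
]
print(extract_samples(lines, "Bob", rounds=2))
```

Every line belonging to the chosen character becomes one training sample, with the lines before it serving as context.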

For those who use text-generation-webui, use this command:

conda activate textgen
cd text-generation-webui
python prepare_dataset.py --name Bob

Please replace "Bob" with your character's name. If you don't use text-generation-webui as the interface, just activate your own environment and replace "text-generation-webui" with your own folder name or path. This converts the raw dialogue in the text file into a JSON dataset named processed_dataset.json.
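Each entry in the resulting file pairs some context with one of the character's replies. The snippet below shows a plausible shape for a single entry; the exact field names are an assumption, so inspect your generated processed_dataset.json to confirm:

```python
import json

# Hypothetical shape of one entry in processed_dataset.json;
# the real field names depend on prepare_dataset.py.
sample = {
    "context": "Alice:Would you say Mr. Pickwick reminded you of Christmas?",
    "response": "In a way.",
}
print(json.dumps(sample, indent=2))
```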

Finetune the model

If the "models" and "loras" folders do not exist, create them, and download the model to be trained into the "models" folder. If the two folders and the model already exist, there is no need to create them again.

For a quick start, use this command:

python lora_finetune.py -m pygmalion-6b

Please replace "pygmalion-6b" with the model you want to train; just copy and paste the name of that model's folder inside the "models" folder.

Some details

Some additional parameters can be passed via the command line. For prepare_dataset.py:

'--name', '-n', help="character's name", required=True, type=str
'--rounds', '-r', help='context rounds', default=5, type=int
'--txtfilename', '-t', help='raw dataset', default='dataset.txt', type=str
'--jsonfilename', '-j', help='processed dataset', default='processed_dataset.json', type=str

"rounds" determines how many lines of context each dialogue sample contains. If you modify "txtfilename" or "jsonfilename", you also need to modify the associated items, such as "dataset_name" in lora_finetune.py.

For lora_finetune.py:

'--dataset_name', '-d', default='processed_dataset.json', type=str
'--model_name', '-m', default='pygmalion-6b', type=str
'--lora_name', '-l', default='lora_test', type=str
'--batch_size', '-b', default=4, type=int
'--epochs', '-e', default=10, type=int
'--cutoff_len', '-c', default=256, type=int
'--lora_rank', '-r', default=16, type=int

Increasing batch_size and cutoff_len can make training more stable (convergence will be slightly slower), but if your GPU memory is small, you may have to choose lower values😥. Higher epochs and lora_rank values make the fine-tuned model's output more like the dialogue training set you provided, but higher is not necessarily better.

If you have experience with fine-tuning, you can modify other parameters in lora_finetune.py, but the default values should work fine.

Load the LoRA adapter

If the interactive interface you are currently using already supports LoRA (you may need to load the model in 8-bit and set the EOS token to "\n"; text-generation-webui users can tick "Stop generating at new line character?" in Parameters), please refer to its instructions. Otherwise, you may need to manually edit the text-generation code. Maybe something like ...

From

model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map=device_map)

To

from transformers import AutoModelForCausalLM
from peft import PeftModel

model = AutoModelForCausalLM.from_pretrained(model_name_or_path, load_in_8bit=True, device_map=device_map)
lora_name_or_path = "loras/lora_test"
model = PeftModel.from_pretrained(model, lora_name_or_path, device_map=device_map)

Please replace "loras/lora_test" with your own LoRA name or path.

TO DO


Parallel Context Windows

Medium and Long Term Memory