Before you start running the project, you need to set up your environment by following these steps.
conda create -n explanation python=3.10
conda activate explanation
git clone https://github.com/Strong-AI-Lab/Explanation-Generation.git
cd Explanation-Generation
pip install -r requirements.txt
We follow the fine-tuning steps from Stanford Alpaca to conduct instruction tuning on the LLaMA-7B model and replicate Alpaca-7B:
https://github.com/tatsu-lab/stanford_alpaca#fine-tuning
Alpaca-7B: https://github.com/tatsu-lab/stanford_alpaca#recovering-alpaca-weights
Alpaca-13B: https://huggingface.co/chavinlo/alpaca-13b
Vicuna-7B: https://github.com/lm-sys/FastChat#vicuna-7b
Vicuna-13B: https://github.com/lm-sys/FastChat#vicuna-13b
GPT4-x-alpaca: https://huggingface.co/chavinlo/gpt4-x-alpaca
Generator means that we use the whole question, including the question stem, each option, and the answer, as the input, and the output is the explanation.
Data Format for generator:
Instruct: As an explanation generation expert, can you generate the explanation for the given input?
Input: Question, Option A, Option B, Option C, Option D, Option E, The correct answer
Output: Generated Explanation
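For concreteness, one training example in this format could be stored as an Alpaca-style JSON record. This is a minimal sketch: the field names (`instruction`, `input`, `output`) follow the Stanford Alpaca convention and are an assumption about what the preprocessing script emits, not confirmed output of `data_preprocessing_generator.py`.

```python
import json

# Hypothetical generator training record in Alpaca-style format;
# the field names are assumed, not taken from the repository's scripts.
record = {
    "instruction": "As an explanation generation expert, can you generate the explanation for the given input?",
    "input": "Question: ...\nOption A: ...\nOption B: ...\nOption C: ...\nOption D: ...\nOption E: ...\nThe correct answer: ...",
    "output": "Generated explanation text goes here.",
}

with open("generator_train_example.json", "w") as f:
    json.dump([record], f, indent=2)
```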
To use the whole dataset for the training set, you can run the following command.
python data_preprocessing_generator.py
To use only the Cardiff data with an average rating score >= 3 and an explanation length >= 10 for the training set, you can run the following command.
python data_preprocessing_generator_one_dataset.py
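The filtering criteria (average rating score >= 3 and explanation length >= 10) could be applied roughly as in the sketch below. The file path, column names, and the choice of measuring explanation length in characters are all assumptions; check `data_preprocessing_generator_one_dataset.py` for the actual fields and rules.

```python
import pandas as pd

# Hypothetical sketch of the Cardiff filtering step; the CSV path and the
# "avg_rating" / "explanation" column names are assumptions.
df = pd.read_csv("Cardiff_peerwise.csv")
mask = (df["avg_rating"] >= 3) & (df["explanation"].str.len() >= 10)
filtered = df[mask]
print(f"Kept {len(filtered)} of {len(df)} questions")
```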
Way 2 verifier means that we use the whole question, including the question stem, each option, the answer, and the explanation, as the input, and the output is the question rating score. In this way we avoid the assumption made in way 1, although it may increase the length of the whole input. It is the more reasonable way at this stage.
Data Format for Way 2:
Instruct: As a question rating verifier expert, can you generate the question rating score for the given input?
Input: Question, Option A, Option B, Option C, Option D, Option E, Explanation
Output: Question average rating score
python data_preprocessing_verifier_way2.py
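A way 2 verifier training record pairs the full question plus explanation with the numeric rating as the target. As with the generator example above, this is only a sketch: the JSON field names follow the Alpaca convention and are an assumption about the preprocessing output.

```python
import json

# Hypothetical way 2 verifier record; field names and the textual rendering
# of the rating score are assumptions.
record = {
    "instruction": "As a question rating verifier expert, can you generate the question rating score for the given input?",
    "input": "Question: ...\nOption A: ...\nOption B: ...\nOption C: ...\nOption D: ...\nOption E: ...\nExplanation: ...",
    "output": "3.4",  # the question's average rating score as text
}

print(json.dumps(record, indent=2))
```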
You need to convert the LLaMA weights into the Hugging Face supported format before you run the scripts for the experiments.
## Convert the LLaMA-7B to LLaMA-7B huggingface model
python transformers/src/transformers/models/llama/convert_llama_weights_to_hf.py \
--input_dir ../../LLaMA/7B \
--model_size 7B \
--output_dir llama_7B_hf
## Convert the LLaMA-13B to LLaMA-13B huggingface model
python transformers/src/transformers/models/llama/convert_llama_weights_to_hf.py \
--input_dir ../../LLaMA/13B \
--model_size 13B \
--output_dir llama_13B_hf
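After conversion you can sanity-check that the Hugging Face checkpoint loads. This sketch assumes a transformers version with LLaMA support (>= 4.28) and uses the `--output_dir` from the commands above.

```python
from transformers import LlamaForCausalLM, LlamaTokenizer

# Sanity check: load the converted checkpoint produced above
# (use "llama_13B_hf" for the 13B conversion).
tokenizer = LlamaTokenizer.from_pretrained("llama_7B_hf")
model = LlamaForCausalLM.from_pretrained("llama_7B_hf")

inputs = tokenizer("The capital of New Zealand is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```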
You can find the detailed training script in training_script.sh. It includes the commands for the following functions.
- Convert the LLaMA model from the Meta release to the Hugging Face version.
- Instruction tuning for LLaMA-7B using 4 A100 80 GB GPUs to replicate Alpaca-7B; alternatively, you can download the weights for Alpaca-7B or the other models from the links above.
- Train a generator via instruction tuning on the new PeerWise dataset using LLaMA-7B or Alpaca-7B (4 A100 80 GB GPUs needed), or LLaMA-13B, Alpaca-13B, or Vicuna-13B (8 A100 80 GB GPUs needed).
- Train a way 2 verifier via instruction tuning on the new PeerWise dataset using LLaMA-7B or Alpaca-7B.
Here is an example of fine-tuning Vicuna-13B as a generator on the Cardiff-only data with average rating score >= 3 and explanation length >= 10. You need 8 A100 80 GB GPUs.
## Fine-tuning the Vicuna-13B using Cardiff only avg >=3 and explanation length >=10 PeerWise dataset for explanation generator
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc_per_node=8 --master_port=2026 train.py \
--model_name_or_path vicuna-13b \
--data_path ./Paul_new_data/Cardiff_generator_train_avg_3_lenexp_10.json \
--bf16 True \
--output_dir vicuna_13B_Cardiff_generator_avg_3_lenexp_10 \
--model_max_length 512 \
--num_train_epochs 5 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 16 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 2000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
--tf32 True \
--gradient_checkpointing True
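With these settings, the effective global batch size is per_device_train_batch_size x number of GPUs x gradient_accumulation_steps:

```python
# Effective global batch size for the command above.
per_device_train_batch_size = 1
num_gpus = 8                      # --nproc_per_node=8
gradient_accumulation_steps = 16
print(per_device_train_batch_size * num_gpus * gradient_accumulation_steps)  # 128
```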
To interact with the generator and the way 2 verifier, you can run the following command. The code will call the methods in chat_generator.py and chat_verifier_way2.py.
python chat_explanation_verifier_way2.py
To batch evaluate the explanations generated by the generator on the Cardiff-only data, you can run the following command.
python batch_evaluation_Cardiff.py
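As a rough picture of what such a batch run could look like, the sketch below loads a test JSON, generates one explanation per question, and writes the outputs to a file. The test file name and record fields are assumptions, and `batch_evaluation_Cardiff.py` may additionally compute evaluation metrics; only the model directory is taken from the fine-tuning command above.

```python
import json
from transformers import LlamaForCausalLM, LlamaTokenizer

# Hypothetical batch generation loop; the test file path and JSON field
# names are assumptions. The model directory is the --output_dir above.
model_dir = "vicuna_13B_Cardiff_generator_avg_3_lenexp_10"
tokenizer = LlamaTokenizer.from_pretrained(model_dir)
model = LlamaForCausalLM.from_pretrained(model_dir)

with open("Cardiff_generator_test.json") as f:
    test_set = json.load(f)

results = []
for example in test_set:
    prompt = f"{example['instruction']}\n{example['input']}\n"
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=256)
    text = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    results.append({"input": example["input"], "generated_explanation": text})

with open("Cardiff_generated_explanations.json", "w") as f:
    json.dump(results, f, indent=2)
```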
- Save the models from different epochs and compare their explanation generation performance.
- Check the hyperparameters of the model.generate function to see how they change the model output (see the sketch after this list).
- Think about how to generate new data and teach the model which explanations are better.
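As a starting point for the second item, here is a minimal sketch comparing a few common model.generate decoding settings. The model directory and parameter values are illustrative assumptions, not the repository's defaults.

```python
from transformers import LlamaForCausalLM, LlamaTokenizer

# Compare decoding strategies; the values below are examples only.
tokenizer = LlamaTokenizer.from_pretrained("llama_7B_hf")
model = LlamaForCausalLM.from_pretrained("llama_7B_hf")

prompt = "Explain why option B is the correct answer."
inputs = tokenizer(prompt, return_tensors="pt")

configs = {
    "greedy": dict(do_sample=False),
    "sampling": dict(do_sample=True, temperature=0.7, top_p=0.9),
    "beam_search": dict(num_beams=4, do_sample=False),
}
for name, kwargs in configs.items():
    out = model.generate(**inputs, max_new_tokens=64, **kwargs)
    print(name, "->", tokenizer.decode(out[0], skip_special_tokens=True))
```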
https://drive.google.com/file/d/1m7FLEvTJnjxjqNRxCNnjzweYoWn43k4x/view?usp=sharing
Thanks to the great example from ChatDoctor, which inspired us to develop the code for interacting with the user.