Model Interaction Code
Create virtual environment
- Set up the virtual environment:
```bash
conda create -n msqa python=3.10
conda activate msqa
git clone https://github.com/ModelInteraction/MSQA.git
conda install pytorch==1.12.0 torchvision==0.13.0 torchaudio==0.12.0 cudatoolkit=11.3 -c pytorch
pip install -r requirements.txt
```
Process Azure documentation for pretraining
- Go to `pretrain_azure_doc/` and run the command line below to download the Azure documentation for pretraining:
```bash
chmod +x clone_repos.sh
./clone_repos.sh
```
- Extract and rename the Markdown files and save them to `pretrain_azure_doc/data/`:
```bash
python save_azure.py
```
- Split the Markdown files into JSON files capped at the maximum token length for pretraining and save them to `pretrain_azure_doc/azure_json_output/` (a sketch of the splitting logic follows this step):
```bash
python process_azure.py
```
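The splitting step simply chunks each document so that no chunk exceeds the maximum token length. The snippet below is only a minimal sketch of that idea, not the actual `process_azure.py`: the tokenizer choice, the 512-token limit (matching `--model_max_length` used later), and the output schema are all assumptions.

```python
# Minimal sketch of token-length-limited splitting; see the assumptions noted above.
import json
from pathlib import Path

from transformers import AutoTokenizer

MAX_TOKENS = 512  # assumed limit, matching --model_max_length used for pretraining
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # hypothetical tokenizer choice

def split_markdown(md_path: Path, max_tokens: int = MAX_TOKENS):
    """Yield text chunks whose token count stays within max_tokens."""
    tokens = tokenizer.encode(md_path.read_text(encoding="utf-8"))
    for start in range(0, len(tokens), max_tokens):
        yield tokenizer.decode(tokens[start:start + max_tokens])

chunks = [
    {"source": md.name, "text": chunk}  # placeholder schema
    for md in Path("data").glob("*.md")
    for chunk in split_markdown(md)
]
Path("azure_json_output").mkdir(exist_ok=True)
with open("azure_json_output/azure_pretrain.json", "w", encoding="utf-8") as f:
    json.dump(chunks, f, ensure_ascii=False, indent=2)
```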
Process MSQA data
Note: only sample MSQA data is shown here; the full dataset will be made available when the paper is accepted.
- Go to `msqa_process/`
- Post-process the MSQA data collected from the Microsoft Q&A forum:
```bash
python post_process.py
```
- Split the data into train and test JSON files, saved to `msqa_process/data/MSQA_train.json` and `msqa_process/data/MSQA_test.json`, respectively (a sketch of the split follows this step):
```bash
python split.py
```
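`split.py` performs an ordinary train/test partition of the post-processed records. The sketch below is a hedged illustration only: the input filename and the 9:1 split ratio are assumptions, so check the script for the actual values.

```python
# Hedged sketch of a train/test split; input filename and ratio are assumptions.
import json
import random

random.seed(0)  # reproducible shuffle

with open("data/MSQA_processed.json", encoding="utf-8") as f:  # hypothetical input file
    records = json.load(f)

random.shuffle(records)
cut = int(len(records) * 0.9)  # assumed 9:1 train/test ratio

with open("data/MSQA_train.json", "w", encoding="utf-8") as f:
    json.dump(records[:cut], f, ensure_ascii=False, indent=2)
with open("data/MSQA_test.json", "w", encoding="utf-8") as f:
    json.dump(records[cut:], f, ensure_ascii=False, indent=2)
```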
Pretrain and finetune
- Go to `train/`
- Pretrain on the Azure documentation with DeepSpeed using the command line below:
```bash
deepspeed train.py \
    --model_name_or_path {YOUR_MODEL_PATH} \
    --data_path {AZURE_JSON_PATH} \
    --output_dir {PRETRAIN_MODEL_SAVE_PATH} \
    --num_train_epochs 8 \
    --model_max_length 512 \
    --per_device_train_batch_size 64 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 500 \
    --save_total_limit 4 \
    --learning_rate 2e-5 \
    --warmup_steps 2 \
    --logging_steps 2 \
    --lr_scheduler_type "cosine" \
    --report_to "tensorboard" \
    --gradient_checkpointing True \
    --deepspeed configs/deepspeed_config.json \
    --fp16 True
```
where `{AZURE_JSON_PATH}` is the path where you saved the processed Azure documentation JSON, i.e. `pretrain_azure_doc/azure_json_output/`.
- Finetune on the MSQA training data previously saved in `msqa_process/data/MSQA_train.json`:
```bash
deepspeed train.py \
    --model_name_or_path {PRETRAIN_MODEL_SAVE_PATH} \
    --data_path {MSQA_TRAIN_JSON_PATH} \
    --output_dir {FINETUNE_MODEL_SAVE_PATH} \
    --num_train_epochs 5 \
    --model_max_length 512 \
    --per_device_train_batch_size 64 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 20 \
    --save_total_limit 3 \
    --learning_rate 1e-5 \
    --warmup_steps 2 \
    --logging_steps 2 \
    --lr_scheduler_type "cosine" \
    --report_to "tensorboard" \
    --gradient_checkpointing True \
    --deepspeed configs/deepspeed_config.json \
    --fp16 True
```
Inference with the finetuned model
- Generate domain knowledge with our finetuned model using the command line below:
```bash
python inference.py \
    --base_model={FINETUNE_MODEL_SAVE_PATH} \
    --infer_ids_path={QUESTION_ID_TO_INFERENCE} \
    --save_path={RESULT_SAVE_PATH} \
    --batch_size=1 \
    --max_new_tokens=512 \
    --num_beams=4
```
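Internally this is standard Hugging Face beam-search generation. The snippet below is a minimal, hypothetical illustration of the generation step (model path, prompt format, and dtype are placeholders), not the actual `inference.py`:

```python
# Hypothetical sketch of the generation step; mirrors the CLI flags above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "{FINETUNE_MODEL_SAVE_PATH}"  # placeholder: your finetuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16).cuda()

question = "How do I rotate storage account keys in Azure?"  # example question
inputs = tokenizer(question, return_tensors="pt").to(model.device)

# 512 new tokens, beam search with 4 beams, as in the command above.
output_ids = model.generate(**inputs, max_new_tokens=512, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```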
Result generation and evaluation
- Once the domain-specific model outputs its response to a question, we run LLM generation using either our domain knowledge or the chunks from retrieval-based methods.
- Save your OpenAI API key in `keybook.py`; the LLM endpoint function is in `llm_components.py`.
- Standard metrics, including BLEU, ROUGE-1/2/L, METEOR, BERT-Score, and SIM, are defined in `eval_metrics.py` (a hedged example of computing some of these appears after this list).
- Our proposed metrics:
  - CAR is defined in `is_no_answer` in `eval_metrics.py`.
  - KHR is defined in `KHR.py`; keywords need to be extracted beforehand with `keyword_extract.py`.
  - LLM-based metrics are defined in `llm_eval.py`.
- `result_generation.py` contains all prompts used to generate baseline results given either domain knowledge from our model or chunks from retrieval-based methods.
- `score_conflict.py` and `conflict_stat_plot.py` produce the conflict analysis from the LLM-based metric and its visualization, respectively.
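For orientation, the standard reference-based metrics can be reproduced with common libraries. The example below is a hedged sketch rather than the repo's `eval_metrics.py`: it assumes `nltk` and `rouge-score` are installed and shows BLEU and ROUGE-1/2/L for a single prediction/reference pair; METEOR, BERT-Score, and SIM would be added analogously.

```python
# Hedged sketch of the standard metrics; the actual eval_metrics.py may differ.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

prediction = "Rotate the storage account keys from the Azure portal or the CLI."
reference = "You can rotate storage account keys in the Azure portal or with the CLI."

# BLEU over whitespace tokens, smoothed to avoid zero scores on short texts.
bleu = sentence_bleu(
    [reference.split()], prediction.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-1/2/L F1 scores.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, prediction)

print(f"BLEU: {bleu:.3f}")
for name, score in rouge.items():
    print(f"{name}: {score.fmeasure:.3f}")
```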
Human evaluation UI
We also include a UI for human evaluators.
- Go to `ui/`
- Set up the Python virtual environment:
```bash
conda create -n humaneval python=3.10
conda activate humaneval
pip install -r requirements.txt
```
- Put the data to be evaluated in `ui/human_eval_data/`
- Prepare the data:
```bash
python preprocess_human_eval_data.py
```
- Run the UI:
```bash
streamlit run qa_preference.py
```
Human evaluation analysis
- Go to `human_annotation/`
- Put the `.csv` files of each human evaluator in `human_annotation/data/`
- Process the human evaluation:
```bash
python annotation_process.py
```
- Output statistics and plot the results with `annotation_stats.py`