# CCL 2023 Best Paper: Rethinking Label Smoothing on Multi-Hop Question Answering
Welcome to the official repository for the paper "Rethinking Label Smoothing on Multi-Hop Question Answering". Our work rethinks label smoothing for Multi-Hop Question Answering (MHQA) and introduces F1 Smoothing, a smoothing technique tailored to reading-comprehension span extraction, together with the Linear Decay Label Smoothing Algorithm (LDLA), which gradually reduces the amount of smoothing as training progresses.
## Requirements
Please make sure you have the following requirements installed:
- transformers>=4.20.0
- fastNLP==1.0.1
- jsonlines
- ipdb
- pandas
- torch
- ujson
## Data
The dataset for this project is available from the official HotpotQA website. To set it up for use with this code, follow these steps:
- Visit the HotpotQA dataset page and download the dataset.
- Create a folder named `HotpotQAData` in the root directory of this project, at the same level as the `code` folder.
- Save the downloaded dataset files into the `HotpotQAData` folder.
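If you prefer to script this step, the sketch below downloads the training and distractor-dev splits into `HotpotQAData/`. The download URLs are assumptions based on the links published on the HotpotQA website; please verify them against the official page before running.

```python
import os
import urllib.request

# Assumed URLs, copied from the official HotpotQA site (https://hotpotqa.github.io/);
# verify against the site before use.
FILES = {
    "hotpot_train_v1.1.json": "http://curtis.ml.cmu.edu/datasets/hotpot/hotpot_train_v1.1.json",
    "hotpot_dev_distractor_v1.json": "http://curtis.ml.cmu.edu/datasets/hotpot/hotpot_dev_distractor_v1.json",
}

os.makedirs("HotpotQAData", exist_ok=True)
for name, url in FILES.items():
    target = os.path.join("HotpotQAData", name)
    if not os.path.exists(target):  # skip files that were already downloaded
        print(f"Downloading {name} ...")
        urllib.request.urlretrieve(url, target)
```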
## Training
We've streamlined the process for replicating our baseline models with the starter script `train.py`. Below you'll find instructions for training both the Retriever and Reader models, along with an explanation of the script parameters you can use to customize your training runs.
Initiate training of the Retriever model using the command below:
```
python train.py --task RE --lr 5e-6 --batch-size 16 --accumulation-steps 1 --epoch 8 --seed 41 --re-model Electra
```
To start training the Reader model, use the following command:
```
python train.py --task QA --lr 2e-6 --batch-size 8 --accumulation-steps 2 --epoch 8 --seed 41 --qa-model Deberta
```
When running `train.py`, you can customize your training with the following input arguments:
- `task`: Specifies the model to train: Retriever (`RE`) or Reader (`QA`).
- `lr`: Sets the learning rate.
- `batch-size`: Defines the batch size per step.
- `accumulation-steps`: Determines how many steps to accumulate gradients for before updating model weights (effective batch size = `batch-size` * `accumulation-steps`; see the gradient-accumulation sketch below).
- `epoch`: Number of training epochs.
- `seed`: Random seed for reproducibility.
- `re-model` / `qa-model`: Chooses the backbone model for the Retriever (`Electra`/`Roberta`) or Reader (`Deberta`/`Roberta`) task.
- `LDLA-decay-rate`: Specifies the decay rate for the LDLA algorithm (see the sketch directly below).
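To give an intuition for `LDLA-decay-rate`: LDLA linearly reduces the label smoothing weight as training progresses. The snippet below is a minimal sketch of such a schedule, not the paper's exact formulation; `initial_eps` and the per-epoch decay are illustrative assumptions.

```python
import torch

def smoothed_targets(labels: torch.Tensor, num_classes: int,
                     initial_eps: float, decay_rate: float, epoch: int) -> torch.Tensor:
    """One-hot targets softened by a linearly decaying smoothing weight.

    Illustrative only -- the exact LDLA schedule is defined in the paper and code.
    """
    eps = max(0.0, initial_eps - decay_rate * epoch)  # linear decay, floored at 0
    one_hot = torch.nn.functional.one_hot(labels, num_classes).float()
    # Standard label smoothing: move eps of the probability mass to a uniform prior.
    return one_hot * (1.0 - eps) + eps / num_classes
```

With `initial_eps=0.1` and `decay_rate=0.02`, for instance, the smoothing weight reaches zero after epoch 5 and training finishes on hard targets.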
Additional parameters, such as `data-path` for specifying the dataset directory and `warmupsteps` for setting the number of warmup steps during training, are also supported but not shown in the commands above.
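For readers unfamiliar with gradient accumulation, here is a generic PyTorch sketch of the pattern behind `accumulation-steps` (assuming a Hugging Face-style model whose output exposes `.loss`; this is not the repository's actual training loop):

```python
def train_epoch(model, loader, optimizer, accumulation_steps: int = 2):
    """One epoch with gradient accumulation (illustrative only)."""
    model.train()
    optimizer.zero_grad()
    for step, batch in enumerate(loader):
        # Scale the loss so the accumulated update matches one large-batch step.
        loss = model(**batch).loss / accumulation_steps
        loss.backward()  # gradients accumulate in the parameters' .grad buffers
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()       # one weight update per accumulation window
            optimizer.zero_grad()
```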
A few additional notes:
- GPU: Training is optimized for A100 GPUs. Set the random seed (`--seed`) explicitly so that runs can be reproduced.
- Model Support: RoBERTa is supported as an alternative backbone by specifying `--re-model` or `--qa-model` accordingly.
- Data Preprocessing: To prevent redundant preprocessing, processed data is cached in the `cache` directory as `.pkl` files. If you alter `preprocess.py`, clear the cache so the new preprocessing takes effect (see the caching sketch after this list).
- Evaluation: In addition to the metrics outlined in the Evaluation section, we report `cl_acc`, which measures the accuracy of answer-type classification.
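The caching behaviour described above follows the common pickle-based load-or-compute pattern. The helper below is a hypothetical sketch (`load_or_preprocess` and `preprocess_fn` are illustrative names; the real logic lives in `preprocess.py`):

```python
import os
import pickle

def load_or_preprocess(raw_path: str, cache_path: str, preprocess_fn):
    """Return cached features if present; otherwise compute and cache them."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    features = preprocess_fn(raw_path)
    os.makedirs(os.path.dirname(cache_path) or ".", exist_ok=True)
    with open(cache_path, "wb") as f:
        pickle.dump(features, f)  # cached as a .pkl file for later runs
    return features
```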
By default, the script saves the top 3 Retriever (`RE`) checkpoints ranked by F1 score and the top 3 Reader (`QA`) checkpoints ranked by joint F1 score, so the best-performing models are always retained.
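A top-k checkpoint saver of this kind can be implemented as below; this is a hypothetical sketch (`TopKCheckpoints` is an illustrative name, not the repository's actual saving logic):

```python
import heapq
import os
import torch

class TopKCheckpoints:
    """Keep only the k best checkpoints by validation score (e.g., F1 or joint F1)."""

    def __init__(self, k: int = 3, out_dir: str = "checkpoints"):
        self.k = k
        self.out_dir = out_dir
        self.heap = []  # min-heap of (score, path); heap[0] is the worst kept score
        os.makedirs(out_dir, exist_ok=True)

    def update(self, model, score: float, step: int):
        path = os.path.join(self.out_dir, f"ckpt_step{step}_score{score:.4f}.pt")
        if len(self.heap) < self.k:
            torch.save(model.state_dict(), path)
            heapq.heappush(self.heap, (score, path))
        elif score > self.heap[0][0]:  # better than the worst kept checkpoint
            _, worst_path = heapq.heapreplace(self.heap, (score, path))
            torch.save(model.state_dict(), path)
            os.remove(worst_path)  # evict the now-outranked checkpoint
```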
## Evaluation
Once training is complete, assess the performance of your model using the official HotpotQA evaluation script. The script, `hotpot_official_evaluate.py`, is available in the `code` directory and provides a full breakdown of your model's predictions.
Execute the following command to evaluate your model's predictions:
```
python code/hotpot_official_evaluate.py --prediction-file model_pred.json --gold-file HotpotQAData/hotpot_dev_distractor_v1.json
```
Replace `model_pred.json` with the path to your model's prediction file, and make sure `hotpot_dev_distractor_v1.json` is located in the `HotpotQAData` folder, as described in the Data section above.
The script outputs several key metrics to gauge the performance of your model:
- sp_em, sp_f1, sp_prec, and sp_recall: These metrics evaluate the correctness of supporting-fact predictions, measuring exact match (em), F1 score (f1), precision (prec), and recall for supporting-fact identification.
- em, f1, prec, and recall: These metrics assess the accuracy of answer span extraction, evaluating how well the model identifies the exact answers within the text.
- joint_em, joint_f1, joint_prec, and joint_recall: These combined metrics provide an overall assessment of your model's performance, taking into account both the accuracy of supporting fact judgments and answer span extraction.
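For reference, the joint metrics combine the answer and supporting-fact scores multiplicatively, per example, following the logic of the official evaluation script; a simplified sketch:

```python
def joint_metrics(ans_em, ans_prec, ans_recall, sp_em, sp_prec, sp_recall):
    """Per-example joint metrics in the style of the official HotpotQA script."""
    joint_prec = ans_prec * sp_prec
    joint_recall = ans_recall * sp_recall
    if joint_prec + joint_recall > 0:
        joint_f1 = 2 * joint_prec * joint_recall / (joint_prec + joint_recall)
    else:
        joint_f1 = 0.0
    joint_em = ans_em * sp_em  # 1 only if both answer and supporting facts match exactly
    return joint_em, joint_f1, joint_prec, joint_recall
```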
## Contact
We welcome your feedback and questions! If you have suggestions or want to get in touch, please email us at yinzhangyue@126.com. For code-related issues or bugs, please open a new issue on GitHub. As this is an initial release, your constructive feedback is invaluable in helping us improve. Thank you for your support and involvement!
## Citation
If you are interested in our work, please cite our paper as follows:
```bibtex
@InProceedings{yin-etal-2023-rethinking,
  author    = "Yin, Zhangyue
               and Wang, Yuxin
               and Hu, Xiannian
               and Wu, Yiguang
               and Yan, Hang
               and Zhang, Xinyu
               and Cao, Zhao
               and Huang, Xuanjing
               and Qiu, Xipeng",
  editor    = "Sun, Maosong
               and Qin, Bing
               and Qiu, Xipeng
               and Jiang, Jing
               and Han, Xianpei
               and Rao, Gaoqi
               and Chen, Yubo",
  title     = "Rethinking Label Smoothing on Multi-Hop Question Answering",
  booktitle = "Chinese Computational Linguistics",
  year      = "2023",
  publisher = "Springer Nature Singapore",
  address   = "Singapore",
  pages     = "72--87",
  isbn      = "978-981-99-6207-5"
}
```