This repo contains the code for our ACL 2022 paper, "On Length Divergence Bias in Textual Matching Models".
If you find our project useful in your research, please consider citing:
@inproceedings{jiang-etal-2022-length,
title = "On Length Divergence Bias in Textual Matching Models",
author = "Jiang, Lan and Lyu, Tianshu and Lin, Yankai and Chong, Meng and Lyu, Xiaoyong and Yin, Dawei",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2022",
year = "2022",
}
In this work, we study textual matching (TM) models from a new perspective: the length divergence bias. We find that:
- The length divergence heuristic is widespread in prevalent TM datasets, where it provides direct cues for prediction.
- TM models have adopted this heuristic, and the resulting bias can be attributed in part to the models extracting text length information during training.
To alleviate the length divergence bias, we propose an adversarial training method.
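To make the length divergence heuristic concrete, here is a minimal sketch, not the code used in the paper. It assumes length divergence is the absolute difference in whitespace-token counts between the two texts of a pair. The function names, the threshold, and the toy pairs are all hypothetical and chosen only for illustration.

```python
def length_divergence(text_a, text_b):
    """Absolute difference in whitespace-token counts (an assumed definition)."""
    return abs(len(text_a.split()) - len(text_b.split()))


def positive_rate_by_divergence(pairs, threshold):
    """Split pairs at a hypothetical threshold; return the positive-label rate per bucket."""
    counts = {"low": [0, 0], "high": [0, 0]}  # bucket -> [positives, total]
    for text_a, text_b, label in pairs:
        bucket = "high" if length_divergence(text_a, text_b) > threshold else "low"
        counts[bucket][0] += label
        counts[bucket][1] += 1
    return {
        bucket: (pos / total if total else None)
        for bucket, (pos, total) in counts.items()
    }


# Toy pairs invented for illustration only; they do not come from any dataset.
toy_pairs = [
    ("how do I learn python", "how can I learn python", 1),
    ("what is nlp", "what is natural language processing used for in practice today", 0),
    ("is it raining", "is it raining now", 1),
    ("who wrote this", "please tell me everything known about the author of this long book", 0),
]

rates = positive_rate_by_divergence(toy_pairs, threshold=3)
print(rates)
```

If the positive rate differs sharply between the two buckets, as it does by construction in the toy pairs, length divergence alone is a cue for the label.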
conda create -n length python=3.6
conda activate length
git clone https://github.com/jiangllan/LengthBias.git
cd LengthBias && pip install -r requirements.txt
To obtain the balanced datasets, run process.py in the dataset directory:
cd dataset
python process.py
Finally, the ./dataset folder will have the following structure:
|-- dataset
    |-- process.py
    |-- Microblog
    |   |-- dev.csv
    |   |-- train.csv
    |   |-- balanced
    |       |-- dev.csv
    |       |-- dev.inr
    |       |-- train.csv
    |       |-- train.inr
    |-- QQP
    |   |-- dev.csv
    |   |-- train.csv
    |   |-- balanced
    |       |-- dev.csv
    |       |-- dev.inr
    |       |-- train.csv
    |       |-- train.inr
    |-- TrecQA
    |   |-- dev.csv
    |   |-- train.csv
    |   |-- balanced
    |       |-- dev.csv
    |       |-- dev.inr
    |       |-- train.csv
    |       |-- train.inr
    |-- Twitter-URL
        |-- dev.csv
        |-- train.csv
        |-- balanced
            |-- dev.csv
            |-- dev.inr
            |-- train.csv
            |-- train.inr
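As an optional sanity check, the layout above can be verified with a short script. This is a sketch that is not part of the repository. It assumes it is run from the repository root and that each balanced folder holds the four files listed above. The helper names expected_paths and missing_paths are hypothetical.

```python
import os

DATASETS = ["Microblog", "QQP", "TrecQA", "Twitter-URL"]
ORIGINAL_FILES = ["dev.csv", "train.csv"]
BALANCED_FILES = ["dev.csv", "dev.inr", "train.csv", "train.inr"]


def expected_paths(root="dataset"):
    """List every file the structure above says should exist under root."""
    paths = [os.path.join(root, "process.py")]
    for name in DATASETS:
        paths += [os.path.join(root, name, f) for f in ORIGINAL_FILES]
        paths += [os.path.join(root, name, "balanced", f) for f in BALANCED_FILES]
    return paths


def missing_paths(root="dataset"):
    """Return the expected files that are not present on disk."""
    return [p for p in expected_paths(root) if not os.path.isfile(p)]


missing = missing_paths()
if missing:
    print("Missing files:")
    for path in missing:
        print("  " + path)
else:
    print("All expected files are present.")
```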
Coming soon...
jiangl20 at mails dot tsinghua dot edu dot cn