/CofeNet

Official code of COLING 2022 paper "CofeNet: Context and Former-Label Enhanced Net for Complicated Quotation Extraction"

Primary LanguagePythonApache License 2.0Apache-2.0

CofeNet

This is the source code of COLING 2022 paper "CofeNet: Context and Former-Label Enhanced Net for Complicated Quotation Extraction". See our paper for more details.

Abstract: Quotation extraction aims to extract quotations from written text. There are three components in a quotation: source refers to the holder of the quotation, cue is the trigger word(s), and content is the main body. Existing solutions for quotation extraction mainly utilize rulebased approaches and sequence labeling models. While rule-based approaches often lead to low recalls, sequence labeling models cannot well handle quotations with complicated structures. In this paper, we propose the Context and Former-Label Enhanced Net (CofeNet) for quotation extraction. CofeNet is able to extract complicated quotations with components of variable lengths and complicated structures. On two public datasets (i.e., PolNeAR and Riqua) and one proprietary dataset (i.e., PoliticsZH), we show that our CofeNet achieves state-of-the-art performance on complicated quotation extraction.

1. Setup

Environment

# Python version==3.7
git clone https://github.com/cofe-ai/CofeNet.git
cd CofeNet
pip install -r requirements.txt

Datasets

The data set is in the ./res directory. Here we give two datasets polnear and riqua in our paper. You can store other datasets here for the framework to read.

./res
├── polnear
│   ├── tag.txt
│   ├── test.txt
│   ├── train.txt
│   ├── valid.txt
│   └── voc.txt
├── riqua
│   ├── tag.txt
│   ├── test.txt
│   ├── train.txt
│   ├── valid.txt
│   └── voc.txt
├── others
└── ...

If you want to use other datasets, you need to build 5 files for each dataset. The file names do not change:

  • train.txt, test.txt, valid.txt: Structured Dataset.

Each item of data is stored in a line by json. The key "tokens" is the text words sequence, and "labels" is the corresponding sequence label tag.

{"tokens": ["WikiLeaks", "claims", "`", "state", ...], "labels": ["B-source", "B-cue", "B-content", "I-content", ...]}
  • tag.txt: The set of "labels" in the dataset.
  • voc.txt: Tokens vocabulary for non-pretrained model(i.e., LSTM).

Experiment Configure

Configuration files are stored in the conf/setting directory. Here we give the experimental configuration's name(exp_name) in the paper so that you can quickly reproduce the experimental results. You can also configure your experiments here.

Base Model Dateset Base with CRF with Cofe Dateset Base with CRF with Cofe
Embedding polnear pn_emb pn_emb_crf pn_emb_cofe riqua rq_emb rq_emb_crf rq_emb_cofe
CNN polnear pn_cnn pn_cnn_crf pn_cnn_cofe riqua rq_cnn rq_cnn_crf rq_cnn_cofe
GRU polnear pn_gru pn_gru_crf pn_gru_cofe riqua rq_gru rq_gru_crf rq_gru_cofe
LSTM polnear pn_lstm pn_lstm_crf pn_lstm_cofe riqua rq_lstm rq_lstm_crf rq_lstm_cofe
BiLSTM polnear pn_blstm pn_blstm_crf pn_blstm_cofe riqua rq_blstm rq_blstm_crf rq_blstm_cofe
BiLSTM L2 polnear pn_blstm2 pn_blstm2_crf pn_blstm2_cofe riqua rq_blstm2 rq_blstm2_crf rq_blstm2_cofe
BERT polnear pn_bert pn_bert_crf pn_bert_cofe riqua rq_bert rq_bert_crf rq_bert_cofe
BERT-CNN polnear pn_bert_cnn riqua rq_bert_cnn
BERT-LSTM polnear pn_bert_lstm riqua rq_bert_lstm
BERT-BiLSTM polnear pn_bert_blstm pn_bert_blstm_crf riqua rq_bert_blstm rq_bert_blstm_crf

Trained model

Download the trained model. Save in ./conf/models. Reproduce our results by Evaluate.

2. Run

Train

(a) Run the code

# Cofe for polnear
python run_train.py --exp_name pn_emb_cofe --trn_name v1 --eval_per_step 250 --max_epoch 15 --batch_size 32 --gpu 0
python run_train.py --exp_name pn_cnn_cofe --trn_name v1 --eval_per_step 250 --max_epoch 15 --batch_size 32 --gpu 0
python run_train.py --exp_name pn_gru_cofe --trn_name v1 --eval_per_step 250 --max_epoch 15 --batch_size 32 --gpu 0
python run_train.py --exp_name pn_lstm_cofe --trn_name v1 --eval_per_step 250 --max_epoch 15 --batch_size 32 --gpu 0
python run_train.py --exp_name pn_blstm_cofe --trn_name v1 --eval_per_step 250 --max_epoch 15 --batch_size 32 --gpu 0
python run_train.py --exp_name pn_blstm2_cofe --trn_name v1 --eval_per_step 250 --max_epoch 15 --batch_size 32 --gpu 0
python run_train.py --exp_name pn_bert_cofe --trn_name v1 --eval_per_step 500 --max_epoch 6 --batch_size 15 --bert_learning_rate 5e-5 --gpu 0

# Cofe for riqua
python run_train.py --exp_name rq_emb_cofe --trn_name v1 --eval_per_step 10 --max_epoch 20 --batch_size 32 --gpu 0
python run_train.py --exp_name rq_cnn_cofe --trn_name v1 --eval_per_step 10 --max_epoch 20 --batch_size 32 --gpu 0
python run_train.py --exp_name rq_gru_cofe --trn_name v1 --eval_per_step 10 --max_epoch 20 --batch_size 32 --gpu 0
python run_train.py --exp_name rq_lstm_cofe --trn_name v1 --eval_per_step 10 --max_epoch 20 --batch_size 32 --gpu 0
python run_train.py --exp_name rq_blstm_cofe --trn_name v1 --eval_per_step 10 --max_epoch 20 --batch_size 32 --gpu 0
python run_train.py --exp_name rq_blstm2_cofe --trn_name v1 --eval_per_step 10 --max_epoch 20 --batch_size 32 --gpu 0
python run_train.py --exp_name rq_bert_cofe --trn_name v1 --eval_per_step 10 --max_epoch 20 --batch_size 15 --bert_learning_rate 5e-5 --gpu 0

(b) Check log

You can find log files in ./log. For an experiment, here you can find these files:

  • Parameter Configuration (i.e., pn_bert_cofe_v1_20221101_040732.json)
  • Training Log (i.e., pn_bert_cofe_v1_20221101_040732.txt)
  • Tensorboard Files (i.e., pn_bert_cofe_v1_20221101_040732/)

In this example, pn_bert_cofe_v1_20221101_040732 is the unique name of an experiment. It contains the experiment name(pn_bert_cofe), training version(v1), and start training time(20221101_040732).

(c) Run Tensorboard

tensorboard --bind_all --port 9900 --logdir ./log

Evaluate

Run the code to print the experimental results of the trained model.

# Cofe for polnear
python run_eval.py --exp_name pn_emb_cofe --gpu 0
python run_eval.py --exp_name pn_cnn_cofe --gpu 0
python run_eval.py --exp_name pn_gru_cofe --gpu 0
python run_eval.py --exp_name pn_lstm_cofe --gpu 0
python run_eval.py --exp_name pn_blstm_cofe --gpu 0
python run_eval.py --exp_name pn_blstm2_cofe --gpu 0
python run_eval.py --exp_name pn_bert_cofe --gpu 0

# Cofe for riqua
python run_eval.py --exp_name rq_emb_cofe --gpu 0
python run_eval.py --exp_name rq_cnn_cofe --gpu 0
python run_eval.py --exp_name rq_gru_cofe --gpu 0
python run_eval.py --exp_name rq_lstm_cofe --gpu 0
python run_eval.py --exp_name rq_blstm_cofe --gpu 0
python run_eval.py --exp_name rq_blstm2_cofe --gpu 0
python run_eval.py --exp_name rq_bert_cofe --gpu 0

3. Experiment

CofeNet Detail Experiment here

4. Cite

If the code help you, please cite the following paper.

@inproceedings{wang-etal-2022-cofenet,
    title = "{C}ofe{N}et: Context and Former-Label Enhanced Net for Complicated Quotation Extraction",
    author = "Wang, Yequan  and
      Li, Xiang  and
      Sun, Aixin  and
      Meng, Xuying  and
      Liao, Huaming  and
      Guo, Jiafeng",
    booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "International Committee on Computational Linguistics",
    url = "https://aclanthology.org/2022.coling-1.215",
    pages = "2438--2449",
    abstract = "Quotation extraction aims to extract quotations from written text. There are three components in a quotation: \textit{source} refers to the holder of the quotation, \textit{cue} is the trigger word(s), and \textit{content} is the main body. Existing solutions for quotation extraction mainly utilize rule-based approaches and sequence labeling models. While rule-based approaches often lead to low recalls, sequence labeling models cannot well handle quotations with complicated structures. In this paper, we propose the \textbf{Co}ntext and \textbf{F}ormer-Label \textbf{E}nhanced \textbf{Net} () for quotation extraction. is able to extract complicated quotations with components of variable lengths and complicated structures. On two public datasets (and ) and one proprietary dataset (), we show that our achieves state-of-the-art performance on complicated quotation extraction.",
}