This repository contains the code for BERT-PLI in our IJCAI-PRICAI 2020 paper: BERT-PLI: Modeling Paragraph-Level Interactions for Legal Case Retrieval.
- `./model/nlp/BertPoint.py`: model for Stage 2, fine-tuning on the paragraph-pair classification task.
- `./model/nlp/BertPoolOutMax.py`: models paragraph-level interactions between documents.
- `./model/nlp/AttenRNN.py`: aggregates paragraph-level representations.
- `./config/nlp/BertPoint.config`: configuration of `./model/nlp/BertPoint.py` (Stage 2, fine-tune).
- `./config/nlp/BertPoolOutMax.config`: configuration of `./model/nlp/BertPoolOutMax.py`.
- `./config/nlp/AttenGRU.config` / `./config/nlp/AttenLSTM.config`: configuration of `./model/nlp/AttenRNN.py` (GRU / LSTM, respectively).
- `./formatter/nlp/BertPairTextFormatter.py`: prepares input for `./model/nlp/BertPoint.py` (Stage 2, fine-tune).
- `./formatter/nlp/BertDocParaFormatter.py`: prepares input for `./model/nlp/BertPoolOutMax.py`.
- `./formatter/nlp/AttenRNNFormatter.py`: prepares input for `./model/nlp/AttenRNN.py`.
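The aggregation step in `./model/nlp/AttenRNN.py` (attention over the RNN hidden states of the N paragraph representations) can be sketched in NumPy. This is a simplified stand-in, not the repository's implementation: the GRU/LSTM encoding is skipped, and `H` (hidden states) and `u` (learned attention query) are random placeholders.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_aggregate(H, u):
    """Collapse N hidden states (N x d) into a single document vector.

    H stands in for the GRU/LSTM outputs over the N paragraph
    representations; u is a learned attention query vector (both are
    illustrative here).
    """
    scores = H @ u            # one relevance score per paragraph, shape (N,)
    alpha = softmax(scores)   # attention weights, sum to 1
    return alpha @ H          # weighted sum -> document vector, shape (d,)

# Toy usage with random "hidden states" (N=5 paragraphs, d=8 dims).
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))
u = rng.normal(size=8)
doc_vec = attention_aggregate(H, u)
```

In the actual model the resulting document vector is fed to a classifier that predicts relevance.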
Examples of input data. Note that we cannot make the raw data public according to the memorandum we signed for the dataset. The examples here have been processed manually and differ from the real data.
- `./examples/task2/data_sample.json`: example input for Stage 2 (fine-tune). The format:

  ```
  {
      "guid": "queryID_paraID",
      "text_a": text of the decision paragraph,
      "text_b": text of the candidate paragraph,
      "label": 0 or 1
  }
  ```
- `./examples/task1/case_para_sample.json`: example input used in `./config/nlp/BertPoolOutMax.config`. The format:

  ```
  {
      "guid": "queryID_docID",
      "q_paras": [...], // a list of paragraphs in the query case
      "c_paras": [...], // a list of paragraphs in the candidate case
      "label": 0 // 0 or 1, denoting relevance
  }
  ```
- `./examples/task1/embedding_sample.json`: example input used in `./config/nlp/AttenGRU.config` and `./config/nlp/AttenLSTM.config`. The format:

  ```
  {
      "guid": "queryID_docID",
      "res": [[...], ..., [...]], // N * 768, the result of BertPoolOutMax
      "label": 0 // 0 or 1, denoting relevance
  }
  ```
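How the N × 768 `res` matrix arises can be sketched as a max-pooling over candidate paragraphs: for each of the N query paragraphs, BERT encodes its pair with each of the M candidate paragraphs, and the element-wise maximum over the M pair representations is kept. In this sketch random vectors stand in for the BERT CLS embeddings, and the paragraph counts are illustrative.

```python
import numpy as np

# Stand-in for BERT: random CLS vectors for each (query-para, cand-para) pair.
rng = np.random.default_rng(0)
N, M, d = 4, 6, 768                    # illustrative paragraph counts
pair_cls = rng.normal(size=(N, M, d))  # one 768-d vector per paragraph pair

# Element-wise max over the M candidate paragraphs -> the N x 768 "res" matrix.
res = pair_cls.max(axis=1)
```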
- `poolout.py` / `train.py` / `test.py`: main entry points for pooling out, training, and testing.
- See `requirements.txt`.
- Stage 1: BM25 Selection. The BM25 score is calculated according to the standard scoring function. We set $k_1 = 1.5$, $b = 0.75$.
- Stage 2: BERT Fine-tuning:

  ```
  python3 train.py -c config/nlp/BertPoint.config -g [GPU_LIST]
  ```
- Stage 3:
  - Get paragraph-level interactions by BERT:

    ```
    python3 poolout.py -c config/nlp/BertPoolOutMax.config -g [GPU_LIST] --checkpoint [path of BERT checkpoint] --result [path to save results]
    ```

  - Train:

    ```
    python3 train.py -c config/nlp/AttenGRU.config -g [GPU_LIST]
    python3 train.py -c config/nlp/AttenLSTM.config -g [GPU_LIST]
    ```

  - Test:

    ```
    python3 test.py -c config/nlp/AttenGRU.config -g [GPU_LIST] --checkpoint [path of model checkpoint] --result [path to save results]
    python3 test.py -c config/nlp/AttenLSTM.config -g [GPU_LIST] --checkpoint [path of model checkpoint] --result [path to save results]
    ```
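The BM25 scoring used in Stage 1, with the parameters $k_1 = 1.5$ and $b = 0.75$ quoted above, can be sketched as follows. The function and variable names are illustrative, and the idf form used here (`log(1 + (N - df + 0.5)/(df + 0.5))`) is one common variant; the repository may use a different one.

```python
import math

def bm25_score(query_terms, doc_terms, df, n_docs, avgdl, k1=1.5, b=0.75):
    """Okapi BM25 score of a document for a query (a sketch).

    df: document frequency per term; n_docs: corpus size;
    avgdl: average document length in terms.
    """
    dl = len(doc_terms)
    score = 0.0
    for t in set(query_terms):
        tf = doc_terms.count(t)           # term frequency in the document
        if tf == 0 or t not in df:
            continue                      # term contributes nothing
        idf = math.log(1.0 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
    return score

# Toy usage: score a short document against a two-term query.
doc = ["the", "court", "allowed", "the", "appeal"]
s = bm25_score(["appeal", "court"], doc, {"appeal": 10, "court": 50},
               n_docs=1000, avgdl=5.0)
```

In Stage 1 this score ranks the candidate pool so that only the top candidates are passed to BERT.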
Please visit COLIEE 2019 to apply for the whole dataset.
Please email shaoyq18@mails.tsinghua.edu.cn for the checkpoint of fine-tuned BERT.
We follow the evaluation metrics in COLIEE 2019. Note that results should be evaluated on the whole document pool (e.g., 200 candidate documents for each query case).
Please refer to the configuration files for the parameters of each step.
For more details, please refer to our paper BERT-PLI: Modeling Paragraph-Level Interactions for Legal Case Retrieval. If you have any questions, please email shaoyq18@mails.tsinghua.edu.cn .
```
@inproceedings{shao2020bert,
  title={BERT-PLI: Modeling Paragraph-Level Interactions for Legal Case Retrieval},
  author={Shao, Yunqiu and Mao, Jiaxin and Liu, Yiqun and Ma, Weizhi and Satoh, Ken and Zhang, Min and Ma, Shaoping},
  booktitle={Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20},
  pages={3501--3507},
  year={2020}
}
```