This is the implementation of the EMNLP 2021 paper **Abstract, Rationale, Stance: A Joint Model for Scientific Claim Verification**. We verify our approach on the SciFact benchmark dataset.
We briefly describe the dataset as follows; for details about this dataset, please refer to SCIVER. Each claim in the dataset is annotated with:
- A list of abstracts from the corpus containing relevant evidence.
- A label indicating whether each abstract Supports or Refutes the claim.
- All evidence sets found in each abstract that justify the label. An evidence set is a collection of sentences that, taken together, verifies the claim. Evidence sets can be one or more sentences.
An example of a claim paired with evidence from two abstracts is shown below.
```python
{
    "id": 52,
    "claim": "ALDH1 expression is associated with poorer prognosis for breast cancer primary tumors.",
    "evidence": {
        "11": [                          # 2 evidence sets in document 11 support the claim.
            {"sentences": [0, 1],        # Sentences 0 and 1, taken together, support the claim.
             "label": "SUPPORT"},
            {"sentences": [11],          # Sentence 11, on its own, supports the claim.
             "label": "SUPPORT"}
        ],
        "15": [                          # A single evidence set in document 15 supports the claim.
            {"sentences": [4],
             "label": "SUPPORT"}
        ]
    },
    "cited_doc_ids": [11, 15]
}
```
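If you want to inspect the data programmatically, the claims file can be read line by line as JSON. Below is a minimal sketch; the file name `claims_train.jsonl` follows the standard SciFact naming and may differ in your local setup:

```python
import json

# Load the SciFact claims file (one JSON object per line).
with open("claims_train.jsonl") as f:
    claims = [json.loads(line) for line in f]

claim = claims[0]
print(claim["claim"])

# Iterate over the annotated evidence: keys are abstract (document) ids,
# values are lists of evidence sets with their stance labels.
for doc_id, evidence_sets in claim.get("evidence", {}).items():
    for ev in evidence_sets:
        print(doc_id, ev["label"], ev["sentences"])
```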
We evaluate our approach following the evaluation method used by SciFact and SCIVER.
Two evaluation tasks are used. We briefly describe them as follows. For details about the evaluation method, please refer to the URLs.
**Abstract-level evaluation**
Abstract-level evaluation is similar to the FEVER score, described in the FEVER paper (Thorne et al., 2018). A predicted abstract is Correct if:
- The predicted abstract is a relevant abstract.
- The abstract's predicted Support or Refute label matches its gold label.
- The abstract's predicted evidence sentences contain at least one full gold evidence set. Inspired by the FEVER score, the number of predicted sentences is limited to 3.
We then compute the Precision (P), Recall (R), and F1-score (F1) over all predicted abstracts.
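As an illustration only, a simplified sketch (not the official SciFact/SCIVER evaluation code) of checking the abstract-level criterion for a single (claim, abstract) pair could look like this; the function and argument names are ours:

```python
def abstract_prediction_correct(pred_label, pred_sentences, gold_label, gold_evidence_sets):
    """Check the abstract-level criterion for one (claim, abstract) pair.

    pred_sentences: predicted rationale sentence indices (at most 3 are counted).
    gold_evidence_sets: list of gold evidence sets, each a list of sentence indices.
    An abstract with no gold evidence sets is not a relevant abstract.
    """
    if not gold_evidence_sets:           # not a relevant abstract
        return False
    if pred_label != gold_label:         # Support/Refute label must match
        return False
    predicted = set(pred_sentences[:3])  # FEVER-style limit of 3 predicted sentences
    # At least one full gold evidence set must be contained in the prediction.
    return any(set(ev).issubset(predicted) for ev in gold_evidence_sets)
```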
**Sentence-level evaluation**
Sentence-level evaluation scores the correctness of the individual predicted evidence sentences. A predicted sentence is Correct if:
- The abstract containing the sentence is labeled correctly as Support or Refute.
- The sentence is part of some gold evidence set.
- All other sentences in that same gold evidence set are also identified by the model as evidence sentences.
We then compute the Precision (P), Recall (R), and F1-score (F1) over all predicted evidence sentences.
Here's a simple step-by-step example showing how these metrics are calculated.
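In the same illustrative spirit, a simplified sketch with our own function names (not the official evaluation script) of the sentence-level check and the P/R/F1 computation might look like:

```python
def sentence_correct(sent_idx, pred_label, pred_sentences, gold_label, gold_evidence_sets):
    """Check whether a single predicted evidence sentence counts as correct."""
    if pred_label != gold_label:
        return False
    predicted = set(pred_sentences)
    for ev in gold_evidence_sets:
        # The sentence must belong to a gold evidence set whose other
        # sentences were also predicted as evidence.
        if sent_idx in ev and set(ev).issubset(predicted):
            return True
    return False


def precision_recall_f1(num_correct, num_predicted, num_gold):
    precision = num_correct / num_predicted if num_predicted else 0.0
    recall = num_correct / num_gold if num_gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1
```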
We recommend you create an anaconda environment:
```
conda create --name scifact python=3.7 conda-build
```
Then, from the scifact project root, run
```
conda develop .
```
Then, install the Python requirements:
```
pip install -r requirements.txt
```
If you encounter any installation problem regarding sent2vec, please check their repo. The BioSentVec model is available here.
The checkpoints of our ARSJoint model (trained on the training set) are available here (ARSJoint (RoBERTa-large), ARSJoint w/o RR (RoBERTa-large), ARSJoint (BioBERT-large), ARSJoint w/o RR (BioBERT-large)).
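If you only want to inspect a downloaded checkpoint before using it with **main.py**, a minimal PyTorch sketch is shown below; the file path is a placeholder, and the exact contents depend on how the checkpoint was saved:

```python
import torch

# Load a downloaded checkpoint on CPU and list part of its contents.
checkpoint = torch.load("path/to/ARSJoint_checkpoint.pt", map_location="cpu")  # placeholder path

if isinstance(checkpoint, dict):
    # Usually a state_dict (parameter name -> tensor) or a dict wrapping one.
    for key in list(checkpoint)[:10]:
        print(key)
else:
    print(type(checkpoint))
```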
Run **OptunaMain.py** to tune the hyperparameters. The tuned values for each model are listed below (a "-" means the hyperparameter does not apply to that variant):

| model |  |  |  |  |
|---|---|---|---|---|
| ARSJoint w/o RR (RoBERTa-large) | 2.7 | 11.7 | 2.2 | - |
| ARSJoint (RoBERTa-large) | 0.9 | 11.1 | 2.6 | 2.2 |
| ARSJoint w/o RR (BioBERT-large) | 0.1 | 10.8 | 4.7 | - |
| ARSJoint (BioBERT-large) | 0.2 | 12.0 | 1.1 | 1.9 |
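**OptunaMain.py** performs the actual search. As a rough sketch of what Optuna-based tuning of such loss weights typically looks like (the parameter names, search ranges, and the `train_and_evaluate` stand-in below are illustrative assumptions, not the exact setup of this repository):

```python
import optuna


def train_and_evaluate(lambda_rationale, lambda_stance, lambda_reg):
    """Hypothetical stand-in: train ARSJoint with the given loss weights and
    return a dev-set score (e.g. sentence-level F1). Replace this with the
    real training/evaluation call; a dummy value is returned so the sketch runs."""
    return 0.0


def objective(trial):
    # Hypothetical hyperparameter names and search ranges.
    lambda_rationale = trial.suggest_float("lambda_rationale", 0.1, 15.0)
    lambda_stance = trial.suggest_float("lambda_stance", 0.1, 15.0)
    lambda_reg = trial.suggest_float("lambda_reg", 0.1, 15.0)
    return train_and_evaluate(lambda_rationale, lambda_stance, lambda_reg)


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```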
The files **AbstractRetrieval.py** and **BioSenVecAbstractRetrieval.py** are the scripts for selecting the top-k candidate abstracts. Note that if you use **BioSenVecAbstractRetrieval.py**, please run **ComputeBioSentVecAbstractEmbedding.py** first.
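As a rough sketch of BioSentVec-based retrieval (using the sent2vec API; the model path, the joining of abstract text, and the cosine-similarity top-k ranking are illustrative assumptions, not the exact code of **BioSenVecAbstractRetrieval.py**):

```python
import numpy as np
import sent2vec

# Load the pretrained BioSentVec model (path is a placeholder).
model = sent2vec.Sent2vecModel()
model.load_model("BioSentVec_PubMed_MIMICIII-bigram_d700.bin")


def embed(text):
    # embed_sentence returns a (1, dim) array; take the single row.
    return model.embed_sentence(text.lower())[0]


def top_k_abstracts(claim, abstracts, k=3):
    """Rank abstracts by cosine similarity between claim and abstract embeddings.

    abstracts: dict mapping doc_id -> abstract text (e.g. title plus sentences
    joined into one string). Returns the k most similar doc_ids.
    """
    claim_vec = embed(claim)
    scores = {}
    for doc_id, text in abstracts.items():
        vec = embed(text)
        denom = np.linalg.norm(claim_vec) * np.linalg.norm(vec) + 1e-8
        scores[doc_id] = float(np.dot(claim_vec, vec) / denom)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```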
Run **main.py** to train or to predict. Use the --state argument to specify whether the run is for training or prediction, and use the --checkpoint argument to specify the checkpoint path.
We compare our ARSJoint approach with Paragraph-Joint, VERISCI, and VERT5ERINI.
If you find our work useful, please cite our paper:
@inproceedings{ARSJoint,
title = "Abstract, Rationale, Stance: A Joint Model for Scientific Claim Verification",
author = "Zhang, Zhiwei and Li, Jiyi and Fukumoto, Fumiyo and Ye, Yanming",
booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
year = "2021",
pages = "3580--3586",
}