This is the implementation of the EMNLP 2021 paper **Abstract, Rationale, Stance: A Joint Model for Scientific Claim Verification**. We verify our approach on the SciFact benchmark dataset.
We briefly describe the dataset as follows; for details about this dataset, please refer to SCIVER. Each claim in the dataset is annotated with:
- A list of abstracts from the corpus containing relevant evidence.
- A label indicating whether each abstract Supports or Refutes the claim.
- All evidence sets found in each abstract that justify the label. An evidence set is a collection of sentences that, taken together, verifies the claim. Evidence sets can be one or more sentences.
An example of a claim paired with evidence from two abstracts is shown below.
```python
{
    "id": 52,
    "claim": "ALDH1 expression is associated with poorer prognosis for breast cancer primary tumors.",
    "evidence": {
        "11": [                          # 2 evidence sets in document 11 support the claim.
            {"sentences": [0, 1],        # Sentences 0 and 1, taken together, support the claim.
             "label": "SUPPORT"},
            {"sentences": [11],          # Sentence 11, on its own, supports the claim.
             "label": "SUPPORT"}
        ],
        "15": [                          # A single evidence set in document 15 supports the claim.
            {"sentences": [4],
             "label": "SUPPORT"}
        ]
    },
    "cited_doc_ids": [11, 15]
}
```
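If you want to inspect the data programmatically, the claims file can be read line by line as JSON. Below is a minimal sketch; the file name `claims_train.jsonl` follows the standard SciFact naming and may differ in your local setup:

```python
import json

# Load the SciFact claims file (one JSON object per line).
with open("claims_train.jsonl") as f:
    claims = [json.loads(line) for line in f]

claim = claims[0]
print(claim["claim"])

# Iterate over the annotated evidence: keys are abstract (document) ids,
# values are lists of evidence sets with their stance labels.
for doc_id, evidence_sets in claim.get("evidence", {}).items():
    for ev in evidence_sets:
        print(doc_id, ev["label"], ev["sentences"])
```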
We evaluate our approach following the evaluation method used by SciFact and SCIVER.
Two evaluation tasks are used. We briefly describe them as follows. For details about the evaluation method, please refer to the URLs.
**Abstract-level evaluation**
Abstract-level evaluation is similar to the FEVER score, described in the FEVER paper (Thorne et al., 2018). A predicted abstract is Correct if:
- The predicted abstract is a relevant abstract.
- The abstract's predicted Support or Refute label matches its gold label.
- The abstract's predicted evidence sentences contain at least one full gold evidence set. Inspired by the FEVER score, the number of predicted sentences is limited to 3.
We then compute the Precision (P), Recall (R), and F1-score (F1) over all predicted abstracts.
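As an illustration only, a simplified sketch (not the official SciFact/SCIVER evaluation code) of checking the abstract-level criterion for a single (claim, abstract) pair could look like this; the function and argument names are ours:

```python
def abstract_prediction_correct(pred_label, pred_sentences, gold_label, gold_evidence_sets):
    """Check the abstract-level criterion for one (claim, abstract) pair.

    pred_sentences: predicted rationale sentence indices (at most 3 are counted).
    gold_evidence_sets: list of gold evidence sets, each a list of sentence indices.
    An abstract with no gold evidence sets is not a relevant abstract.
    """
    if not gold_evidence_sets:           # not a relevant abstract
        return False
    if pred_label != gold_label:         # Support/Refute label must match
        return False
    predicted = set(pred_sentences[:3])  # FEVER-style limit of 3 predicted sentences
    # At least one full gold evidence set must be contained in the prediction.
    return any(set(ev).issubset(predicted) for ev in gold_evidence_sets)
```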
**Sentence-level evaluation**
Sentence-level evaluation scores the correctness of the individual predicted evidence sentences. A predicted sentence is Correct if:
- The abstract containing the sentence is labeled correctly as Support or Refute.
- The sentence is part of some gold evidence set.
- All other sentences in that same gold evidence set are also identified by the model as evidence sentences.
We then compute the Precision (P), Recall (R), and F1-score (F1) over all predicted evidence sentences.
Here's a simple step-by-step example showing how these metrics are calculated.
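In the same illustrative spirit, a simplified sketch with our own function names (not the official evaluation script) of the sentence-level check and the P/R/F1 computation might look like:

```python
def sentence_correct(sent_idx, pred_label, pred_sentences, gold_label, gold_evidence_sets):
    """Check whether a single predicted evidence sentence counts as correct."""
    if pred_label != gold_label:
        return False
    predicted = set(pred_sentences)
    for ev in gold_evidence_sets:
        # The sentence must belong to a gold evidence set whose other
        # sentences were also predicted as evidence.
        if sent_idx in ev and set(ev).issubset(predicted):
            return True
    return False


def precision_recall_f1(num_correct, num_predicted, num_gold):
    precision = num_correct / num_predicted if num_predicted else 0.0
    recall = num_correct / num_gold if num_gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1
```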
We recommend you create an anaconda environment:
```
conda create --name scifact python=3.7 conda-build
```
Then, from the scifact project root, run
```
conda develop .
```
Then, install the Python requirements:
```
pip install -r requirements.txt
```
If you encounter any installation problem regarding sent2vec, please check their repo. The BioSentVec model is available here.
The checkpoints of our ARSJoint model (trained on the training set) are available here (ARSJoint (RoBERTa-large), ARSJoint w/o RR (RoBERTa-large), ARSJoint (BioBERT-large), ARSJoint w/o RR (BioBERT-large)).
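If you only want to inspect a downloaded checkpoint before using it with **main.py**, a minimal PyTorch sketch is shown below; the file path is a placeholder, and the exact contents depend on how the checkpoint was saved:

```python
import torch

# Load a downloaded checkpoint on CPU and list part of its contents.
checkpoint = torch.load("path/to/ARSJoint_checkpoint.pt", map_location="cpu")  # placeholder path

if isinstance(checkpoint, dict):
    # Usually a state_dict (parameter name -> tensor) or a dict wrapping one.
    for key in list(checkpoint)[:10]:
        print(key)
else:
    print(type(checkpoint))
```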
Run **OptunaMain.py** to tune the hyperparameters. The tuned values for each model are listed below (a "-" means the hyperparameter does not apply to that variant):

| model |  |  |  |  |
|---|---|---|---|---|
| ARSJoint w/o RR (RoBERTa-large) | 2.7 | 11.7 | 2.2 | - |
| ARSJoint (RoBERTa-large) | 0.9 | 11.1 | 2.6 | 2.2 |
| ARSJoint w/o RR (BioBERT-large) | 0.1 | 10.8 | 4.7 | - |
| ARSJoint (BioBERT-large) | 0.2 | 12.0 | 1.1 | 1.9 |
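**OptunaMain.py** performs the actual search. As a rough sketch of what Optuna-based tuning of such loss weights typically looks like (the parameter names, search ranges, and the `train_and_evaluate` stand-in below are illustrative assumptions, not the exact setup of this repository):

```python
import optuna


def train_and_evaluate(lambda_rationale, lambda_stance, lambda_reg):
    """Hypothetical stand-in: train ARSJoint with the given loss weights and
    return a dev-set score (e.g. sentence-level F1). Replace this with the
    real training/evaluation call; a dummy value is returned so the sketch runs."""
    return 0.0


def objective(trial):
    # Hypothetical hyperparameter names and search ranges.
    lambda_rationale = trial.suggest_float("lambda_rationale", 0.1, 15.0)
    lambda_stance = trial.suggest_float("lambda_stance", 0.1, 15.0)
    lambda_reg = trial.suggest_float("lambda_reg", 0.1, 15.0)
    return train_and_evaluate(lambda_rationale, lambda_stance, lambda_reg)


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```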
The files **AbstractRetrieval.py** and **BioSenVecAbstractRetrieval.py** are the scripts for selecting the top-k candidate abstracts. Note that if you use **BioSenVecAbstractRetrieval.py**, please run **ComputeBioSentVecAbstractEmbedding.py** first.
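As a rough sketch of BioSentVec-based retrieval (using the sent2vec API; the model path, the joining of abstract text, and the cosine-similarity top-k ranking are illustrative assumptions, not the exact code of **BioSenVecAbstractRetrieval.py**):

```python
import numpy as np
import sent2vec

# Load the pretrained BioSentVec model (path is a placeholder).
model = sent2vec.Sent2vecModel()
model.load_model("BioSentVec_PubMed_MIMICIII-bigram_d700.bin")


def embed(text):
    # embed_sentence returns a (1, dim) array; take the single row.
    return model.embed_sentence(text.lower())[0]


def top_k_abstracts(claim, abstracts, k=3):
    """Rank abstracts by cosine similarity between claim and abstract embeddings.

    abstracts: dict mapping doc_id -> abstract text (e.g. title plus sentences
    joined into one string). Returns the k most similar doc_ids.
    """
    claim_vec = embed(claim)
    scores = {}
    for doc_id, text in abstracts.items():
        vec = embed(text)
        denom = np.linalg.norm(claim_vec) * np.linalg.norm(vec) + 1e-8
        scores[doc_id] = float(np.dot(claim_vec, vec) / denom)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```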
Run **main.py** to train or to predict. Use the --state argument to specify whether the run is for training or prediction, and use the --checkpoint argument to specify the checkpoint path.
We compare our ARSJoint approach with Paragraph-Joint, VERISCI, and VERT5ERINI.
If you find our work useful, please cite our paper:
@inproceedings{ARSJoint,
title = "Abstract, Rationale, Stance: A Joint Model for Scientific Claim Verification",
author = "Zhang, Zhiwei and Li, Jiyi and Fukumoto, Fumiyo and Ye, Yanming",
booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
year = "2021",
pages = "3580--3586",
}