Current causal text mining datasets vary in objectives, data coverage, and annotation schemes. This inconsistency has hindered model development and fair comparison of model performance. In addition, few datasets include cause-effect span annotations, which are needed for end-to-end causal relation extraction. Therefore, we introduce UniCausal, a unified benchmark and model for causal text mining, based on six popular causal datasets and three common tasks.
The six datasets reflect a variety of sentence lengths, linguistic constructions, argument types, and more. The three tasks are:
(I) Sequence Classification
(II) Cause-Effect Span Detection
(III) Pair Classification
For more details and analysis, please refer to our corresponding paper titled "UniCausal: Unified benchmark and model for causal text mining".
Create a virtual environment and install the dependencies listed in requirements.txt. If using conda, you may install the packages using extended_requirements.txt.
A key novelty of our framework is that once users download our repository, they can directly "call" the datasets to design Causal Text Mining models.
We provide a tutorial on loading the datasets at tutorials/Loading_CTM_datasets.ipynb. The main function to call is as follows:
from _datasets.unifiedcre import load_cre_dataset, available_datasets
print('List of available datasets:', available_datasets)
"""
Example case of loading AltLex and BECAUSE dataset,
without adding span texts to seq texts, span augmentation or user-provided datasets,
and load both training and validation datasets.
"""
load_cre_dataset(dataset_name=['altlex','because'], do_train_val=True, data_dir='../data')
We adapted the Hugging Face Sequence Classification and Token Classification scripts to create a baseline for each task. The scripts are as follows:
(I) run_seqbase.py: Sequence Classification
(II) run_tokbase.py: Token Classification a.k.a. Cause-Effect Span Detection
(III) run_pairbase.py: Pair Classification
For each task, we uploaded our bert-base-cased model trained on all datasets to the Hugging Face Hub. Users who wish to plug and play can do so by calling the following pretrained model names directly:
(I) tanfiona/unicausal-seq-baseline: Sequence Classification
(II) tanfiona/unicausal-tok-baseline: Token Classification a.k.a. Cause-Effect Span Detection
(III) tanfiona/unicausal-pair-baseline: Pair Classification
You may also play around with the Hosted Inference API on Huggingface Hub to directly try your own input sentences without any coding!
Sequence Classification, where LABEL_1=Causal and LABEL_0=Non-causal, using the Hosted Inference API on Hugging Face. Try it yourself!
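The sequence classification baseline can also be called from Python. The following is a minimal sketch, assuming the `transformers` library is installed and that the uploaded model works with its standard `text-classification` pipeline; the helper names `readable` and `classify` are hypothetical and are not part of this repository.

```python
# Label names as described above for the sequence classification baseline.
LABEL_NAMES = {"LABEL_1": "Causal", "LABEL_0": "Non-causal"}


def readable(prediction):
    """Replace the raw label in one pipeline output dict with a readable name.

    `prediction` is expected to look like {"label": "LABEL_1", "score": 0.98}.
    Unknown labels are passed through unchanged.
    """
    raw_label = prediction["label"]
    return {
        "label": LABEL_NAMES.get(raw_label, raw_label),
        "score": prediction["score"],
    }


def classify(sentences, model_name="tanfiona/unicausal-seq-baseline"):
    """Classify a list of sentences as Causal or Non-causal (hypothetical helper).

    Assumption: the model on the Hub loads with the standard
    text-classification pipeline. The first call downloads the model.
    """
    # Imported lazily so that `readable` works without transformers installed.
    from transformers import pipeline

    classifier = pipeline("text-classification", model=model_name)
    return [readable(p) for p in classifier(sentences)]
```

Calling `classify(["The flood was caused by heavy rain."])` should return one dict per input sentence, each with a readable `label` and a `score`. The token and pair baselines may expect differently formatted inputs, so this sketch covers only the sequence classification model.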
- AltLex (Hidey and McKeown, 2016)
- BECAUSE 2.0 (Dunietz et al., 2017)
- CausalTimeBank (CTB) (Mirza et al., 2014; Mirza and Tonelli, 2014)
- EventStoryLine V1.0 (ESL) (Caselli and Vossen, 2017)
- Penn Discourse Treebank V3.0 (PDTB) (Webber et al., 2019)
- SemEval 2010 Task 8 (SemEval) (Hendrickx et al., 2010)
Our code is released under the GNU GPL License. For the data, you must refer to the individual datasets' licenses. The following datasets have publicly available licenses:
- BECAUSE 2.0: MIT License
- EventStoryLine V1.0: CC License
- Penn Discourse Treebank V3.0: LDC User Agreement for Non-Members
Unfortunately, we were unable to find licensing information for AltLex, CausalTimeBank and SemEval 2010 Task 8. If you manage to find them, kindly inform us.
If you used our repository or found it helpful in any way, please do cite us in your work:
@inproceedings{DBLP:conf/dawak/TanZN23,
author = {Fiona Anting Tan and
Xinyu Zuo and
See{-}Kiong Ng},
editor = {Robert Wrembel and
Johann Gamper and
Gabriele Kotsis and
A Min Tjoa and
Ismail Khalil},
title = {UniCausal: Unified Benchmark and Repository for Causal Text Mining},
booktitle = {Big Data Analytics and Knowledge Discovery - 25th International Conference,
DaWaK 2023, Penang, Malaysia, August 28-30, 2023, Proceedings},
series = {Lecture Notes in Computer Science},
volume = {14148},
pages = {248--262},
publisher = {Springer},
year = {2023},
url = {https://doi.org/10.1007/978-3-031-39831-5\_23},
doi = {10.1007/978-3-031-39831-5\_23},
timestamp = {Fri, 18 Aug 2023 08:45:01 +0200},
biburl = {https://dblp.org/rec/conf/dawak/TanZN23.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
If you have feedback or features/datasets you would like to contribute, please email us at tan.f[at]u.nus.edu.
[Current version: 1.0.0]