Automatic identification of factual claims that are worthy of a fact-check
FactRank (factrank.org) is a novel claim detection tool for journalists specifically created for the Dutch language. To the best of our knowledge, this is the first and still the only such tool for Dutch. FactRank thus complements existing online claim detection tools for English and (a small number of) other languages. FactRank performs similarly to claim detection in ClaimBuster, the state-of-the-art fact-checking tool for English. Our comparisons with a human baseline also indicate that given how much even expert human fact-checkers disagree, there may be a natural “upper bound” on the accuracy of check-worthiness detection by machine-learning methods.
PAPER: https://www.sciencedirect.com/science/article/abs/pii/S2468696420300549
More information: https://factrank.org/#/about
pip install "git+https://github.com/lejafar/FactRank.git#egg=factrank"
requires poetry (>= 1.0.0)
git clone https://github.com/lejafar/FactRank.git
cd FactRank
make # creates virtual-env and installs pre-commit hooks
import factrank
text = "Het aantal mensen dat sterft aan covid blijft zorgwekkend oplopen."
(confidence, statement), = factrank.checkworthyness(text)
print("statement": statement)
print("checkworthyness": confidence)
All data is available here.
The original dataset statements_{train,test,val}.csv
has been coded by journalism students from the University of Leiden in 2019, more information about the process is available in the paper.
The data was coded in 3 categories:
- NF: Not factual
- FR: Factual and relevant
- FNR: Factual and non-relevant
The online tool, available at factrank.org allows users to give feedback on whether they think a statement is worth a factcheck or not, this yield the following binary classification:
- FR: Factual and relevant
- NFR: Not Factual and relevant
In order to combine this information we can either:
- only use the positive examples (FR)
- combine the negative examples (FR+FNR) from the original dataset into a (NFR) category and discard the distinction between them
Currently only the first option has been explored due to time constraints. The original dataset combined with the positive examples can be found here.
The model is an implementation of Yoon Kim's Convolutional Neural Networks for Sentence Classification. Experiments have been conducted with used more convolutional layers but this did not yield an increase in performance.
Pre-processing mainly takes place in tokenize.py
. It handles cutting text into sentences, cleans up each sentence and adds NUM
POS-tags.
The built-in sentencizer in https://spacy.io/ did not meet our requirements, it did not handle cases like a .
inside quotes, etc...
- normalization of quotes, whitespaces, dots, ...
- removal of unwanted symbols
§
,…
,=
,•
- remove leading and trailing Twitter
@
and#
tags
Word embeddings were taken from https://github.com/clips/dutchembeddings#embeddings (COW, Big). These embeddings are being held static during training for empirical reasons.
The list of embeddings included in this repo cow-big-slim.txt
is a subset, that contains all words that occur in our training data.
These embeddings are also used during inference and are embedded in the pickled statement processor factnet.statement-processor.pth
Experiments have been done using a (frozen) pre-trained Bert model as a backbone, the implementation of FactNetTransformer
can be found in factnet.py
. The increase in performance was limited, but the increase in required compute was very significant. That's why we're currently using the CNN based model by default.
Training can be started as follows:
poetry run python -m factrank.train --run experiment1 --dropout 0.8
This will use the defaults in options.py
and overwrite the values that are given as CLI arguments (ex. run
& dropout
). A folder log/experiment1
will be created, containing the logs, model weights and the options used.