Repository for my masters' thesis.
- check-worthy claim - a claim for which the general public would be interested in knowing the truth
- it's important that the claim is even checkable
- e.g. "I'm thinking about my thesis" is a factual statement, although it isn't checkable
The subtasks needed to solve this problem:
-
claim detection
- identify claims which require verification
- binary classification problem or importance ranking
-
evidence retrieval
- find information to refute or confirm the claim
- important to take into account outer knowledge and essential for verdict justification
- not all information is trustworthy - usually, some information sources are considered to be implicitly trustworthy
-
claim verification
-
verdict prediction
- determines the veracity of the claim based on retrieved evidence
- binary classification problem, multi class classification problem (true, false, not enough information) or multi label classification problem
-
justification production
- produces an explanation for verdict prediction
- using attention weights, logic-based explanation or summarization by generating explanations
-
-
checkworthiness:
- CLEF2023 CheckThat! Task 1: Check-Worthiness in Multimodal and Unimodal Content (CT23)
-
verification:
- FEVER with generated adversarial claims (FEVER)
-
BERT, RoBERTa, and ELECTRA - all base models
-
pretraining:
- 800 steps
- effective batch size - 2048
- optimizer - AdamW
- learning rate - 1e-4
- weight decay - 1e-2
- scheduler - linear, warmup ratio 0.1
-
fine-tuning:
- batch size - 16
- learning rate - 2e-5
- rest is the same as in pretraining
- how much of an improvement does pretraining introduce?
CT23
Model | Train set | Validation set | Test set | Initialization |
---|---|---|---|---|
BERT base | 91.75 | 77.57 | 81.07 | Random |
RoBERTa base | 87.74 | 77.05 | 81.07 | Random |
ELECTRA base | 92.91 | 76.62 | 82.89 | Random |
BERT base | 99.00 | 83.49 | 85.61 | Pretrained |
RoBERTa base | 99.55 | 84.13 | 88.71 | Pretrained |
ELECTRA base | 99.71 | 84.63 | 85.10 | Pretrained |
FEVER
Model | Train set | Validation set | Test set | Initialization |
---|---|---|---|---|
BERT base | 78.04 | 49.67 | 48.15 | Random |
RoBERTa base | 83.35 | 51.12 | 49.24 | Random |
ELECTRA base | 83.64 | 51.10 | 48.10 | Random |
BERT base | 99.75 | 79.75 | 78.13 | Pretrained |
RoBERTa base | 97.24 | 83.26 | 81.61 | Pretrained |
ELECTRA base | 97.16 | 83.17 | 81.11 | Pretrained |
- Task-adaptive pretraining (TAPT) refers to pre-training on the unlabeled training set for a given task
CT23
Model and pretraining method | Train set | Validation set | Test set |
---|---|---|---|
BERT base + shuffle randomization | 99.51 | 83.36 | 83.46 |
BERT base + MLM | 99.98 | 84.27 | 85.43 |
RoBERTa base + shuffle randomization | 99.81 | 84.68 | 86.62 |
RoBERTa base + MLM | 97.89 | 84.51 | 85.43 |
ELECTRA base + ELECTRA | 97.42 | 84.61 | 88.64 |
FEVER
Model and pretraining method | Train set | Validation set | Test set |
---|---|---|---|
BERT base + shuffle randomization | 95.95 | 79.51 | 77.82 |
BERT base + MLM | 99.68 | 79.67 | 77.89 |
RoBERTa base + shuffle randomization | 99.26 | 83.45 | 81.21 |
RoBERTa base + MLM | 95.90 | 83.20 | 81.18 |
ELECTRA base + ELECTRA | 99.64 | 83.65 | 82.01 |
- a modification of the masking procedure in standard MLM
Selective MLM
- Randomly selects 15% of the tokens
+ Randomly selects 15% of the tokens, such that:
+ - 80% of the tokens correspond to important words
+ - 20% of the tokens correspond to other words
Out of the chosen 15% of the tokens:
- Replaces 80% of the tokens with the [MASK] token
- Replaces 10% of the tokens with a random token
- The rest of the tokens are left as is
- important words - named entities, numbers, predicates, subjects and objects
CT23
Model and pretraining method | Train set | Validation set | Test set |
---|---|---|---|
BERT base + selective MLM | 99.45 | 83.82 | 85.94 |
RoBERTa base + selective MLM | 99.36 | 83.48 | 83.24 |
FEVER
Model and pretraining method | Train set | Validation set | Test set |
---|---|---|---|
BERT base + selective MLM | 99.72 | 79.94 | 78.51 |
RoBERTa base + selective MLM | 99.11 | 83.32 | 81.41 |
- adding another pretraining task because not all information important words provide is used
- binary classification token-level task - each token is either important or unimportant
CT23
Model and pretraining method | Train set | Validation set | Test set |
---|---|---|---|
BERT base + selective MLM + multitask SMLM | 99.43 | 83.73 | 83.13 |
RoBERTa base + selective MLM + multitask SMLM | 99.71 | 84.26 | 84.57 |
FEVER
Model and pretraining method | Train set | Validation set | Test set |
---|---|---|---|
BERT base + selective MLM + multitask SMLM | 91.17 | 79.83 | 77.56 |
RoBERTa base + selective MLM + multitask SMLM | 99.15 | 83.49 | 81.18 |
- modifying the input text by adding a special token [IMPORTANT] after each important word
- adds a randomly intialized embedding to the vocabulary
CT23
Model | Train set | Validation set | Test set |
---|---|---|---|
BERT base | 98.64 | 83.38 | 88.03 |
RoBERTa base | 99.34 | 84.39 | 85.79 |
ELECTRA base | 99.01 | 83.32 | 86.37 |
FEVER
Model | Train set | Validation set | Test set |
---|---|---|---|
BERT base | 93.20 | 79.03 | 77.53 |
RoBERTa base | 99.10 | 82.86 | 81.20 |
ELECTRA base | 99.42 | 83.44 | 81.96 |
- first fine-tuning on CT23 and then on FEVER and vice-versa
CT23
Model | Train set | Validation set | Test set |
---|---|---|---|
BERT base | 99.89 | 83.27 | 83.90 |
RoBERTa base | 99.68 | 84.20 | 86.97 |
ELECTRA base | 99.72 | 84.24 | 86.63 |
FEVER
Model | Train set | Validation set | Test set |
---|---|---|---|
BERT base | 99.57 | 78.50 | 77.20 |
RoBERTa base | 99.08 | 82.82 | 81.25 |
ELECTRA base | 99.38 | 83.06 | 80.80 |
- weighting the loss with class weights
- weights are the ratio of the number of all examples and the number of examples for a certain label
CT23
Model | Train set | Validation set | Test set |
---|---|---|---|
BERT base | 98.60 | 84.15 | 86.79 |
RoBERTa base | 99.60 | 84.62 | 85.61 |
ELECTRA base | 99.81 | 84.10 | 86.63 |
FEVER
Model | Train set | Validation set | Test set |
---|---|---|---|
BERT base | 99.71 | 79.71 | 78.27 |
RoBERTa base | 99.25 | 82.88 | 81.56 |
ELECTRA base | 97.11 | 83.55 | 81.81 |
- ranking examples by the difference of 1 and the correct class probability in descending order
- done for the pretrained BERT baseline and the adaptively pretrained BERT with selective MLM
CT23 & BERT baseline
Rank | Claim | Label |
---|---|---|
1 | He thinks wind causes cancer, windmills. | Yes |
2 | The general who was with him said All he ever wants to do is divide people not unite people at all. | Yes |
3 | The Green New Deal is not my plan... | Yes |
4 | So why didn’t he do it for 47 years? | Yes |
5 | And the fact is, I’ve made it very clear, within 100 days, I’m going to send to the United States Congress a pathway to citizenship for over 11 million undocumented people. | No |
6 | And we’re in a circumstance where the President, thus far, still has no plan. | Yes |
7 | Many of your Democrat Governors said, President Trump did a phenomenal job. | Yes |
8 | Yeah, you did say that. | Yes |
9 | Some of these ballots in some states can’t even be opened until election day. | Yes |
10 | Because it costs a lot of money to open them safely. | Yes |
CT23 & SMLM pretrained BERT
Rank | Claim | Label |
---|---|---|
1 | He thinks wind causes cancer, windmills. | Yes |
2 | We took away the individual mandate. | Yes |
3 | And we’re in a circumstance where the President, thus far, still has no plan. | Yes |
4 | And it’s gone through, including the Democrats, in all fairness. | Yes |
5 | The individual mandate – where you have to pay a fortune for the privilege of not having to pay for bad health insurance. | Yes |
6 | The places we had trouble were Democratic-run cities... | Yes |
7 | Manufacturing went in a hole- | Yes |
8 | Dr. Fauci said the opposite. | Yes |
9 | So why didn’t he do it for 47 years? | Yes |
10 | Many of your Democrat Governors said, President Trump did a phenomenal job. | Yes |
- the most prominent failure mode is the lack of contextual information required for determining checkworthiness
- other occasionally occuring mode is lack of specifity when referring to a group
- "Many of your Democrat Governors said" - which governors is it referring to?
FEVER & BERT baseline
Rank | Claim | Evidence | Label |
---|---|---|---|
1 | Stephen Moyer acted in a film adaptation of a comic. | In 1997, Moyer made his big-screen debut landing the lead role in the film adaptation of the long-running comic strip Prince Valiant by Hal Foster, working alongside Ron Perlman and Katherine Heigl. | SUPPORTS |
2 | Emmanuel Macron worked as a banker. | A former civil servant and investment banker, he studied philosophy at Paris Nanterre University, completed a Master’s of Public Affairs at Sciences Po, and graduated from the École nationale d’administration (ENA) in 2004. | SUPPORTS |
3 | Jerome Flynn was born in March. | Jerome Patrick Flynn (born 16 March 1963) is an English actor and singer. | SUPPORTS |
4 | Kevin Bacon played a former child abuser. | Bacon is also known for taking on darker roles such as that of a sadistic guard in Sleepers and troubled former child abuser in a critically acclaimed performance in The Woodsman. | SUPPORTS |
5 | Mandy Moore is a musician. Outside of her musical career, Moore has also branched out into acting. | SUPPORTS |
FEVER & SMLM pretrained BERT
Rank | Claim | Evidence | Label |
---|---|---|---|
1 | American Sniper (book) is about a sniper with 170 officially confirmed kills. | With 255 kills, 160 of them officially confirmed by the Pentagon, Kyle is the deadliest marksman in U.S. military history. | REFUTES |
2 | Lorelai Gilmore's father is named Robert. | The dynamic of single parenthood and the tension between Lorelai and her wealthy parents, Richard (Edward Herrmann) and especially her controlling mother, Emily (Kelly Bishop), form the main theme of the series story line. | REFUTES |
3 | Pearl (Steven Universe) projects a holographic butterfly. | She is a ``Gem'', a fictional alien being that exists as a magical gemstone projecting a holographic body. | REFUTES |
4 | Houses became Virginia’s most valuable export in 2002. | Virginia’s economy changed from primarily agricultural to industrial during the 1960s and 1970s, and in 2002 computer chips became the state’s leading export by monetary value. | REFUTES |
5 | Temple of the Dog celebrated the 37th anniversary of their self-titled album. | The band toured in 2016 in celebration of the 25th anniversary of their self-titled album. | REFUTES |
- again, no prominent failure modes
- example with rank 1 in FEVER & SMLM pretrained BERT table shows very human-like judgement:
- 160 kills are confirmed by the Pentagon, but the claim uses 170 which is very close
- the prediction for the example is NOT ENOUGH INFO
- the inclusion of the NOT ENOUGH INFO class might actually pose problems in borderline cases
- pretraining from scratch
- selecting words for masking using TF-IDF scores - completely unsupervised