MGTAB: A Multi-Relational Graph-Based Twitter Account Detection Benchmark

Introduction

MGTAB is the first standardized graph-based benchmark for stance and bot detection. MGTAB contains 10,199 expert-annotated users and 7 types of relationships, ensuring high-quality annotation and diversified relations. For more details, please refer to the MGTAB paper.

Distribution of labels in annotations.

| Stance label | Count | Percentage (%) | Bot label | Count | Percentage (%) |
| --- | --- | --- | --- | --- | --- |
| neutral | 3,776 | 37.02 | human | 7,451 | 73.06 |
| against | 3,637 | 35.66 | bot | 2,748 | 26.94 |
| support | 2,786 | 27.32 | | | |

MGTAB contains 10,199 expert-annotated users. MGTAB-large contains the same annotated users plus 400,000 additional unlabeled users.
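
The percentages in the label distribution can be re-derived from the raw counts. The following is a minimal sketch for checking them; the variable and function names are illustrative and not part of the MGTAB code.

```python
# Label counts copied from the annotation table above.
stance_counts = {"neutral": 3776, "against": 3637, "support": 2786}
bot_counts = {"human": 7451, "bot": 2748}


def percentages(counts):
    """Return each label's share of the total, in percent, rounded to 2 places."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 2) for label, n in counts.items()}


# Both tasks annotate the same 10,199 users, so the totals must agree.
assert sum(stance_counts.values()) == sum(bot_counts.values()) == 10199

stance_pct = percentages(stance_counts)
bot_pct = percentages(bot_counts)
print(stance_pct)
print(bot_pct)
```

The bot task is noticeably more imbalanced than the stance task, which is worth keeping in mind when comparing accuracy against precision and recall.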

Multiple relations in MGTAB.

Our proposed dataset has seven types of user relationships.

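MGTAB

| Edge type | followers | friends | mention | reply | quoted | URL | hashtag |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Count | 308,120 | 412,575 | 114,516 | 223,466 | 77,631 | 263,800 | 300,000 |

MGTAB-large

| Edge type | followers | friends | mention | reply | quoted | URL | hashtag |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Count | 31,990,488 | 49,668,723 | 7,135,192 | 1,018,834 | 182,296 | 51,281 | 7,950,896 |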
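
To get a feel for how the edges are spread across the seven relations, the counts above can be summed and compared. This is only a sketch over the numbers in the two tables; the dictionary and function names are illustrative.

```python
# Edge counts copied from the two tables above.
mgtab_edges = {
    "followers": 308120, "friends": 412575, "mention": 114516,
    "reply": 223466, "quoted": 77631, "URL": 263800, "hashtag": 300000,
}
mgtab_large_edges = {
    "followers": 31990488, "friends": 49668723, "mention": 7135192,
    "reply": 1018834, "quoted": 182296, "URL": 51281, "hashtag": 7950896,
}


def edge_shares(edges):
    """Return (total edge count, {relation: share of total in percent})."""
    total = sum(edges.values())
    shares = {name: round(100 * n / total, 2) for name, n in edges.items()}
    return total, shares


for label, edges in [("MGTAB", mgtab_edges), ("MGTAB-large", mgtab_large_edges)]:
    total, shares = edge_shares(edges)
    largest = max(edges, key=edges.get)
    print(f"{label}: {total:,} edges, largest relation: {largest} ({shares[largest]}%)")
```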

Environment

python 3.7
scikit-learn 1.0.2
torch 1.8.1+cu111
torch_cluster 1.5.9
torch_scatter 2.0.6
torch_sparse 0.6.9
torch_spline_conv 1.2.1
torch-geometric 2.0.4
pytorch-lightning 1.5.0
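
Before training, it can help to confirm that the installed packages match the versions listed above. The sketch below is one possible check, not part of the MGTAB code. It assumes the packages were installed with pip under the distribution names shown in `PINNED` (hyphenated forms of the `torch_*` names), and `find_mismatches` is a hypothetical helper.

```python
try:
    from importlib.metadata import PackageNotFoundError, version  # Python 3.8+
except ImportError:  # Python 3.7 needs the importlib_metadata backport
    from importlib_metadata import PackageNotFoundError, version

# Versions from the list above; keys are assumed pip distribution names.
PINNED = {
    "scikit-learn": "1.0.2",
    "torch": "1.8.1+cu111",
    "torch-cluster": "1.5.9",
    "torch-scatter": "2.0.6",
    "torch-sparse": "0.6.9",
    "torch-spline-conv": "1.2.1",
    "torch-geometric": "2.0.4",
    "pytorch-lightning": "1.5.0",
}


def installed_version(name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Return {package: (wanted, found)} for every package whose version differs."""
    mismatches = {}
    for name, wanted in pinned.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


for name, (wanted, found) in find_mismatches(PINNED).items():
    print(f"{name}: wanted {wanted}, found {found or 'not installed'}")
```

An empty output means every listed package matches its pinned version.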

Train Model

To start the training process:

Train GNN models

python MGTAB-GNN.py --task stance --model GCN --relation_select 0 1 --random_seed 0 1 2 3 4
python MGTAB-GNN.py --task bot --model RGCN --relation_select 0 1 --random_seed 0 1 2 3 4
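
The flags above take one or more space-separated integers for `--relation_select` and `--random_seed`. How `MGTAB-GNN.py` parses them is not shown in this README, so the following is only a hypothetical `argparse` sketch that accepts the same command line; `build_parser` and the `required` settings are assumptions.

```python
import argparse


def build_parser():
    # Hypothetical parser: mirrors only the flags shown in the commands above.
    parser = argparse.ArgumentParser(description="Sketch of the MGTAB-GNN.py command line")
    parser.add_argument("--task", choices=["stance", "bot"], required=True)
    parser.add_argument("--model", required=True)
    # nargs="+" collects every following integer into one list.
    parser.add_argument("--relation_select", type=int, nargs="+", required=True)
    parser.add_argument("--random_seed", type=int, nargs="+", required=True)
    return parser


args = build_parser().parse_args(
    "--task stance --model GCN --relation_select 0 1 --random_seed 0 1 2 3 4".split()
)
print(args.task, args.model, args.relation_select, args.random_seed)
```

Read this way, `--random_seed 0 1 2 3 4` requests five runs with different seeds, and `--relation_select 0 1` picks two of the seven relation types by index.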

Train Machine Learning models

python MGTAB-ML.py --task stance --models_list 1 2 3 --random_seed 0 1 2 3 4
python MGTAB-ML.py --task bot --models_list 4 5 6 7 --random_seed 0 1 2 3 4

Train GNN models in parallel on multiple GPUs

python GNN_sample_large.py --task bot --relation_select 0 1 2 3 4 5 6 --model RGT --GPU_num 4
python GNN_sample_large.py --task bot --relation_select 0 1 2 3 4 --model SHGN --GPU_num 4
python GNN_sample_large.py --task stance --relation_select 0 1 --model GCN --GPU_num 4
python GNN_sample_large.py --task stance --relation_select 0 --model GAT --GPU_num 4
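
When several configurations need to be run one after another, the commands above can be assembled programmatically. The sketch below uses only the flags shown in this README; `build_command` and the `runs` list are illustrative and not part of the repository.

```python
def build_command(task, model, relations, gpu_num=4, script="GNN_sample_large.py"):
    """Assemble the argv list for one run, using only flags shown above."""
    return [
        "python", script,
        "--task", task,
        "--relation_select", *[str(r) for r in relations],
        "--model", model,
        "--GPU_num", str(gpu_num),
    ]


# Two of the configurations listed above.
runs = [
    ("stance", "GCN", [0, 1]),
    ("stance", "GAT", [0]),
]
commands = [build_command(task, model, rels) for task, model, rels in runs]
for cmd in commands:
    # To launch a run, pass the list to subprocess.run(cmd, check=True).
    print(" ".join(cmd))
```

Keeping each command as a list rather than a single string avoids shell quoting issues if it is later handed to `subprocess.run`.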

Baseline performance

Stance detection performance on MGTAB

| Method | Type | Accuracy | Precision | Recall | F1-score |
| --- | --- | --- | --- | --- | --- |
| AdaBoost | F | 74.59 $_{1.41}$ | 74.60 $_{1.35}$ | 74.02 $_{1.61}$ | 73.88 $_{1.47}$ |
| Random Forest | F | 79.62 $_{0.68}$ | 80.04 $_{0.43}$ | 78.83 $_{0.98}$ | 79.04 $_{0.82}$ |
| Decision Tree | F | 66.92 $_{0.93}$ | 66.34 $_{1.02}$ | 66.23 $_{1.06}$ | 66.03 $_{0.84}$ |
| SVM | F | 81.23 $_{0.66}$ | 81.40 $_{0.71}$ | 80.86 $_{1.09}$ | 80.71 $_{0.78}$ |
| KNN | F | 76.25 $_{1.32}$ | 75.54 $_{1.41}$ | 75.70 $_{1.37}$ | 75.48 $_{1.37}$ |
| Logistic Regression | F | 79.51 $_{1.00}$ | 79.33 $_{0.98}$ | 78.83 $_{1.17}$ | 78.98 $_{1.11}$ |
| GCN | G | 81.35 $_{0.58}$ | 81.08 $_{0.30}$ | 80.19 $_{0.56}$ | 80.08 $_{0.56}$ |
| GraphSAGE | G | 83.33 $_{1.22}$ | 82.52 $_{1.63}$ | 83.45 $_{0.63}$ | 82.72 $_{1.34}$ |
| GAT | G | 82.19 $_{1.23}$ | 81.72 $_{1.19}$ | 81.68 $_{1.16}$ | 81.04 $_{1.24}$ |
| HGT | G | 83.29 $_{0.44}$ | 81.63 $_{0.58}$ | 81.51 $_{0.76}$ | 81.82 $_{0.34}$ |
| S-HGN | G | 85.32 $_{0.53}$ | 83.93 $_{0.67}$ | 83.65 $_{0.65}$ | 84.42 $_{0.43}$ |
| BotRGCN | G | 84.71 $_{1.43}$ | 83.43 $_{1.23}$ | 84.08 $_{0.94}$ | 84.30 $_{1.44}$ |
| RGT | G | 87.78 $_{0.43}$ | 85.22 $_{0.89}$ | 84.40 $_{0.74}$ | 86.86 $_{0.43}$ |
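
Each result above pairs a main value with a smaller subscript written as `$_{...}$`. This README does not state what the subscript measures; it most likely reflects the spread across the random seeds, but that is an assumption. A minimal sketch for splitting such a cell into its two numbers, with `parse_cell` as a hypothetical helper:

```python
import re

# Matches cells such as "87.78 $_{0.43}$": a number, then a LaTeX subscript.
CELL = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*\$_\{(\d+(?:\.\d+)?)\}\$\s*$")


def parse_cell(cell):
    """Split a result cell into (main value, subscript value) as floats."""
    match = CELL.match(cell)
    if match is None:
        raise ValueError(f"unrecognized cell: {cell!r}")
    return float(match.group(1)), float(match.group(2))


value, subscript = parse_cell("87.78 $_{0.43}$")
print(value, subscript)
```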

Bot detection performance on MGTAB

| Method | Type | Accuracy | Precision | Recall | F1-score |
| --- | --- | --- | --- | --- | --- |
| AdaBoost | F | 90.12 $_{0.92}$ | 88.51 $_{1.33}$ | 89.10 $_{0.92}$ | 87.71 $_{1.10}$ |
| Random Forest | F | 89.52 $_{0.44}$ | 88.92 $_{0.49}$ | 86.72 $_{1.15}$ | 86.83 $_{0.53}$ |
| Decision Tree | F | 87.13 $_{0.51}$ | 83.81 $_{0.72}$ | 83.39 $_{1.06}$ | 83.70 $_{0.74}$ |
| SVM | F | 88.68 $_{1.40}$ | 85.73 $_{1.84}$ | 85.73 $_{1.84}$ | 85.31 $_{1.73}$ |
| KNN | F | 85.78 $_{0.84}$ | 82.28 $_{1.22}$ | 80.49 $_{0.64}$ | 81.28 $_{0.66}$ |
| Logistic Regression | F | 88.49 $_{1.31}$ | 85.69 $_{1.69}$ | 84.41 $_{1.96}$ | 84.97 $_{1.67}$ |
| GCN | G | 85.81 $_{1.32}$ | 77.40 $_{2.12}$ | 84.37 $_{1.73}$ | 78.33 $_{1.67}$ |
| GraphSAGE | G | 88.71 $_{1.24}$ | 85.33 $_{1.83}$ | 86.15 $_{2.55}$ | 85.44 $_{1.08}$ |
| GAT | G | 86.96 $_{1.28}$ | 79.71 $_{2.96}$ | 84.88 $_{1.52}$ | 82.33 $_{2.12}$ |
| HGT | G | 90.28 $_{0.29}$ | 85.35 $_{0.33}$ | 85.97 $_{0.41}$ | 87.52 $_{0.37}$ |
| S-HGN | G | 91.42 $_{0.43}$ | 87.40 $_{0.67}$ | 86.73 $_{0.64}$ | 88.72 $_{0.58}$ |
| BotRGCN | G | 89.60 $_{0.82}$ | 85.21 $_{1.81}$ | 87.07 $_{1.38}$ | 87.16 $_{0.74}$ |
| RGT | G | 92.12 $_{0.37}$ | 88.08 $_{0.43}$ | 86.64 $_{0.25}$ | 90.41 $_{0.47}$ |
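
One way to read the two baseline tables is to compare the best type-F method with the best type-G method on each task. The sketch below does this for accuracy only, using the values reported above; the data layout and the `best_by_type` helper are illustrative.

```python
# Accuracy values copied from the stance and bot tables above: (method, type, accuracy).
accuracy = {
    "stance": [
        ("AdaBoost", "F", 74.59), ("Random Forest", "F", 79.62), ("Decision Tree", "F", 66.92),
        ("SVM", "F", 81.23), ("KNN", "F", 76.25), ("Logistic Regression", "F", 79.51),
        ("GCN", "G", 81.35), ("GraphSAGE", "G", 83.33), ("GAT", "G", 82.19),
        ("HGT", "G", 83.29), ("S-HGN", "G", 85.32), ("BotRGCN", "G", 84.71), ("RGT", "G", 87.78),
    ],
    "bot": [
        ("AdaBoost", "F", 90.12), ("Random Forest", "F", 89.52), ("Decision Tree", "F", 87.13),
        ("SVM", "F", 88.68), ("KNN", "F", 85.78), ("Logistic Regression", "F", 88.49),
        ("GCN", "G", 85.81), ("GraphSAGE", "G", 88.71), ("GAT", "G", 86.96),
        ("HGT", "G", 90.28), ("S-HGN", "G", 91.42), ("BotRGCN", "G", 89.60), ("RGT", "G", 92.12),
    ],
}


def best_by_type(rows, method_type):
    """Return (method, accuracy) for the most accurate method of the given type."""
    candidates = [(name, acc) for name, kind, acc in rows if kind == method_type]
    return max(candidates, key=lambda item: item[1])


for task, rows in accuracy.items():
    best_f = best_by_type(rows, "F")
    best_g = best_by_type(rows, "G")
    gap = round(best_g[1] - best_f[1], 2)
    print(f"{task}: best F = {best_f}, best G = {best_g}, gap = {gap} points")
```

On these numbers the type-G advantage in accuracy is larger for stance detection than for bot detection.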

Licensing

The MGTAB dataset is released under the CC BY-NC-ND 4.0 license. The code in the MGTAB evaluation framework is released under the MIT license.

Datasets download

For SemEval-2016 T6, visit the SemEval2016 repository. For SemEval-2019 T7, visit the SemEval2019 GitHub repository. For TwiBot-20, visit the TwiBot-20 GitHub repository. For TwiBot-22, visit the TwiBot-22 GitHub repository. For other bot detection datasets, visit the Bot Repository.

MGTAB is available on Google Drive. MGTAB-large (which contains 400,000 unlabeled users) is available on Google Drive. We also offer the standardized Cresci-15 on Google Drive. After downloading these datasets, unzip them into the "./Dataset" directory.
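
The unzip step can also be done from Python with the standard library. This is a minimal sketch: `extract_dataset` is a hypothetical helper, and the archive name in the usage comment is a placeholder, since the actual file names of the downloads are not given here.

```python
import zipfile
from pathlib import Path


def extract_dataset(archive_path, target_dir="./Dataset"):
    """Unzip one downloaded archive into target_dir and return the extracted names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)  # create ./Dataset if it is missing
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
        return archive.namelist()


# Usage (placeholder file name, replace with the archive you downloaded):
# extract_dataset("MGTAB.zip")
```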