Results From Pan CLEF19 Test Datasets
Dataset |
lang |
type |
gender |
1 |
es |
0.8611 |
0.7556 |
1 |
en |
0.9280 |
0.7652 |
2 |
es |
0.8839 |
0.7261 |
2 |
es |
0.9227 |
0.7583 |
Pan Author Identification (Bots and Gender Profiling)
Identify Author of text on bases of their stylometry and writing style.
Use the package manager pip to install required libraries.
pip install -r requirments.txt
python train.py -i 'trainingdatapath'
python train.py -i '/input/train/data/'
python test.py -i 'testdatapath' -o 'outputpath'
python test.py -i '/input/test/data/' -o '/output/'
1. emoji_count -> Count all kind Kind of emojis
2. face_smiling -> Count ๐๐๐๐๐๐
๐คฃ๐๐๐๐๐๐
3. face_affection -> Count ๐ฅฐ๐๐คฉ๐๐โบ๐๐
4. face_tongue -> Count ๐๐๐๐คช๐๐ค
5. face_hand -> Count ๐ค๐คญ๐คซ๐ค
6. face_neutral_skeptical -> Count ๐ค๐คจ๐๐๐ถ๐๐๐๐ฌ๐คฅ
7. face_concerned -> Count ๐๐๐โน๐ฎ๐ฏ๐ฒ๐ณ๐ฅบ๐ฆ๐ง๐จ๐ฐ๐ฅ๐ข๐ญ๐ฑ๐๐ฃ๐
8. monkey_face -> Count ๐๐๐
9. emotions -> Count ๐๐๐๐๐๐๐๐๐๐โฃ๐โค๐งก๐๐๐๐๐ค๐ค'
10. url_count -> Count all kind of link/urls
11. space_count -> Spaces count
12. capital_count -> Capital letter count
13. text_length -> Total length of messge
14. curly_brackets_count -> Count { }
15. round_brackets_count -> Count ( )
16. underscore_count -> Count _
17. question_mark_count -> Count ?
18. exclamation_mark_count -> Count !
19. dollar_mark_count -> Count $
20. ampersand_mark_count -> Count &
21. hash_count -> Count #
22. tag_count -> Count @
23. slashes_count -> Count Slashes // / \
24. operator_count -> Count Operators +-*/%<>^|
25. punc_count -> Count Puntuations '",.:;`
26. line_count -> Count nextlines \n
27. word_count -> Count Words A-Za-z
Results for English Train Test Split Dataset:
Classifier |
Accuracy |
'LogisticRegression' |
0.9158576051779935 |
'RandomForestClassifier' |
0.9757281553398058 |
'LinearSVC' |
0.8770226537216829 |
'BernoulliNB' |
0.9239482200647249 |
'MultinomialNB' |
0.8236245954692557 |
'SVC' |
0.5056634304207119 |
Best Model RandomForestClassifier
Author |
precision |
recall |
f1-score |
support |
bot |
0.98 |
0.97 |
0.98 |
622 |
human |
0.97 |
0.98 |
0.98 |
614 |
|
|
|
|
|
micro avg |
0.98 |
0.98 |
0.98 |
1236 |
macro avg |
0.98 |
0.98 |
0.98 |
1236 |
weighted avg |
0.98 |
0.98 |
0.98 |
1236 |
Classifier |
Accuracy |
'LogisticRegression' |
0.7265372168284789 |
'RandomForestClassifier' |
0.8106796116504854 |
'LinearSVC' |
0.6019417475728155 |
'BernoulliNB' |
0.616504854368932 |
'MultinomialNB' |
0.616504854368932 |
'SVC' |
0.4967637540453074 |
Best Model RandomForestClassifier
Gender |
precision |
recall |
f1-score |
support |
female |
0.79 |
0.85 |
0.82 |
311 |
male |
0.83 |
0.77 |
0.80 |
307 |
|
|
|
|
|
micro avg |
0.81 |
0.81 |
0.81 |
618 |
macro avg |
0.81 |
0.81 |
0.81 |
618 |
weighted avg |
0.81 |
0.81 |
0.81 |
618 |
Results for Spanish Train Test Split Dataset:
Classifier |
Accuracy |
'LogisticRegression' |
0.8433333333333334 |
'RandomForestClassifier' |
0.9288888888888889 |
'LinearSVC' |
0.7488888888888889 |
'BernoulliNB' |
0.8188888888888889 |
'MultinomialNB' |
0.7644444444444445 |
'SVC' |
0.4888888888888889 |
Best Model RandomForestClassifier
Author |
precision |
recall |
f1-score |
support |
bot |
0.93 |
0.93 |
0.93 |
440 |
human |
0.93 |
0.93 |
0.93 |
460 |
|
|
|
|
|
micro avg |
0.93 |
0.93 |
0.93 |
900 |
macro avg |
0.93 |
0.93 |
0.93 |
900 |
weighted avg |
0.93 |
0.93 |
0.93 |
900 |
Classifier |
Accuracy |
'LogisticRegression' |
0.6844444444444444 |
'RandomForestClassifier' |
0.7844444444444445 |
'LinearSVC' |
0.5666666666666667 |
'BernoulliNB' |
0.6066666666666667 |
'MultinomialNB' |
0.6355555555555555 |
'SVC' |
0.48444444444444446 |
Best Model RandomForestClassifier
Gender |
precision |
recall |
f1-score |
support |
female |
0.77 |
0.83 |
0.80 |
232 |
male |
0.80 |
0.74 |
0.77 |
218 |
|
|
|
|
|
micro avg |
0.78 |
0.78 |
0.78 |
450 |
macro avg |
0.79 |
0.78 |
0.78 |
450 |
weighted avg |
0.79 |
0.78 |
0.78 |
450 |