2020 Fall, Internet Data Mining course, PKU. Contact: haoshibo@pku.edu.cn
A BERT-based model to distinguish between human-written text and machine-generated text. Note: the repo isn't well organized; e.g., some code may be commented out for debugging efficiency, so you may have to pick out the parts you actually need.
- Python 3.6
- PyTorch 1.4
- Transformers 3.5.1
- First, transform the raw files to jsonlines, like:
{"text": "......", "label": "neg"}
{"text": "......", "label": "neg"}
...
For this course, `preprocess_dm.py` transforms the dataset into this format.
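A minimal sketch of the conversion (the real logic lives in `preprocess_dm.py`; the example texts and label names here are illustrative assumptions):

```python
import json

# Hypothetical (text, label) pairs; the actual labels and source files
# depend on the dataset used in the course.
examples = [
    ("A human-written paragraph ...", "neg"),
    ("A machine-generated paragraph ...", "pos"),
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for text, label in examples:
        # One JSON object per line, matching the format shown above.
        f.write(json.dumps({"text": text, "label": label}) + "\n")
```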
- Second, split a validation set off from the training set, and generate encodings and labels with the BERT tokenizer. The resulting files are `train_encodings.pkl`, `train_labels.pkl`, and their counterparts for the validation and test sets.
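A minimal sketch of this step, assuming the jsonlines file produced above (the split ratio, label mapping, and `distilbert-base-uncased` checkpoint are assumptions, not necessarily what the scripts use):

```python
import json
import pickle

from sklearn.model_selection import train_test_split
from transformers import DistilBertTokenizerFast

def load_jsonl(path):
    """Read one JSON object per line into parallel text/label lists."""
    texts, labels = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            obj = json.loads(line)
            texts.append(obj["text"])
            labels.append(0 if obj["label"] == "neg" else 1)
    return texts, labels

texts, labels = load_jsonl("train.jsonl")
# 10% validation split is an assumed ratio.
train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts, labels, test_size=0.1)

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)

for name, obj in [("train_encodings", train_encodings),
                  ("train_labels", train_labels),
                  ("val_encodings", val_encodings),
                  ("val_labels", val_labels)]:
    with open(f"{name}.pkl", "wb") as f:
        pickle.dump(obj, f)
```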
- For the coherence feature mentioned in the report, two adjacent sentences are concatenated and fed into the tokenizer together.
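A sketch of how such adjacent-sentence pairs can be encoded (the naive period-based sentence splitting is an assumption for illustration; the actual script may split differently):

```python
from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

def pair_encodings(document):
    """Encode every pair of adjacent sentences as one two-segment input."""
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    firsts, seconds = sentences[:-1], sentences[1:]
    # Each pair becomes one sequence: [CLS] A [SEP] B [SEP]
    return tokenizer(firsts, seconds, truncation=True, padding=True)

enc = pair_encodings("First sentence. Second sentence. Third sentence.")
```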
- The process may be killed while executing `pickle.dump`, probably due to a lack of memory. This is a bug to be fixed.
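One possible workaround (an assumption, not the actual fix): dump the encoding fields one at a time with the highest pickle protocol, so no single `dump` call has to serialize the whole object:

```python
import pickle

def dump_fields(encodings, prefix):
    # Writing input_ids / attention_mask separately may reduce peak memory
    # compared with pickling the whole BatchEncoding at once.
    for key, value in encodings.items():
        with open(f"{prefix}_{key}.pkl", "wb") as f:
            pickle.dump(value, f, protocol=pickle.HIGHEST_PROTOCOL)
```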
- `bert_main.py` uses `DistilBertForSequenceClassification` to train a 2-class classifier. The best model is saved to `state_distilled_bert_best.pth`.
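A minimal training sketch, assuming the pickled encodings/labels from the step above. The hyperparameters are placeholders, and for brevity this saves only the final state, whereas the real script presumably keeps the checkpoint with the best validation accuracy:

```python
import pickle

import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AdamW, DistilBertForSequenceClassification

class TextDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

train_encodings = pickle.load(open("train_encodings.pkl", "rb"))
train_labels = pickle.load(open("train_labels.pkl", "rb"))
loader = DataLoader(TextDataset(train_encodings, train_labels),
                    batch_size=16, shuffle=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2).to(device)
optimizer = AdamW(model.parameters(), lr=5e-5)

model.train()
for epoch in range(3):  # epoch count is an assumption
    for batch in loader:
        optimizer.zero_grad()
        batch = {k: v.to(device) for k, v in batch.items()}
        loss = model(**batch)[0]  # Transformers 3.x returns a tuple
        loss.backward()
        optimizer.step()

torch.save(model.state_dict(), "state_distilled_bert_best.pth")
```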
- Run `bert_test.py` to evaluate. Approximately 0.80 accuracy can be achieved on the test set; details are in the report.
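A minimal evaluation sketch, assuming test encodings/labels were pickled the same way as the training files (the batch size is arbitrary; the actual `bert_test.py` may differ):

```python
import pickle

import torch
from transformers import DistilBertForSequenceClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)
model.load_state_dict(torch.load("state_distilled_bert_best.pth",
                                 map_location=device))
model.to(device).eval()

test_encodings = pickle.load(open("test_encodings.pkl", "rb"))
test_labels = pickle.load(open("test_labels.pkl", "rb"))

correct = 0
with torch.no_grad():
    for i in range(0, len(test_labels), 32):
        batch = {k: torch.tensor(v[i:i + 32]).to(device)
                 for k, v in test_encodings.items()}
        logits = model(**batch)[0]
        preds = logits.argmax(dim=-1).cpu()
        correct += (preds == torch.tensor(test_labels[i:i + 32])).sum().item()

print("test accuracy:", correct / len(test_labels))
```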