self_pretraining

Self-pretraining code in Python: a semi-supervised model for text classification.

Primary language: Python. License: Apache-2.0.

This is the Python code for the paper below:
Semi-Supervised Text Classification via Self-Pretraining, Payam Karisani, Negin Karisani. WSDM 2021. Link

Prerequisites

  • Python (>= 3.7.0)
  • NumPy (>= 1.18.5)
  • PyTorch (>= 1.4.0)
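
The dependencies can typically be installed with pip, for example:

pip install "numpy>=1.18.5" "torch>=1.4.0"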

Input
The input files (train, test, or unlabeled) should contain one document per line. Each line should have 4 tab-separated attributes:

  1. A unique document id (integer)
  2. A binary label (integer):
    • The number 1 for negative documents
    • The number 3 for positive documents
    • If the document is unlabeled this column is ignored
  3. Domain (string): a keyword describing the topic of the document
  4. Document body (string)

See the file “sample.train” for a sample input.
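
For illustration only, the snippet below writes a small file in this format; the file name, ids, domains, and document bodies are made-up placeholders:

# Hypothetical example: write three documents in the expected
# tab-separated format (id, label, domain, body).
rows = [
    (101, 1, "flu", "I think I caught the flu last week."),       # negative (label 1)
    (102, 3, "flu", "Tested positive for influenza yesterday."),  # positive (label 3)
    (103, 1, "flu", "Article about flu season."),                 # unlabeled: label column is ignored
]
with open("sample.unlabeled", "w", encoding="utf-8") as out:
    for doc_id, label, domain, body in rows:
        out.write(f"{doc_id}\t{label}\t{domain}\t{body}\n")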

Training and Evaluation
Below is an example command to run the code. It tells the code to use a subset of the documents in the training and unlabeled sets to train a model and evaluate it on the test set; the F1 measure is printed at the end of the execution.

python -m self_pretraining.src.MainThread --cmd bert_reg \
--itr 3 \
--model_path /user/desktop/bert-base-uncased/ \
--train_path /user/desktop/data/data.train \
--test_path /user/desktop/data/data.test \
--unlabeled_path /user/desktop/data/data.unlabeled \
--output_dir /user/desktop/output \
--device 0 \
--seed 666 \
--train_sample 500 \
--unlabeled_sample 10000 

The arguments are explained below:

  • “--itr”: The number of iterations to run the experiment with different random seeds
  • “--model_path”: The path to the Hugging Face pretrained BERT model
  • “--train_path”: The path to the train file
  • “--test_path”: The path to the test file
  • “--unlabeled_path”: The path to the unlabeled file
  • “--output_dir”: A directory to be used for temporary files
  • “--device”: GPU identifier
  • “--seed”: Random seed
  • “--train_sample”: The number of documents to sample from the original labeled set to be used as the training data
  • “--unlabeled_sample”: The number of unlabeled documents to sample from the unlabeled set to be used in the model
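
As a rough sketch of what the sampling arguments imply (this is not the repository's actual implementation), drawing a reproducible subset with a fixed seed could look like this:

import random

def sample_documents(lines, n, seed):
    # Hypothetical helper mirroring --train_sample / --unlabeled_sample with --seed:
    # draw n documents reproducibly from the full set.
    rng = random.Random(seed)
    return rng.sample(lines, min(n, len(lines)))

with open("/user/desktop/data/data.train", encoding="utf-8") as f:
    train_lines = f.read().splitlines()

train_subset = sample_documents(train_lines, n=500, seed=666)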

Notes

  • The code uses the Hugging Face pretrained BERT model: Link
  • The hyper-parameters are set to the default values reported in the paper. If you need to change them, they are defined as global values in the class “EPretrainProj” (see the sketch after these notes).
  • The batch size is set to 32.
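
One option for changing these values is to override the globals before launching training. The sketch below is only illustrative: the import path and attribute names are assumptions, so check the actual definition of “EPretrainProj” for the real names.

# Hypothetical sketch: the module path and attribute names are assumptions,
# not the repository's documented API.
from self_pretraining.src.MainThread import EPretrainProj

EPretrainProj.batch_size = 16       # the repository's default batch size is 32
EPretrainProj.learning_rate = 2e-5  # example value only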