This repository is temporarily associated with the paper: Lu, J., Henchion, M., Bacher, I. and Mac Namee, B., 2021. A Sentence-level Hierarchical BERT Model for Document Classification with Limited Labelled Data. arXiv preprint arXiv:2106.06738. (to be published in DS2021, the International Conference on Discovery Science)
Tested with Python 3.6. The following are required; the packages are available via pip, while the FastText and GloVe models must be downloaded separately:
- Required: numpy >= 1.19.5
- Required: scikit-learn >= 0.21.1
- Required: pandas >= 1.1.5
- Required: gensim >= 3.7.3
- Required: matplotlib >= 3.3.3
- Required: torch >= 1.9.0
- Required: transformers >= 4.8.2
- Required: Keras >= 2.0.8
- Required: Tensorflow >= 1.14.0
- Required: FastText model trained with Wikipedia 300-dimension
- Required: GloVe model trained with Gigaword and Wikipedia 200-dimension
- Required: packaging >= 20.0
The first step is to encode the raw text data into different high-dimensional vectorised representations. The raw text data should be stored in the directory "raw_corpora/", and each dataset should have its own subdirectory, for example "raw_corpora/longer_moviereview/". Each input corpus should consist of two csv files, one for the documents belonging to class A and one for the documents belonging to class B, each with a column named text. The csv files must be named in the format #datasetname_neg_text.csv and #datasetname_pos_text.csv. Each row corresponds to one document in the corpus; for the exact format, refer to the csv files in the sample directory "raw_corpora/longer_moviereview/". We can then preprocess the text data and convert it into vectors by:
python encode_text.py -d dataset_name -t encoding_methods
The options for -t are hbm (corresponding to the sentence representations generated by the token-level RoBERTa encoder in the paper), roberta-base and fasttext. For example, -t roberta-base,fasttext means encoding the documents with RoBERTa and FastText respectively. The encoded documents are stored in the directory "datasets/": the FastText document representations are stored in "datasets/fasttext/" and the other representations are stored in "datasets/roberta-base/". Note that the sentence representations for hbm are suffixed with ".pt", while the document representations generated by RoBERTa are suffixed with ".csv" (the average of all tokens represents a document) or "_cls.csv" (the classifier token "<s>" represents a document). Due to the upload file size limit, we did not upload sample ".pt" files, but you can generate your own. To encode with FastText, you need to download the pretrained FastText model in advance (see Dependencies).
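As a rough illustration of the input layout described in this step, the sketch below writes a toy corpus as two csv files with a single text column. The helper name write_corpus, the dataset name toy_reviews and the temporary root directory are hypothetical, and it assumes that #datasetname in the file-name pattern stands for the dataset name; the sample files in "raw_corpora/longer_moviereview/" remain the authoritative reference.

```python
import csv
import os
import tempfile


def write_corpus(root, dataset_name, neg_docs, pos_docs):
    """Write one corpus as two csv files, each with a single 'text' column.

    Hypothetical helper: it only mirrors the layout described above and is
    not part of the repository.
    """
    corpus_dir = os.path.join(root, "raw_corpora", dataset_name)
    os.makedirs(corpus_dir, exist_ok=True)
    paths = []
    for label, docs in (("neg", neg_docs), ("pos", pos_docs)):
        path = os.path.join(corpus_dir, f"{dataset_name}_{label}_text.csv")
        with open(path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(["text"])  # header: the required column name
            for doc in docs:
                writer.writerow([doc])  # one row per document
        paths.append(path)
    return paths


# Toy data written to a temporary directory so nothing real is overwritten.
demo_root = tempfile.mkdtemp()
demo_paths = write_corpus(
    demo_root,
    "toy_reviews",
    neg_docs=["Dull plot. Weak acting."],
    pos_docs=["Great film. Loved it."],
)
```

With files laid out like this under "raw_corpora/", the dataset name passed to -d would presumably be toy_reviews.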
We can evaluate the Hierarchical BERT Model (HBM) with a limited amount of labelled data (in this experiment, we subsample the fully labelled dataset to simulate this low-shot scenario) by:
python run_hbm.py -d dataset_name -l learning_rate -e num_of_epochs -r random_seeds -s training_set_size
The training_set_size values can be any numbers up to 200 (you can customise the maximum by editing the script). For example, -s 50,100,150 means training HBM with 50, 100 and 150 labelled instances respectively. The random_seeds are the random states used for subsampling the training set from the whole dataset. For example, -r 1988,1999 -s 50,100 will train HBM with four different training sets, i.e. 50 labelled instances sampled by seed 1988, 50 labelled instances sampled by seed 1999, 100 labelled instances sampled by seed 1988 and 100 labelled instances sampled by seed 1999.
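The way the -r and -s values combine can be sketched as a cross product. The function names below are hypothetical and the parsing is an assumption about how comma-separated arguments might be handled, not the script's actual code.

```python
from itertools import product


def parse_int_list(arg):
    """Turn a comma-separated string such as '50,100' into [50, 100]."""
    return [int(item) for item in arg.split(",") if item.strip()]


def training_runs(seeds_arg, sizes_arg):
    """List every (training_set_size, seed) pair implied by -s and -r."""
    seeds = parse_int_list(seeds_arg)
    sizes = parse_int_list(sizes_arg)
    return list(product(sizes, seeds))


runs = training_runs("1988,1999", "50,100")
for size, seed in runs:
    print(f"train with {size} labelled instances sampled by seed {seed}")
```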
The script then evaluates the performance of HBM on the remaining testing set (i.e. the whole dataset minus the 200 instances sampled out as the training set; see the paper for details). The evaluation results are stored in the directory "outputs/". Furthermore, the detailed results of each step are stored in "outputs/hbm_results/". The result files starting with "auc_" store the AUC scores, while the files starting with "raw_" store the confusion matrix (tp, tn, fp, fn).
Similar to the above settings, we can evaluate the performance of fine-tuned RoBERTa with a limited amount of labelled data by:
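For readers unfamiliar with the two kinds of result files, the sketch below shows how an AUC score and the (tp, tn, fp, fn) counts can be computed from labels and predicted scores. It is a plain-Python illustration with hypothetical function names and toy data; the repository scripts may well compute these with a library instead, and the 0.5 threshold is an assumption.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def auc_score(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy labels and scores, thresholded at 0.5 (an assumed cut-off).
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
preds = [1 if s >= 0.5 else 0 for s in scores]
toy_auc = auc_score(labels, scores)
toy_counts = confusion_counts(labels, preds)
```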
python run_fine_tuned_roberta.py -d dataset_name -l learning_rate -e num_of_epochs -r random_seeds -s training_set_size
Similarly, the evaluation results are stored in the directory "outputs/". Furthermore, the detailed results of each step are stored in "outputs/fine_tuned_results/". Note that the directories "fine_tuned_data/", "fine_tuned_outputs/" and "fine_tuned_cache/" are used to store auxiliary information generated during fine-tuning, so these three directories should be created in advance.
Similar to the above settings, we can evaluate the Hierarchical Attention Network with a limited amount of labelled data by:
python run_han.py -d dataset_name -l learning_rate -e num_of_epochs -r random_seeds -s training_set_size
The preprocessing and text encoding (with GloVe) are also integrated into this script, and you should download the GloVe model in advance (see Dependencies). The evaluation results are stored in the directory "outputs/".
We can also evaluate the performance of RoBERTa+SVM and FastText+SVM with a limited amount of labelled data by:
python run_svm-based.py -d dataset_name -t text_representation -r random_seeds -s training_set_size
Note that -t text_representation indicates the encoding method you choose; the valid options are fasttext and roberta-base. The evaluation results are stored in the directory "outputs/".
When we run the script in Step 2, besides the AUC scores on the testing set, we also get the attention score of each sentence, which measures how much that sentence contributes to forming the document representation. Hence, these attention scores can serve as a clue to whether a sentence is important or not. The attention scores are stored in "attentions/#dataset_name/". You can visualise these attention scores with the notebook Visualization_of_informative_sentences.ipynb.
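As a minimal sketch of how attention scores might be turned into a shortlist of informative sentences, the snippet below picks the k highest-scoring sentences of one document. The function name, the in-memory inputs and the choice of k are all hypothetical; the actual file format under "attentions/" and the selection rule used in the notebook are not described here and may differ.

```python
def top_sentences(sentences, scores, k=2):
    """Return the k sentences with the highest attention scores.

    The selected sentences are returned in their original document order;
    ties are broken in favour of the earlier sentence.
    """
    if len(sentences) != len(scores):
        raise ValueError("need exactly one score per sentence")
    ranked = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))
    chosen = sorted(ranked[:k])
    return [sentences[i] for i in chosen]


# Toy document with made-up attention scores.
doc_sentences = [
    "I saw this film last week.",
    "The acting was wonderful.",
    "It runs for two hours.",
    "I would happily watch it again.",
]
doc_scores = [0.05, 0.50, 0.10, 0.35]
highlighted = top_sentences(doc_sentences, doc_scores, k=2)
```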
We play with the code using the MovieReview Sentiment dataset consisting of 1000 negative movie reviews and 1000 positive movie reviews. The distribution of the number of sentences per document is shown below (maximum length 118 and avg length 33.97):
We evaluate the performance of the various methods with training sizes [50, 100] and random states [1988, 1989]. The performance of each method is shown below (AUC score):
You can also play with the notebook to check the informative sentences suggested by the HBM as shown below (examples taken from the testing set and highlighted ones are important sentences):
FastText + SVM: We use 300-dimensional word vectors constructed by a FastText language model pre-trained with the Wikipedia corpus (Joulin et al., 2016). Averaged word embeddings are used as the representation of the document. For preprocessing, all text is converted to lowercase and we remove all punctuation and stop words. SVM is used as the classifier. We tune the hyper-parameters of the SVM classifier using a grid search based on 5-fold cross-validation performed on the training set; after that, we re-train the classifier with the optimised hyper-parameters. This hyper-parameter tuning method is applied in RoBERTa + SVM as well.
RoBERTa + SVM: We use 768-dimensional word vectors generated by a pre-trained RoBERTa language model (Liu et al., 2019). We do not fine-tune the pre-trained language model and use the averaged word vectors as the representation of the document. Since all BERT-based models are configured to take a maximum of 512 tokens as input, we divide a long document with W words into k = W/511 fractions, which are then fed into the model to infer the representation of each fraction (each fraction has a "<s>" token in front of its 511 tokens, so 512 tokens in total). Following the approach of (Sun et al., 2020), the vector of each fraction is the average of the embeddings of the words in that fraction, and the representation of the whole text sequence is the mean of all k fraction vectors. For preprocessing, the only operation performed is to convert all tokens to lowercase. SVM is used as the classifier.
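The fraction-averaging step can be sketched as follows. This is an illustration only: it assumes the per-token vectors are already available (the RoBERTa encoder and the "<s>" prefix are left out), it assumes the last fraction may be shorter than 511 tokens, and the function names are hypothetical.

```python
def split_into_fractions(items, size=511):
    """Split a sequence into consecutive fractions of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(vec[d] for vec in vectors) / len(vectors) for d in range(dim)]


def document_vector(token_vectors, size=511):
    """Average tokens within each fraction, then average the fraction vectors."""
    fractions = split_into_fractions(token_vectors, size)
    fraction_means = [mean_vector(fraction) for fraction in fractions]
    return mean_vector(fraction_means)


# Toy example with 2-dimensional vectors and a fraction size of 2.
toy_tokens = [[1.0, 1.0], [3.0, 3.0], [10.0, 10.0]]
toy_doc = document_vector(toy_tokens, size=2)
num_fractions = len(split_into_fractions(list(range(1000))))
```

Note that when the last fraction is shorter than the others, the mean of the fraction vectors differs from the plain mean over all tokens, as the toy example shows.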
Fine-tuned RoBERTa: For the document classification task, fine-tuning RoBERTa means adding a softmax layer on top of the RoBERTa encoder output and fine-tuning all parameters in the model. In this experiment, we fine-tune the same 768-dimensional pre-trained RoBERTa model with a small training set. The settings of all hyper-parameters follow (Liu et al., 2019). Through hyper-parameter tuning, we set the learning rate to 1×10^-4 and the batch size to 4, and use the Adam optimizer with epsilon equal to 1×10^-8. However, since we assume that the amount of labelled data available for training is small, we do not have the luxury of a hold-out validation set with which to implement early stopping during fine-tuning. Instead, after training for 15 epochs we roll back to the model with the lowest loss on the training dataset. This rollback strategy is also applied to HAN and HBM due to the limited number of instances in the training sets. For preprocessing, the only operation performed is to convert all tokens to lowercase.
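The rollback strategy can be sketched independently of any deep learning framework. In the snippet below, the model is stood in for by a plain dictionary and train_one_epoch is a hypothetical callback that updates the state and returns the training loss for that epoch; the repository's training loops are likely organised differently.

```python
import copy


def train_with_rollback(state, train_one_epoch, num_epochs=15):
    """Train for a fixed number of epochs and keep the lowest-training-loss state.

    No validation set is used: the snapshot is chosen purely by training loss.
    """
    best_loss = float("inf")
    best_epoch = -1
    best_state = copy.deepcopy(state)
    for epoch in range(num_epochs):
        loss = train_one_epoch(state, epoch)
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            best_state = copy.deepcopy(state)  # snapshot to roll back to
    return best_state, best_epoch, best_loss


# Toy training run: the loss falls, then rises again.
toy_losses = [0.9, 0.5, 0.3, 0.4, 0.6]


def toy_epoch(state, epoch):
    state["weights"] = epoch  # pretend the weights changed this epoch
    return toy_losses[epoch]


final_state, final_epoch, final_loss = train_with_rollback(
    {"weights": None}, toy_epoch, num_epochs=len(toy_losses)
)
```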
Hierarchical Attention Network: Following (Yang et al., 2016), we apply two levels of Bi-GRU with an attention mechanism for document classification. All words are first converted to word vectors using GloVe (Pennington et al., 2014) (300 dimension version pre-trained using the wiki gigaword corpus) and fed into a word-level Bi-GRU with an attention mechanism to form sentence vectors. After that, each sentence vector, along with its context sentence vectors, is input into a sentence-level Bi-GRU with an attention mechanism to form the document representation, which is then passed to a softmax layer for the final prediction. For preprocessing, the only operations performed are to convert all tokens to lowercase and to separate documents into sentences. We apply the Python NLTK sent_tokenize function to split documents into sentences.
Hierarchical BERT Model: For HBM, we set the number of BERT layers to 4, and the maximum number of sentences to 114, 64, 128, 128, 100, and 64 for the Movie Review, Multi-domain Customer Review, Blog Author Gender, Guardian 2013, Reuters and 20 Newsgroups datasets respectively; these values are based on the length of the documents in these datasets. After some preliminary experiments, we set the number of attention heads to 1, the learning rate to 2×10^-5 and the dropout probability to 0.01, used 50 epochs, set the batch size to 4 and used the Adam optimizer with epsilon equal to 1×10^-8. The only text preprocessing operations performed are to convert all tokens to lowercase and to split documents into sentences. We apply the Python NLTK sent_tokenize function to split documents into sentences.
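One common way to enforce a maximum number of sentences per document is to truncate longer documents and pad shorter ones, keeping a mask that marks the real sentences. The sketch below illustrates that idea with plain lists; it is an assumption about how such a limit could be applied, not a description of the repository's implementation, and the function name is hypothetical.

```python
def pad_or_truncate(sentence_vectors, max_sentences, dim):
    """Fix a document to exactly `max_sentences` sentence vectors.

    Returns the padded vectors and a mask with 1 for real sentences and
    0 for zero-vector padding.
    """
    kept = [list(vec) for vec in sentence_vectors[:max_sentences]]
    n_pad = max_sentences - len(kept)
    mask = [1] * len(kept) + [0] * n_pad
    padded = kept + [[0.0] * dim for _ in range(n_pad)]
    return padded, mask


# Toy documents with 2-dimensional sentence vectors and a limit of 3 sentences.
short_doc = [[0.1, 0.2], [0.3, 0.4]]
long_doc = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
short_padded, short_mask = pad_or_truncate(short_doc, max_sentences=3, dim=2)
long_padded, long_mask = pad_or_truncate(long_doc, max_sentences=3, dim=2)
```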