Model | Dev (max / mean acc.) | Test (max / mean acc.) |
---|---|---|
TextCNN | 92.07 / 91.46 | 93.56 / 92.95 |
Bi-LSTM | 92.26 / 91.69 | 93.54 / 92.64 |
RCNN | 92.73 / 92.31 | 94.08 / 93.48 |
BERT | 95.28 / 94.82 | 95.50 / 95.00 |
Notes:
- Each cell reports the highest accuracy and the mean accuracy over 10 runs, respectively.
- The Bi-LSTM consists of 3 bidirectional LSTM layers, while the RCNN achieves its results with only 1 bidirectional LSTM layer (see the sketch below).
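For concreteness, the sketch below shows one way such a Bi-LSTM classifier could be written in PyTorch; the hidden size, pooling strategy, and other details are assumptions for illustration, not the repository's actual configuration.

```python
# A minimal sketch of the Bi-LSTM classifier described above (3 bidirectional
# LSTM layers over word embeddings). Hyperparameters and pooling are assumed.
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=3,
                            bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)       # (batch, seq, embed_dim)
        outputs, _ = self.lstm(embedded)           # (batch, seq, 2 * hidden_dim)
        pooled = outputs.mean(dim=1)               # simple mean pooling (assumed)
        return self.fc(pooled)                     # (batch, num_classes)
```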
```bash
cd data
tar -zxf chnsenticorp.tgz
```
Use `pip install` to install any missing module.
Below, I document some notes about Chinese word segmentation and the pretrained models.
Needed by: TextCNN, RNN, RCNN
Since Chinese text has no spaces between words, I need a word segmentation tool to preprocess each sentence into a sequence of words.
Here, I adopt jieba, which is popular in the Chinese NLP community.
Then, I use the pretrained word embeddings `sgns.renmin.bigram-char` to convert each discrete word into a dense word vector.
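As a rough illustration, the sketch below segments a sentence with jieba and looks up each word in the pretrained embeddings. It assumes the embedding file is in plain word2vec text format and can be loaded with gensim; the repository's actual preprocessing code may differ.

```python
# Minimal sketch: jieba segmentation + lookup in sgns.renmin.bigram-char.
# Assumes the embedding file is in word2vec text format (gensim-loadable).
import jieba
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("sgns.renmin.bigram-char", binary=False)

sentence = "这家酒店的服务很好"   # example review sentence
words = jieba.lcut(sentence)       # e.g. ['这家', '酒店', '的', '服务', '很', '好']

# Convert each word to a dense vector; out-of-vocabulary words need a fallback
# (e.g. a zero vector or a randomly initialized <unk> embedding).
word_vectors = [vectors[w] for w in words if w in vectors]
```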
It's quite easy to implement a BERT-based model with the interfaces provided by Hugging Face.
However, it took me a while to read through their implementation: as a trade-off between easy adoption and software engineering, the details behind those interfaces are quite complicated.
I recommend this handout for breaking a Transformer down into pieces and understanding the model.
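For reference, here is a minimal sketch of a BERT sentiment classifier built on the `transformers` library. The checkpoint name `bert-base-chinese`, the label order, and the settings below are assumptions for illustration, not necessarily what this repository uses.

```python
# Minimal sketch of a BERT classifier via Hugging Face transformers.
# "bert-base-chinese" and the label order are assumptions.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)

inputs = tokenizer("这家酒店的服务很好", padding="max_length", truncation=True,
                   max_length=128, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # shape: (1, 2)
prediction = logits.argmax(dim=-1).item()      # 0 / 1 label index (assumed order)
```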
I implement a dataloader from scratch to package data samples into batches.
I knew there were helpful utility modules, such as TorchText, that could facilitate the data reading procedure.
But when I started this experiment (January 2020), TorchText was at version 0.4, and there was no thorough documentation or tutorial for it.
What's more, after investigating it for a while, I found it hard to prepare data samples for BERT with it.
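As an illustration of what batching from scratch involves, here is a minimal sketch of a custom `Dataset` and collate function that pads variable-length sequences and returns their lengths (useful for the RNN/RCNN models). It is an assumed example, not the repository's actual dataloader.

```python
# Minimal sketch of a from-scratch batching pipeline with padding + lengths.
import torch
from torch.utils.data import Dataset, DataLoader

class SentimentDataset(Dataset):
    def __init__(self, samples):
        # samples: list of (list_of_token_ids, label) pairs
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        token_ids, label = self.samples[idx]
        return torch.tensor(token_ids), label

def collate(batch, pad_id=0):
    seqs, labels = zip(*batch)
    lengths = torch.tensor([len(s) for s in seqs])
    padded = torch.nn.utils.rnn.pad_sequence(seqs, batch_first=True, padding_value=pad_id)
    return padded, lengths, torch.tensor(labels)

loader = DataLoader(SentimentDataset([([2, 5, 9], 1), ([7, 3], 0)]),
                    batch_size=2, shuffle=True, collate_fn=collate)
```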
```bash
python main.py --model_name=cnn --dataset_dir=data/chnsenticorp --freeze=True --repeat=10 --epochs=100 --save_dir=.
python main.py --model_name=rnn --dataset_dir=data/chnsenticorp --freeze=True --include_length=True --repeat=10 --epochs=100 --save_dir=.
python main.py --model_name=rcnn --dataset_dir=data/chnsenticorp --freeze=True --include_length=True --repeat=10 --epochs=100 --save_dir=.
python main.py --model_name=bert --dataset_dir=data/chnsenticorp --freeze=True --include_length=True --max_len=128 --batch_size=32 --repeat=10 --epochs=100 --save_dir=.
```
Due to limited GPU memory, you should set `max_len` and `batch_size` carefully.
Note that the upper limit of `max_len` is 512, because BERT's position embeddings only support sequences of at most 512 tokens.