This is a PyTorch implementation of Convolutional Neural Networks for Sentence Classification, using Bidirectional Encoder Representations from Transformers (BERT) instead of word2vec for word embeddings.
Yoon Kim. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1746-1751, 2014.
- Python: 3.8.0 or higher
- PyTorch: 1.6.0 or higher
- Optuna: 2.0.0 or higher
- Transformers (Hugging Face): 3.0.2 or higher
If you have Anaconda installed, you can create a virtual environment from env.yml:
$ conda env create -f env.yml
To use the sample dataset, download RCV1-ids and put the folder extracted from id.zip into data/.
This is the RCV1 dataset with its raw texts converted to IDs by the BERT tokenizer.
Here, "raw" means the texts were not normalized.
BERT-base accepts at most 512 tokens, so only the first 512 tokens of each text are used.
The dataset is split into 23,149 training samples and 781,265 test samples, following RCV1: A New Benchmark Collection for Text Categorization Research.
If you want to use your own or another dataset, convert the texts to IDs with the BERT tokenizer and follow the format below (a conversion sketch comes after the example):

- one document per line
- each line must contain the tokenized IDs and a label, separated by a tab
- the label must be represented as a one-hot vector

Example:

[id1, id2, id3, ...]<TAB>[000100...]
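A rough sketch of the conversion step is shown below. The output file name, the label serialization, and the 103-class count (RCV1's topic set) are assumptions; match them to what run.py's data loader actually expects.

```python
# Sketch only: the output file name and label serialization are assumptions.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def to_line(text, label_index, num_classes=103):  # 103 = RCV1 topics (assumption)
    # Convert raw text to BERT token IDs, truncated to BERT-base's 512-token limit.
    ids = tokenizer.encode(text, max_length=512, truncation=True)
    # Build a one-hot label vector for the document's class.
    one_hot = [1 if i == label_index else 0 for i in range(num_classes)]
    return str(ids) + "\t" + str(one_hot)

with open("data/train.txt", "w") as f:  # hypothetical file name
    f.write(to_line("Stocks rallied on Monday.", label_index=2) + "\n")
```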
You should also adjust some parameters in run.py, such as classes and embedding_dim.
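For instance (hypothetical values; check the actual names and defaults in run.py):

```python
# Hypothetical settings, assuming RCV1 with BERT-base:
classes = 103        # number of label classes (RCV1 defines 103 topic codes)
embedding_dim = 768  # hidden size of BERT-base
```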
Hugging Face publishes a number of pre-trained models, including BERT. This program uses the bert-base-uncased model, which embeds each token into a 768-dimensional vector. Fine-tuning of BERT is turned off.
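In outline, using BERT as a frozen embedding layer looks something like this (a minimal sketch, not the repository's exact code; variable names are illustrative):

```python
import torch
from transformers import BertModel

bert = BertModel.from_pretrained("bert-base-uncased")
for param in bert.parameters():
    param.requires_grad = False  # fine-tuning turned off

ids = torch.tensor([[101, 7592, 2088, 102]])  # example IDs: [CLS] hello world [SEP]
with torch.no_grad():
    embeddings = bert(ids)[0]  # last hidden state, shape (1, seq_len, 768)
```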
This program evaluates with Precision@k, Micro-F1, and Macro-F1.
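For reference, Precision@k for multi-label outputs can be computed along these lines (a sketch, not the repository's implementation):

```python
import numpy as np

def precision_at_k(scores, targets, k):
    # scores, targets: (num_samples, num_classes); targets holds 0/1 relevance.
    topk = np.argsort(-scores, axis=1)[:, :k]         # top-k predicted classes per sample
    hits = np.take_along_axis(targets, topk, axis=1)  # 1 where a top-k class is relevant
    return hits.sum(axis=1).mean() / k                # mean fraction of top-k that are relevant

scores = np.array([[0.9, 0.1, 0.8], [0.2, 0.7, 0.1]])
targets = np.array([[1, 0, 1], [0, 1, 0]])
print(precision_at_k(scores, targets, k=2))  # 0.75
```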
Train and evaluate with default settings:

$ python run.py

Run hyperparameter tuning with Optuna:

$ python run.py --tuning

Run without CUDA (CPU only):

$ python run.py --no_cuda
This program is based on the following repositories. Many thanks to their authors.
- siddsax/XML-CNN (MIT license)
- yu54ku/xml-cnn (MIT license)