Text Classification with Various DNN Methods

By trying various DNN text classification methods on a Japanese corpus, the livedoor news corpus, I aim to gain knowledge and experience with DNNs for NLP.

Tokenizer

I experimented with the two tokenizers below; a usage sketch for both follows the list.
[1] MeCab + mecab-ipadic-NEologd
Since the program is implemented in Python, mecab-python3 is also required to run it.
[2] SentencePiece
The SentencePiece model is trained on a Japanese Wikipedia dump.
To train it, I referred to the webpage titled "Wikipediaから日本語コーパスを利用してSentencePieceでトークナイズ(分かち書き)" (tokenizing with SentencePiece using a Japanese corpus from Wikipedia).
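
A minimal sketch of running both tokenizers via mecab-python3 and the sentencepiece Python package. The NEologd dictionary path and the SentencePiece model filename are assumptions for illustration; adjust them for your environment.

```python
import MeCab
import sentencepiece as spm

text = "livedoorニュースのコーパスを分類します。"

# [1] MeCab with the mecab-ipadic-NEologd dictionary (via mecab-python3).
# The -d dictionary path is an assumption; adjust it for your installation.
tagger = MeCab.Tagger("-Owakati -d /usr/lib/mecab/dic/mecab-ipadic-neologd")
print(tagger.parse(text).split())

# [2] A SentencePiece model trained on a Japanese Wikipedia dump.
# "wiki-ja.model" is a hypothetical filename.
sp = spm.SentencePieceProcessor()
sp.Load("wiki-ja.model")
print(sp.EncodeAsPieces(text))
```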

Word Embedding

For the models tokenized with MeCab, fastText word embeddings are used (see the [1] results below).
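
A minimal sketch of training such embeddings with gensim's FastText implementation; the corpus filename and all hyperparameters are assumptions, not the settings actually used in the notebooks.

```python
from gensim.models import FastText

# One whitespace-tokenized document per line (e.g. MeCab -Owakati output);
# "corpus_wakati.txt" is a hypothetical filename.
with open("corpus_wakati.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f]

# Hyperparameters are illustrative assumptions.
model = FastText(sentences=sentences, vector_size=300, window=5, min_count=5)
print(model.wv["ニュース"][:5])  # first five dimensions of one word vector
```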

Japanese BERT

I used bert-japanese implemented by yoheikikuta. Instead of using his trained SentencePiece model and pretrained BERT model, I trained them from scratch. The only changes I made are listed below.

  • I trained SentencePiece with a vocabulary of 8,000 tokens instead of 32,000 (a training sketch follows the eval results below).
  • I used a newer Japanese Wikipedia dump than the one he used.
  • I pretrained the BERT model for 1,300,000 steps instead of 1,400,000. The pretraining eval results are shown below.
***** Eval results *****
global_step = 1300000
loss = 1.3378378
masked_lm_accuracy = 0.71464056
masked_lm_loss = 1.2572908
next_sentence_accuracy = 0.97375
next_sentence_loss = 0.08065516
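
As referenced in the first bullet above, a minimal sketch of training SentencePiece with an 8,000-token vocabulary. The input filename, model prefix, and unigram model type are assumptions for illustration; only vocab_size=8000 comes from the description above.

```python
import sentencepiece as spm

# "wiki_ja.txt" is a hypothetical path to the extracted Wikipedia text.
spm.SentencePieceTrainer.Train(
    "--input=wiki_ja.txt --model_prefix=wiki-ja "
    "--vocab_size=8000 --model_type=unigram"
)
```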

Results

[1] MeCab + mecab-ipadic-NEologd + fastText

  • MLP (Multi-layer Perceptron)
                precision    recall  f1-score   support

dokujo-tsushin      0.721     0.886     0.795       175
  it-life-hack      0.789     0.825     0.806       154
 kaden-channel      0.856     0.856     0.856       167
livedoor-homme      0.820     0.439     0.571       114
   movie-enter      0.848     0.931     0.888       174
        peachy      0.801     0.701     0.748       184
          smax      0.930     0.930     0.930       186
  sports-watch      0.980     0.890     0.932       163
    topic-news      0.832     0.975     0.897       157

     micro avg      0.839     0.839     0.839      1474
     macro avg      0.842     0.826     0.825      1474
  weighted avg      0.843     0.839     0.834      1474
  • CNN
                  precision    recall  f1-score   support

dokujo-tsushin      0.935     0.909     0.922       175
  it-life-hack      0.906     1.000     0.951       154
 kaden-channel      1.000     0.970     0.985       167
livedoor-homme      0.968     0.798     0.875       114
   movie-enter      0.939     0.977     0.958       174
        peachy      0.900     0.929     0.914       184
          smax      0.984     0.978     0.981       186
  sports-watch      0.994     1.000     0.997       163
    topic-news      0.981     0.987     0.984       157

     micro avg      0.955     0.955     0.955      1474
     macro avg      0.956     0.950     0.952      1474
  weighted avg      0.956     0.955     0.954      1474
  • BiLSTM
                precision    recall  f1-score   support

dokujo-tsushin      0.850     0.874     0.862       175
  it-life-hack      0.851     0.929     0.888       154
 kaden-channel      0.957     0.928     0.942       167
livedoor-homme      0.786     0.675     0.726       114
   movie-enter      0.860     0.989     0.920       174
        peachy      0.886     0.761     0.819       184
          smax      0.957     0.957     0.957       186
  sports-watch      0.969     0.957     0.963       163
    topic-news      0.950     0.975     0.962       157

     micro avg      0.900     0.900     0.900      1474
     macro avg      0.896     0.894     0.893      1474
  weighted avg      0.900     0.900     0.899      1474

[2] SentencePiece

  • MLP
                precision    recall  f1-score   support

dokujo-tsushin      0.862     0.926     0.893       175
  it-life-hack      0.911     0.929     0.920       154
 kaden-channel      0.931     0.970     0.950       167
livedoor-homme      0.886     0.684     0.772       114
   movie-enter      0.945     0.983     0.963       174
        peachy      0.893     0.864     0.878       184
          smax      0.974     0.995     0.984       186
  sports-watch      0.974     0.914     0.943       163
    topic-news      0.909     0.955     0.932       157

     micro avg      0.922     0.922     0.922      1474
     macro avg      0.921     0.913     0.915      1474
  weighted avg      0.922     0.922     0.920      1474
  • CNN
                precision    recall  f1-score   support

dokujo-tsushin      0.965     0.937     0.951       175
  it-life-hack      0.962     0.994     0.978       154
 kaden-channel      1.000     0.982     0.991       167
livedoor-homme      0.903     0.816     0.857       114
   movie-enter      0.956     0.994     0.975       174
        peachy      0.937     0.962     0.949       184
          smax      0.989     1.000     0.995       186
  sports-watch      0.975     0.975     0.975       163
    topic-news      0.981     0.981     0.981       157

     micro avg      0.965     0.965     0.965      1474
     macro avg      0.963     0.960     0.961      1474
  weighted avg      0.965     0.965     0.965      1474
  • BiLSTM
                precision    recall  f1-score   support

dokujo-tsushin      0.927     0.943     0.935       175
  it-life-hack      0.936     0.955     0.945       154
 kaden-channel      0.970     0.964     0.967       167
livedoor-homme      0.930     0.702     0.800       114
   movie-enter      0.919     0.983     0.950       174
        peachy      0.891     0.935     0.912       184
          smax      0.969     0.995     0.981       186
  sports-watch      0.969     0.957     0.963       163
    topic-news      0.955     0.949     0.952       157

     micro avg      0.940     0.940     0.940      1474
     macro avg      0.941     0.931     0.934      1474
  weighted avg      0.941     0.940     0.939      1474
  • BERT
                precision    recall  f1-score   support

dokujo-tsushin      0.958     0.920     0.939       175
  it-life-hack      0.933     0.987     0.959       154
 kaden-channel      0.976     0.964     0.970       167
livedoor-homme      0.922     0.825     0.870       114
   movie-enter      0.944     0.977     0.960       174
        peachy      0.922     0.967     0.944       184
          smax      0.989     0.973     0.981       186
  sports-watch      1.000     0.982     0.991       163
    topic-news      0.969     0.987     0.978       157

      accuracy                          0.958      1474
     macro avg      0.957     0.954     0.955      1474
  weighted avg      0.958     0.958     0.958      1474
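
Each table above follows the format of scikit-learn's classification_report. A minimal sketch of producing such a table, with placeholder predictions (in the notebooks, y_true and y_pred come from each trained model):

```python
from sklearn.metrics import classification_report

# Placeholder data for illustration only, not the actual predictions.
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 2, 1, 1, 0]
names = ["dokujo-tsushin", "it-life-hack", "kaden-channel"]
print(classification_report(y_true, y_pred, target_names=names, digits=3))
```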

Conclusion

The best of the seven models above is the CNN with SentencePiece.
Results may differ on more complicated classification tasks.
For every DNN architecture tested with both tokenizers (MLP, CNN, and BiLSTM), the SentencePiece model outperformed the fastText + MeCab + mecab-ipadic-NEologd model.
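
For reference, a minimal Keras sketch of a text-CNN classifier of the kind compared above. The layer sizes, sequence length, and vocabulary size are generic assumptions for illustration, not the architecture actually used in the notebooks.

```python
import tensorflow as tf
from tensorflow.keras import layers

# All sizes below are illustrative assumptions; 9 classes matches the
# livedoor corpus categories shown in the reports above.
VOCAB_SIZE, MAX_LEN, EMBED_DIM, NUM_CLASSES = 8000, 400, 300, 9

model = tf.keras.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),          # token embeddings
    layers.Conv1D(128, 5, activation="relu"),         # n-gram feature detector
    layers.GlobalMaxPooling1D(),                      # strongest feature per filter
    layers.Dense(128, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),  # one unit per category
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```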