/cnn-ld-tf

Convolutional Neural Network for Language Detection in TensorFlow


Convolutional Neural Network for Language Detection

Note: This project is mostly based on https://github.com/yuhaozhang/sentence-convnet


Demo

  1. Run API Server

    python ./main.py
  2. Run HTML server
    For example, with Python 2:

    python -m SimpleHTTPServer 5050

    (On Python 3, the equivalent is python -m http.server 5050.)

    Then open http://localhost:5050/docs/ in a browser.


Requirements

To train with pretrained embedding (train.py --use_pretrain=True)

To download TED corpus (ted.py)

To visualize (visualize.ipynb)

Web API (main.py)

Data

  • TED Subtitle Corpus
    The ./data/ted500 directory contains preprocessed data. To reproduce it (requires 2GB+ of disk space):

    python ./ted.py
  • Your own data
    Put one data file per class, e.g. for class_names = ['neg', 'pos']:

    cnn-ld-tf
    ├── ...
    └── data
        └── mr
            ├── mr.neg  # examples with class neg
            └── mr.pos  # examples with class pos
    

    Note:

    • Data file encoding must be utf-8.
    • One example per line.
    • Every class must contain the same number of examples.
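
The three constraints above can be checked before preprocessing. The following is a minimal sketch and not part of this repository: the helper names `count_examples` and `check_dataset` are hypothetical, and it assumes the `./data/<name>/<name>.<class>` layout shown in the tree above and that blank lines do not count as examples.

```python
import io
import os


def count_examples(path):
    """Count non-empty lines in a file that must decode as UTF-8.

    Raises UnicodeDecodeError if the file is not valid UTF-8.
    """
    with io.open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


def check_dataset(data_dir, name, class_names):
    """Hypothetical helper: verify one file per class with equal example counts.

    Assumes files are laid out as <data_dir>/<name>/<name>.<class>.
    Returns a dict mapping each class name to its example count.
    """
    counts = {}
    for cls in class_names:
        path = os.path.join(data_dir, name, "%s.%s" % (name, cls))
        counts[cls] = count_examples(path)
    if len(set(counts.values())) > 1:
        raise ValueError("classes have different example counts: %r" % counts)
    return counts
```

For the layout above, this might be called as `check_dataset("./data", "mr", ["neg", "pos"])`. It raises `ValueError` if the classes are unbalanced, and the underlying file read raises an error if a file is missing or not valid UTF-8.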

Preprocess

python ./util.py

Training

python ./train.py

Prediction

python ./predict.py

Evaluation

python ./eval.py

Run TensorBoard

tensorboard --logdir=./model/ted500/summaries

Embeddings by script name

References

CNN for text classification:

TED Corpus:

Language Detection:

Web API on heroku:

Pre-trained model

  • Supported languages (65):
    ["ar", "az", "bg", "bn", "bo", "cs", "da", "de", "el", "en", "es", "fa", "fi", "fil", "fr", "gu", "he", "hi", "ht", "hu", "hy", "id", "is", "it", "ja", "ka", "km", "kn", "ko", "ku", "lt", "mg", "ml", "mn", "ms", "my", "nb", "ne", "nl", "nn", "pl", "ps", "pt", "ro", "ru", "si", "sk", "sl", "so", "sq", "sv", "sw", "ta", "te", "tg", "th", "tl", "tr", "ug", "uk", "ur", "uz", "vi", "zh-cn", "zh-tw"]

For details, see the documentation.