Convolutional Neural Network for Language Detection

Note: This project is mostly based on https://github.com/yuhaozhang/sentence-convnet

Demo

To train with pretrained embeddings (train.py --use_pretrain=True)

To download the TED corpus (ted.py)

To visualize (visualize.ipynb)

Web API (main.py)

TED Subtitle Corpus
The ./data/ted500 directory contains preprocessed data. To reproduce it (requires 2GB+ of disk space):
```
python ./ted.py
```
Your own data
Put one data file per class under the data directory, e.g. for class_names = ['neg', 'pos']:
```
cnn-ld-tf
├── ...
└── data
    └── mr
        ├── mr.neg  # examples with class neg
        └── mr.pos  # examples with class pos
```
Note:
- Data files must be UTF-8 encoded.
- Each line holds exactly one example.
- Every class must have the same number of examples.
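
The constraints in the note above can be checked before preprocessing. The following is a minimal sketch, not part of this repository: the helper name `check_class_files` and its arguments are hypothetical, and it assumes the `<name>.<class>` file naming shown in the directory tree (e.g. `mr.neg`, `mr.pos`).

```python
import os


def check_class_files(data_dir, name, class_names):
    """Check per-class data files named <name>.<class> in data_dir.

    Hypothetical helper (not part of this repository).
    Returns a dict mapping each class name to its number of examples.
    Raises UnicodeDecodeError if a file is not valid UTF-8, and
    ValueError if the classes have different numbers of examples.
    """
    counts = {}
    for cls in class_names:
        path = os.path.join(data_dir, "%s.%s" % (name, cls))
        # Decoding as UTF-8 fails loudly on files in another encoding.
        with open(path, encoding="utf-8") as f:
            # One example per line; blank lines are not counted.
            counts[cls] = sum(1 for line in f if line.strip())
    if len(set(counts.values())) > 1:
        raise ValueError(
            "classes have different numbers of examples: %r" % counts
        )
    return counts
```

For the layout above, a call would look like `check_class_files("./data/mr", "mr", ["neg", "pos"])`.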

Preprocess:

python ./util.py

Train:

python ./train.py

Predict:

python ./predict.py

Evaluate:

python ./eval.py

Run TensorBoard:

tensorboard --logdir=./model/ted500/summaries

CNN for text classification:

TED Corpus:

Language Detection:

Web API on Heroku:

Supported languages (65):
["ar", "az", "bg", "bn", "bo", "cs", "da", "de", "el", "en", "es", "fa", "fi", "fil", "fr", "gu", "he", "hi", "ht", "hu", "hy", "id", "is", "it", "ja", "ka", "km", "kn", "ko", "ku", "lt", "mg", "ml", "mn", "ms", "my", "nb", "ne", "nl", "nn", "pl", "ps", "pt", "ro", "ru", "si", "sk", "sl", "so", "sq", "sv", "sw", "ta", "te", "tg", "th", "tl", "tr", "ug", "uk", "ur", "uz", "vi", "zh-cn", "zh-tw"]
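
As an illustration of how a classifier's output relates to the language codes above, here is a minimal sketch that turns one raw score per class into ranked language guesses. It is not this project's prediction code: the function `top_languages` is hypothetical, and it assumes the model emits scores in the same order as the list above.

```python
import math

# Language codes copied from the supported-languages list above.
LANGUAGES = [
    "ar", "az", "bg", "bn", "bo", "cs", "da", "de", "el", "en", "es", "fa",
    "fi", "fil", "fr", "gu", "he", "hi", "ht", "hu", "hy", "id", "is", "it",
    "ja", "ka", "km", "kn", "ko", "ku", "lt", "mg", "ml", "mn", "ms", "my",
    "nb", "ne", "nl", "nn", "pl", "ps", "pt", "ro", "ru", "si", "sk", "sl",
    "so", "sq", "sv", "sw", "ta", "te", "tg", "th", "tl", "tr", "ug", "uk",
    "ur", "uz", "vi", "zh-cn", "zh-tw",
]


def top_languages(scores, k=3):
    """Return the k most likely (language, probability) pairs.

    Hypothetical helper. `scores` is assumed to hold one raw score
    (logit) per class, ordered like LANGUAGES.
    """
    if len(scores) != len(LANGUAGES):
        raise ValueError("expected %d scores, got %d"
                         % (len(LANGUAGES), len(scores)))
    # Numerically stable softmax: subtract the maximum before exponentiating.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(LANGUAGES, probs), key=lambda p: p[1], reverse=True)
    return ranked[:k]
```

Returning the top few candidates rather than a single label makes it easier to spot confusions between closely related classes such as "zh-cn" and "zh-tw".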

For details, please see the documentation.