Note: This project is mostly based on https://github.com/yuhaozhang/sentence-convnet
- Run API Server: python ./main.py
- Run HTML server, for example: python -m SimpleHTTPServer 5050
  Then open http://localhost:5050/docs/
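The `SimpleHTTPServer` module exists only in Python 2; in Python 3 it was merged into `http.server`. The sketch below is one way to serve the repository root from a script on either version. The helper name `make_server` and its defaults are hypothetical and not part of this project.

```python
# Minimal static file server; a sketch, not part of this repository.
try:
    from http.server import HTTPServer, SimpleHTTPRequestHandler  # Python 3
except ImportError:
    from BaseHTTPServer import HTTPServer  # Python 2
    from SimpleHTTPServer import SimpleHTTPRequestHandler


def make_server(port=5050, host="localhost"):
    """Return an HTTP server that serves files from the current working directory.

    `make_server` is a hypothetical helper name; port 5050 mirrors the example above.
    """
    return HTTPServer((host, port), SimpleHTTPRequestHandler)


# To serve until interrupted, run from the repository root:
#     make_server().serve_forever()
```

Started from the repository root, this should make the pages under docs/ reachable at http://localhost:5050/docs/, the same as the one-line command.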
- Python 2.7
- TensorFlow (tested with versions 0.10.0rc0 -> 1.0)
- NumPy
To train with pretrained embedding (train.py --use_pretrain=True)
To download TED corpus (ted.py)
To visualize (visualize.ipynb)
Web API (main.py)
- TED Subtitle Corpus
  The ./data/ted500 directory includes preprocessed data. To reproduce (2GB+ disk space required): python ./ted.py
- Your own data
  Put one data file per class, e.g. for class_names = ['neg', 'pos']:
    cnn-ld-tf
    ├── ...
    └── data
        └── mr
            ├── mr.neg  # examples with class neg
            └── mr.pos  # examples with class pos
Note:
- Data file encoding must be utf-8.
- One example per line.
- The number of examples of each class must be the same.
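Checking the three constraints above before preprocessing can save a failed run. The following is a minimal sketch, assuming the `data/<dataset>/<dataset>.<class>` layout shown above; the function name `check_dataset` is hypothetical and is not provided by this repository.

```python
import io
import os


def check_dataset(data_root, dataset, class_names):
    """Validate per-class data files and return the example count per class.

    Hypothetical helper: assumes files live at <data_root>/<dataset>/<dataset>.<class>.
    """
    counts = {}
    for name in class_names:
        path = os.path.join(data_root, dataset, dataset + "." + name)
        # Decoding as utf-8 raises UnicodeDecodeError if the file is not valid utf-8.
        with io.open(path, encoding="utf-8") as f:
            # One example per line; blank lines are not counted as examples.
            counts[name] = sum(1 for line in f if line.strip())
    # Every class must have the same number of examples.
    if len(set(counts.values())) > 1:
        raise ValueError("class sizes differ: %r" % (counts,))
    return counts
```

For the layout above, `check_dataset("data", "mr", ["neg", "pos"])` would return a dict of counts, or raise if the files are unbalanced or not utf-8.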
python ./util.py
python ./train.py
python ./predict.py
python ./eval.py
tensorboard --logdir=./model/ted500/summaries
CNN for text classification:
- https://github.com/yuhaozhang/sentence-convnet
- https://github.com/dennybritz/cnn-text-classification-tf
- http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/
- http://tkengo.github.io/blog/2016/03/14/text-classification-by-cnn/
TED Corpus:
Language Detection:
Web API on heroku:
- Supported languages (65):
["ar", "az", "bg", "bn", "bo", "cs", "da", "de", "el", "en", "es", "fa", "fi", "fil", "fr", "gu", "he", "hi", "ht", "hu", "hy", "id", "is", "it", "ja", "ka", "km", "kn", "ko", "ku", "lt", "mg", "ml", "mn", "ms", "my", "nb", "ne", "nl", "nn", "pl", "ps", "pt", "ro", "ru", "si", "sk", "sl", "so", "sq", "sv", "sw", "ta", "te", "tg", "th", "tl", "tr", "ug", "uk", "ur", "uz", "vi", "zh-cn", "zh-tw"]
Details: please visit documentation