Thai Natural Language Processing (Thai NLP) Resource
A collection of Thai Natural Language Processing (NLP) software libraries, dictionaries, and corpora.
Pull requests are always welcome.
Thai NLP Libraries/Services
| Library | Description | Programming Languages | Features | License | Author & Link |
|---|---|---|---|---|---|
| JTCC | Thai Character Cluster | Java | | GPL-3.0 | Wittawat |
| TCC | Thai Character Cluster | Python | | Apache 2.0 | Wannaphong |
| Library | Description | Programming Languages | Features | License | Author & Link |
|---|---|---|---|---|---|
| sentiment_analysis_thai | | | | | JagerV3 |
| Library | Description | Programming Languages | Features | License | Author & Link |
|---|---|---|---|---|---|
| LK82 + Udom83 | Thai Soundex | Python | | | Korakot |
| Library | Description | Programming Languages | Features | License | Author & Link |
|---|---|---|---|---|---|
| Swath | SWATH (Smart Word Analysis for THai) is a word segmentation tool for Thai | C | Longest Matching, Maximal Matching and Part-of-Speech Bigram. | GPL | CMU |
| Lexto | Lexto: Thai Lexeme Tokenizer | Java | | LGPL | NECTEC |
| | | Python 2 | | LGPL | Python2 Wrapper |
| | | Python 3 | | LGPL | Python3 Wrapper |
| Wordcut | Thai word breaker for Node.js | JavaScript, Node.JS | | LGPL-3.0 | veer66, github |
| wordcutpy | A simple Thai word tokenizer written in 1 Python file | Python 3 | | LGPL-3.0 | veer66, github |
| CutKum | Thai Word-Segmentation with Deep Learning in Tensorflow. RNN. | Python | 93% F-measure. | MIT | Pucktada, github |
| Thai Language Toolkit (tltk) | Based on a paper by Wirote Aroonmanakun in 2002. Word segmentation is based on a maximum collocation approach. Syllable segmentation is based on trigram statistics. (Dataset is included) | Python | 97.86% F-measure. (Tested on a different test set, so it is not directly comparable with the other models.) | GPLv3 | awirote, the Python Package Index |
| DeepCut | A Thai word tokenization library using Deep Neural Network. CNN. | Python | 98.8% F-measure. | MIT | rkcosmos, github |
| SynThai | Thai Word Segmentation and Part-of-Speech Tagging with Deep Learning. RNN. LSTM. | Python | 99.2% F-measure | MIT | KenjiroAI, github |
| CutThai | Thai word segmentation written in coffee-script | Coffee-script | | MIT | Pureexe/cutthai Github |
| Multi-Candidate-Word-Segmentation | Multi Candidate Word Segmentation for Thai language | Python, RNN, LSTM | 97.0% F-measure (Word Level), 98.95% F-measure (Boundary Level) | MIT | Paper, earthy123/Multi-Candidate-Word-Segmentation |
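Longest Matching, listed above as one of Swath's methods, can be illustrated with a minimal sketch. This is not Swath's implementation: the function name `longest_matching` and the four-word toy dictionary are assumptions made for illustration only. At each position the sketch takes the longest dictionary word that starts there, and falls back to a single character when nothing matches.

```python
def longest_matching(text, dictionary, max_word_len=None):
    """Greedy left-to-right segmentation using the longest dictionary match."""
    if max_word_len is None:
        max_word_len = max((len(w) for w in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shrink the window.
        for end in range(min(len(text), i + max_word_len), i, -1):
            if text[i:end] in dictionary:
                match = text[i:end]
                break
        if match is None:
            # Unknown character: emit it as a one-character token.
            match = text[i]
        tokens.append(match)
        i += len(match)
    return tokens


# Hypothetical toy dictionary, not taken from any listed resource.
TOY_DICT = {"ตา", "ตาก", "กลม", "ลม"}

print(longest_matching("ตากลม", TOY_DICT))
```

With this toy dictionary the greedy choice yields ตาก + ลม even though ตา + กลม is also a valid split. That ambiguity is the usual motivation for maximal matching and statistical or neural models.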
Part of Speech Tagging (POS Tagging)
| Library | Description | Programming Languages | Features | License | Author & Link |
|---|---|---|---|---|---|
| Jitar+NAiST | A simple Trigram HMM part-of-speech tagger | Java | | | Veer66, Jitar + NAiST, 1 + NAiST, 2 |
| SynThai | Thai Word Segmentation and Part-of-Speech Tagging with Deep Learning. RNN. LSTM. | Python | 0.9163 F-measure. RNN. LSTM | MIT | KenjiroAI, github |
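Many entries in this list report an F-measure, and one word segmentation entry distinguishes word-level from boundary-level scores. The sketch below shows one common way to compute both for a segmenter. The definitions are assumptions, not the evaluation scripts of any listed project: a word counts as correct only if both its start and end offsets match the gold standard, while a boundary counts as correct if its end offset matches.

```python
def spans(tokens):
    """Convert a token list into a set of (start, end) character offsets."""
    out, start = set(), 0
    for tok in tokens:
        out.add((start, start + len(tok)))
        start += len(tok)
    return out


def f_measure(gold_items, pred_items):
    """Harmonic mean of precision and recall over two sets."""
    correct = len(gold_items & pred_items)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_items)
    recall = correct / len(gold_items)
    return 2 * precision * recall / (precision + recall)


def word_level_f(gold_tokens, pred_tokens):
    return f_measure(spans(gold_tokens), spans(pred_tokens))


def boundary_level_f(gold_tokens, pred_tokens):
    # Assumption: a boundary is the end offset of each token.
    gold_ends = {end for _, end in spans(gold_tokens)}
    pred_ends = {end for _, end in spans(pred_tokens)}
    return f_measure(gold_ends, pred_ends)


# Hypothetical example: the prediction over-splits the middle word.
gold = ["a", "bc", "d"]
pred = ["a", "b", "c", "d"]
print(word_level_f(gold, pred), boundary_level_f(gold, pred))
```

One wrong split breaks a whole word but only adds one spurious boundary, so under these definitions the boundary-level score is typically the higher of the two.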
| Library | Description | Programming Languages | Features | License | Author & Link |
|---|---|---|---|---|---|
| Named Entity Tagging (Thai NEST) | Thai Named Entity tagging Specification and Tools | | | GPL | KINDML, SIIT, AIAT |
| Library | Description | Programming Languages | Features | License | Author & Link |
|---|---|---|---|---|---|
| News Structure Tagging Program | Thai News Structure Tagging Program | | Metadata tagging, Structure tagging, Automatic News Title Generation | GPL | AIAT |
Syntactic Parsing & Tools
| Library | Description | Programming Languages | Features | License | Author & Link |
|---|---|---|---|---|---|
| Chart-parser | Extract Syntactic Structure from POS Tagged Sentence. | C | | All rights reserved | Thanaruk T. (thanaruk@siit.tu.ac.th) |
| Grammar Processing | Labelled Brackets -> Context Free Grammars (CFGs) | Python | Transform and compute probability | | Thodsaporn C. |
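The "Labelled Brackets -> Context Free Grammars" transformation with rule probabilities can be sketched as follows. This is a minimal illustration, not the listed tool's code: it assumes Penn-Treebank-style bracket strings and maximum-likelihood estimation (each rule count divided by the total count of its left-hand side). All function names and the two-sentence toy treebank are hypothetical.

```python
from collections import Counter, defaultdict


def parse_brackets(s):
    """Parse one labelled-bracket string into nested (label, children) tuples."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1                      # consume "("
            label = tokens[pos]
            pos += 1
            children = []
            while tokens[pos] != ")":
                children.append(read())
            pos += 1                      # consume ")"
            return (label, children)
        word = tokens[pos]                # a leaf (terminal word)
        pos += 1
        return word

    return read()


def extract_rules(tree, counts):
    """Count every rule LHS -> RHS that occurs in the tree."""
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    counts[(label, rhs)] += 1
    for child in children:
        if not isinstance(child, str):
            extract_rules(child, counts)


def rule_probabilities(treebank):
    """Relative-frequency estimate of P(RHS | LHS) for each observed rule."""
    counts = Counter()
    for line in treebank:
        extract_rules(parse_brackets(line), counts)
    lhs_totals = defaultdict(int)
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}


# Hypothetical toy treebank.
TOY_TREEBANK = [
    "(S (NP (N dog)) (VP (V runs)))",
    "(S (NP (N cat)) (VP (V sees) (NP (N bird))))",
]
probs = rule_probabilities(TOY_TREEBANK)
for (lhs, rhs), p in sorted(probs.items()):
    print(f"{lhs} -> {' '.join(rhs)}  {p:.3f}")
```

Here VP expands in two different ways, once each, so each VP rule receives probability 0.5, while S -> NP VP receives 1.0.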
| Library | Description | Programming Languages | Features | License | Author & Link |
|---|---|---|---|---|---|
| kobkrit-word-embedding | Tensorflow implementation of Thai word embedding | Python | Source code, Example, Word distance graph | LGPL | Kobkrit V. |
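Word embeddings are usually compared by the distance between their vectors. The sketch below shows cosine similarity and a nearest-neighbour lookup in plain Python. It does not use the library above: the three-dimensional vectors are invented for illustration (real embeddings typically have hundreds of dimensions), and `nearest` is a hypothetical helper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(word, vectors):
    """Return the other word whose vector is most similar to `word`."""
    return max(
        (w for w in vectors if w != word),
        key=lambda w: cosine_similarity(vectors[word], vectors[w]),
    )


# Made-up toy vectors, NOT real embeddings.
TOY_VECTORS = {
    "แมว": [0.9, 0.1, 0.0],    # cat
    "สุนัข": [0.8, 0.2, 0.1],   # dog
    "รถ": [0.0, 0.1, 0.9],     # car
}

print(nearest("แมว", TOY_VECTORS))
```

In a trained model the same lookup would run over the full vocabulary, usually with a vectorised library rather than Python loops.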
Thai Question Answering (Machine Comprehension)
| Service | Description | License | Author & Link |
|---|---|---|---|
| Thai Machine Comprehension (ThaiMC) | Bidirectional Attention Flow | Copyright (As the service) | iApp-AI |
Dictionaries / Translation Pairs
| Library | Description | Size | Features | License | Link |
|---|---|---|---|---|---|
| Transliteration Corpus | | 31K pairs | Thai-Eng Translation Pair | CC BY-NC-SA 3.0 TH | NECTEC |
| LEXiTRON | Thai<->English Dictionary | | TH->EN, EN->TH | LEXiTRON License | NECTEC |
| Yaitron | LEXiTRON in machine readable format (XML) | | TH->EN, EN->TH | LEXiTRON License | Veer66 Schema, Data & Conversion Code |
| Library | Description | Size | Features | License | Link |
|---|---|---|---|---|---|
| Thai National Corpus 2 | | 32M words | Query text by genre, domain | All rights reserved | CHULA |
| Thai Medical Document | | 3,594 docs | Document and dynamic keyword map | All rights reserved | KINDML, SIIT |
| Southeast Asian Languages Library | Thai News, Web Text, Pop Music, Literature, Toponyms | 20M chars | Phrases around a search text | | SEALang |
| HSE Thai Corpus | Modern texts written in Thai language (mostly news websites) | 50M tokens | Query by word form, lexeme, translation, grammatical attributes, lexical attributes | | HSE School of Linguistics |
| Pre-trained Model | Description | Size | Dimensions | License | Link |
|---|---|---|---|---|---|
| fastText | Skip-Gram model trained on Wikipedia using fastText | | 300 | CC BY-SA 3.0 | Facebook + Bin & Text + Text Only |
| thai2vec v0.2 | ULMFit on Wikipedia. Perplexity of 34.9 with 60,002 embeddings. | 70MB | 300 | MIT | thai2vec / pyThaiNLP |
Text Classification Benchmarks
Not found? Try another Thai NLP awesome list or resource, such as this one:
http://aiat.in.th/resources/