Thai Natural Language Processing (Thai NLP) Resource
A collection of Thai Natural Language Processing (NLP) software libraries, dictionaries, and corpora.
Pull requests are always welcome.

| Library | Description | Programming Languages | Features | License | Author & Link |
| --- | --- | --- | --- | --- | --- |
| JTCC | Thai Character Cluster | Java | | GPL-3.0 | Wittawat |
| TCC | Thai Character Cluster | Python | | Apache 2.0 | Wannaphong |


| Library | Description | Programming Languages | Features | License | Author & Link |
| --- | --- | --- | --- | --- | --- |
| LK82 + Udom83 | Thai Soundex | Python | | | Korakot |


| Library | Description | Programming Languages | Features | License | Author & Link |
| --- | --- | --- | --- | --- | --- |
| Swath | SWATH (Smart Word Analysis for THai), a word segmentation tool for Thai | C | Longest Matching, Maximal Matching, and Part-of-Speech Bigram | GPL | CMU |
| Lexto | Lexto: Thai Lexeme Tokenizer | Java | | LGPL | NECTEC |
| Lexto | | Python 2 | | LGPL | Python2 Wrapper |
| Lexto | | Python 3 | | LGPL | Python3 Wrapper |
| Wordcut | Thai word breaker for Node.js | JavaScript, Node.js | | LGPL-3.0 | veer66, github |
| wordcutpy | A simple Thai word tokenizer written in a single Python file | Python 3 | | LGPL-3.0 | veer66, github |
| CutKum | Thai word segmentation with deep learning in TensorFlow (RNN) | Python | 0.93 F-measure | MIT | Pucktada, github |
| DeepCut | A Thai word tokenization library using a deep neural network (CNN) | Python | 0.988 F-measure | MIT | rkcosmos, github |
| SynThai | Thai word segmentation and part-of-speech tagging with deep learning (RNN, LSTM) | Python | 0.992 F-measure | MIT | KenjiroAI, github |

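
Several of the tokenizers above are dictionary-based, and Swath lists longest matching among its features. As a rough illustration of that idea only (not Swath's actual implementation), here is a minimal Python sketch. The function name `longest_matching` and the four-word toy dictionary are assumptions made up for this example; real tools use large lexicons and handle unknown words more carefully.

```python
def longest_matching(text, dictionary):
    """Greedy left-to-right segmentation.

    At each position, take the longest dictionary word that starts there.
    If nothing matches, emit a single character as its own token.
    """
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shrink.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        if match is None:
            match = text[i]  # unknown character: fall back to one-char token
        tokens.append(match)
        i += len(match)
    return tokens


# Hypothetical toy dictionary, for illustration only.
toy_dictionary = {"ตา", "กลม", "ตาก", "ลม"}
print(longest_matching("ตากลม", toy_dictionary))  # ['ตาก', 'ลม']
```

The example input is ambiguous: it can be read as ตา + กลม or as ตาก + ลม. Greedy longest matching always commits to the second reading, which is one reason tools add techniques such as maximal matching, part-of-speech bigrams, or neural models.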
Part of Speech Tagging (POS Tagging)

| Library | Description | Programming Languages | Features | License | Author & Link |
| --- | --- | --- | --- | --- | --- |
| Jitar+NAiST | A simple trigram HMM part-of-speech tagger | Java | | | veer66, Jitar + NAiST, 1 + NAiST, 2 |
| SynThai | Thai word segmentation and part-of-speech tagging with deep learning (RNN, LSTM) | Python | 0.9163 F-measure (RNN, LSTM) | MIT | KenjiroAI, github |


| Library | Description | Programming Languages | Features | License | Author & Link |
| --- | --- | --- | --- | --- | --- |
| Named Entity Tagging (Thai NEST) | Thai Named Entity Tagging Specification and Tools | | | GPL | KINDML, SIIT, AIAT |


| Library | Description | Programming Languages | Features | License | Author & Link |
| --- | --- | --- | --- | --- | --- |
| News Structure Tagging Program | Thai News Structure Tagging Program | | Metadata tagging, Structure tagging, Automatic News Title Generation | GPL | AIAT |

Syntactic Parsing & Tools

| Library | Description | Programming Languages | Features | License | Author & Link |
| --- | --- | --- | --- | --- | --- |
| Chart-parser | Extract syntactic structure from a POS-tagged sentence | C | | All rights reserved | Thanaruk T. (thanaruk@siit.tu.ac.th) |
| Grammar Processing | Labelled Brackets -> Context Free Grammars (CFGs) | Python | Transform and compute probability | | Thodsaporn C. |

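
The Grammar Processing entry describes turning labelled brackets into a context-free grammar and computing probabilities. The sketch below shows one common way to do that in general: read the production rules off each bracketed tree, then estimate each rule's probability by relative frequency. It is not taken from the listed tool; the function names (`parse_brackets`, `count_rules`, `estimate_pcfg`) and the two English toy trees are assumptions for illustration.

```python
from collections import Counter


def parse_brackets(s):
    """Parse a labelled-bracket string into nested (label, children) tuples."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        label = tokens[pos + 1]
        pos += 2
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())       # nested constituent
            else:
                children.append(tokens[pos])  # leaf word
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read()


def count_rules(tree, counts):
    """Count every production LHS -> RHS that occurs in the tree."""
    label, children = tree
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    counts[(label, rhs)] += 1
    for child in children:
        if isinstance(child, tuple):
            count_rules(child, counts)


def estimate_pcfg(bracketed_sentences):
    """Relative-frequency estimate: P(LHS -> RHS) = count(rule) / count(LHS)."""
    counts = Counter()
    for sentence in bracketed_sentences:
        count_rules(parse_brackets(sentence), counts)
    lhs_totals = Counter()
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}


# Hypothetical toy treebank, for illustration only.
trees = [
    "(S (NP (N dog)) (VP (V barks)))",
    "(S (NP (N cat)) (VP (V sleeps)))",
]
for (lhs, rhs), p in sorted(estimate_pcfg(trees).items()):
    print(f"{lhs} -> {' '.join(rhs)}  [{p:.2f}]")
```

This sketch does not distinguish terminals from non-terminals on the right-hand side, so it assumes words and labels never share a spelling.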

| Library | Description | Programming Languages | Features | License | Author & Link |
| --- | --- | --- | --- | --- | --- |
| kobkrit-word-embedding | TensorFlow implementation of Thai word embedding | Python | Source code, Example, Word distance graph | LGPL | Kobkrit V. |

Dictionaries / Translation Pairs

| Library | Description | Size | Features | License | Link |
| --- | --- | --- | --- | --- | --- |
| Transliteration Corpus | | 31K pairs | Thai-Eng Translation Pair | CC BY-NC-SA 3.0 TH | NECTEC |
| Lexitron | Open-source Thai-English Dictionary | | TH->EN, EN->TH | LGPL | NECTEC |


| Library | Description | Size | Features | License | Link |
| --- | --- | --- | --- | --- | --- |
| ORCHID | | 30K sentences | Word-segmented, POS-tagged | CC BY-NC-SA 3.0 TH | NECTEC |
| InterBEST 2009/2010 | | 5M words | Word-segmented | CC BY-NC-SA 3.0 TH | NECTEC |
| Thai Wikipedia | Formal Articles | 1.49 GB (~213.1 MB compressed) | XML | GFDL | WIKIPEDIA |
| TNC Top-5000 Words | Word frequency | 5,000 words | Frequency of Thai words in various genres, Excel | All rights reserved | CHULA |
| Click Bait Sentences | Thai Click Bait Sentences | 330 sentences (90.7 KB) | | MIT | Wannaphongcom |
| Thai Sentimental Word List | Thai Sentimental Word List | 52 KB | Words separated into Adj, V | MIT | Wannaphongcom |
| Prime Minister 29 | Prime Minister 29's Speech Sentences | 338 KB | Word-segmented, Named Entity Tagged | MIT | Wannaphongcom |


| Library | Description | Size | Features | License | Link |
| --- | --- | --- | --- | --- | --- |
| Thai National Corpus 2 | | 32M words | Query text by genre, domain | All rights reserved | CHULA |
| Thai Medical Document | | 3,594 docs | Document and dynamic keyword map | All rights reserved | KINDML, SIIT |
| Southeast Asian Languages Library | Thai News, Web Text, Pop Music, Literature, Toponyms | 20M chars | Phrases around a search text | | SEALang |


| Pre-trained Model | Description | Size | Dimensions | License | Link |
| --- | --- | --- | --- | --- | --- |
| fastText | Skip-Gram model trained on Wikipedia using fastText | | 300 | CC BY-SA 3.0 | Facebook: Bin & Text, Text Only |

Not found? Try another Thai NLP awesome list or resource (like this one):
http://aiat.in.th/resources/