word-segmentation
There are 141 repositories under the word-segmentation topic.
google/sentencepiece
Unsupervised text tokenizer for Neural Network-based text generation.
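A minimal sketch of typical usage of the Python bindings (assuming `pip install sentencepiece`; the corpus path, model prefix, and vocabulary size below are illustrative):

```python
import sentencepiece as spm

# Train a small subword model on a plain-text corpus.
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="m", vocab_size=8000
)

# Load the trained model and split a sentence into subword pieces.
sp = spm.SentencePieceProcessor(model_file="m.model")
print(sp.encode("This is a test.", out_type=str))
```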
baidu/lac
Baidu NLP: word segmentation, part-of-speech tagging, named entity recognition, and word importance
wolfgarbe/SymSpell
SymSpell: 1 million times faster spelling correction & fuzzy search through the Symmetric Delete spelling correction algorithm
PyThaiNLP/pythainlp
Thai natural language processing in Python
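A minimal sketch of Thai word segmentation with PyThaiNLP (assuming `pip install pythainlp`); Thai is written without spaces between words, so tokenization needs a dictionary- or model-based segmenter:

```python
from pythainlp.tokenize import word_tokenize

# "newmm" is PyThaiNLP's dictionary-based maximal-matching segmenter.
text = "ภาษาไทยไม่มีการเว้นวรรคระหว่างคำ"
print(word_tokenize(text, engine="newmm"))
```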
VKCOM/YouTokenToMe
Unsupervised text tokenizer focused on computational efficiency
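A sketch of BPE training and encoding with the YouTokenToMe Python API (assuming `pip install youtokentome`; paths and vocabulary size are illustrative):

```python
import youtokentome as yttm

# Train a BPE model on a plain-text file.
yttm.BPE.train(data="corpus.txt", vocab_size=5000, model="bpe.model")

# Load the model and encode text into subword units.
bpe = yttm.BPE(model="bpe.model")
print(bpe.encode(["unsupervised tokenization"], output_type=yttm.OutputType.SUBWORD))
```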
mammothb/symspellpy
Python port of SymSpell: 1 million times faster spelling correction & fuzzy search through the Symmetric Delete spelling correction algorithm
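Besides spelling correction, symspellpy exposes a `word_segmentation` method that splits run-together text using word frequencies; a minimal sketch based on the English frequency dictionary bundled with the package:

```python
import pkg_resources
from symspellpy import SymSpell

# Load the English frequency dictionary shipped with symspellpy.
sym_spell = SymSpell(max_dictionary_edit_distance=0, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_dictionary_en_82_765.txt"
)
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)

# Insert spaces into a string with no word boundaries.
result = sym_spell.word_segmentation("thequickbrownfoxjumpsoverthelazydog")
print(result.corrected_string)  # "the quick brown fox jumps over the lazy dog"
```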
ckiplab/ckip-transformers
CKIP Transformers
cbaziotis/ekphrasis
Ekphrasis is a text processing tool geared towards text from social networks such as Twitter or Facebook. It performs tokenization, word normalization, word segmentation (for splitting hashtags), and spell correction, using word statistics from two large corpora (English Wikipedia and a Twitter corpus of 330 million English tweets).
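A sketch of hashtag-style word segmentation with ekphrasis (assuming `pip install ekphrasis`; the first run downloads the word-statistics files):

```python
from ekphrasis.classes.segmenter import Segmenter

# Segmenter backed by Twitter word statistics (corpus="english" uses Wikipedia).
seg = Segmenter(corpus="twitter")
print(seg.segment("smallandinsignificant"))  # e.g. "small and insignificant"
```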
vncorenlp/VnCoreNLP
A Vietnamese natural language processing toolkit (NAACL 2018)
bab2min/Kiwi
Kiwi (an intelligent Korean morphological analyzer)
JayYip/m3tl
BERT for Multitask Learning
modelscope/AdaSeq
AdaSeq: An All-in-One Library for Developing State-of-the-Art Sequence Understanding Models
taishi-i/nagisa
A Japanese tokenizer based on recurrent neural networks
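A minimal sketch of Japanese segmentation and POS tagging with nagisa (assuming `pip install nagisa`):

```python
import nagisa

# Segment a Japanese sentence and attach part-of-speech tags.
tokens = nagisa.tagging("今日はとても良い天気です")
print(tokens.words)    # segmented words
print(tokens.postags)  # one POS tag per word
```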
ku-nlp/jumanpp
Juman++ (a Morphological Analyzer Toolkit)
jacksonllee/pycantonese
Cantonese Linguistics and NLP
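A sketch of Cantonese word segmentation with PyCantonese (assuming version 3.x, where `segment` and `pos_tag` are top-level functions):

```python
import pycantonese

# Split a Cantonese sentence into words.
words = pycantonese.segment("廣東話好難學")
print(words)

# Tag the segmented words with parts of speech.
print(pycantonese.pos_tag(words))
```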
yongzhuo/Pytorch-NLU
Pytorch-NLU, a Chinese text classification and sequence labeling toolkit. It supports multi-class and multi-label classification of Chinese long and short texts, as well as sequence labeling tasks such as Chinese named entity recognition, part-of-speech tagging, word segmentation, and extractive text summarization.
bab2min/kiwipiepy
Python API for Kiwi
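A sketch of Korean morphological analysis with kiwipiepy (assuming a recent version, where `Kiwi.tokenize` returns tokens with `form` and `tag` attributes):

```python
from kiwipiepy import Kiwi

# Analyze a Korean sentence into morphemes with POS tags.
kiwi = Kiwi()
for token in kiwi.tokenize("한국어 형태소 분석기입니다"):
    print(token.form, token.tag)
```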
jidasheng/bi-lstm-crf
A PyTorch implementation of the BI-LSTM-CRF model.
monpa-team/monpa
MONPA (罔拍) is a multi-task model providing Traditional Chinese word segmentation, part-of-speech tagging, and named entity recognition.
ikegami-yukino/mecab
This repository is for building a Windows 64-bit MeCab binary and improving the MeCab Python binding.
fastcws/fastcws
A lightweight, high-performance Chinese word segmentation project
taishi-i/toiro
A comparison tool of Japanese tokenizers
ckiplab/ckipnlp
CKIP CoreNLP Toolkits
Ailln/nlp-roadmap
🗺️ A learning roadmap for natural language processing
peterolson/hanzi-tools
Converts from Chinese characters to pinyin, between simplified and traditional, and does word segmentation.
fudannlp16/CWS_Dict
Source code for the paper "Neural Networks Incorporating Dictionaries for Chinese Word Segmentation", AAAI 2018
jcyk/CWS
Source code for an ACL 2016 paper on Chinese word segmentation
wolfgarbe/WordSegmentationTM
Fast Word Segmentation with Triangular Matrix
datquocnguyen/RDRsegmenter
A Fast and Accurate Vietnamese Word Segmenter (LREC 2018)
ruanchaves/hashformers
Hashformers is a framework for hashtag segmentation with Transformers and Large Language Models (LLMs).
phongnt570/UETsegmenter
A toolkit for Vietnamese word segmentation
MighTguY/customized-symspell
Java port of SymSpell: 1 million times faster spelling correction & fuzzy search through the Symmetric Delete spelling correction algorithm
ye-kyaw-thu/sylbreak
Syllable segmentation tool for the Myanmar language (Burmese) by Ye Kyaw Thu.
dnanhkhoa/python-vncorenlp
A Python wrapper for VnCoreNLP using a bidirectional communication channel.
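A sketch of Vietnamese word segmentation through the wrapper (assuming the `vncorenlp` Python package and a downloaded VnCoreNLP jar; the jar path is illustrative):

```python
from vncorenlp import VnCoreNLP

# Start VnCoreNLP with only the word-segmentation annotator.
annotator = VnCoreNLP("VnCoreNLP-1.1.1.jar", annotators="wseg", max_heap_size="-Xmx2g")

# Returns word-segmented sentences; multi-syllable words are joined with underscores.
print(annotator.tokenize("Ngôn ngữ học máy tính rất thú vị"))
```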
undertheseanlp/word_tokenize
Vietnamese word tokenizer
giganticode/codeprep
A toolkit for pre-processing large source code corpora