
Library for tokenizing micro-text from sources like Twitter and SMS messaging

Primary language: Java. License: BSD 3-Clause "New" or "Revised" License (BSD-3-Clause).

MicroText Tokenizer

The MicroText Tokenizer is a Java tokenization library that NLP systems can use to process text from Twitter, SMS, Slack, and other messaging platforms. Micro-text refers to messages found on microblogging platforms like Twitter and to communications sent through SMS or similar messaging platforms. It has several characteristics that set it apart from the text found in more traditional documents, and standard NLP tools perform poorly on text like the following:

Cake?@Username K gotta get the hair,eyebrows n nails
done today b4 9 then I WISH I had sum one I could go
CAKE wit :( sigh

This example demonstrates several of the challenges in tokenization of micro-text.
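
To make the difficulty concrete, here is a minimal sketch of a naive whitespace-only baseline applied to the message above. It does not use the MicroText Tokenizer API; the class `NaiveTokenizerDemo` and the method `whitespaceTokens` are hypothetical names introduced only for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical demo class; it is not part of the MicroText Tokenizer library.
final class NaiveTokenizerDemo {

    // Baseline: split on runs of whitespace and nothing else.
    static List<String> whitespaceTokens(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        // The example message from above, joined into a single line.
        String message = "Cake?@Username K gotta get the hair,eyebrows n nails "
                + "done today b4 9 then I WISH I had sum one I could go "
                + "CAKE wit :( sigh";
        // Print one token per line so fused tokens are easy to spot.
        for (String token : whitespaceTokens(message)) {
            System.out.println(token);
        }
    }
}
```

With this baseline, `Cake?@Username` and `hair,eyebrows` each come out as a single token because no whitespace separates their parts, and the emoticon `:(` survives only because it happens to be surrounded by spaces.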