MicroText Tokenizer
The MicroText Tokenizer is a Java tokenization library that NLP systems can use to process text from Twitter, SMS, Slack, and other messaging platforms. Micro-text refers to messages posted on microblogging platforms such as Twitter or sent through SMS and similar messaging services. It differs from the text found in more traditional documents in several ways, and standard NLP tools perform poorly on text like the following:
Cake?@Username K gotta get the hair,eyebrows n nails
done today b4 9 then I WISH I had sum one I could go
CAKE wit :( sigh
This example demonstrates several of the challenges in tokenization of micro-text.
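To make the problem concrete, the sketch below compares a plain whitespace split with a small regex-based tokenizer on lines from the example above. It is an illustration only and does not use or reflect the MicroText Tokenizer API: the class name `MicroTextSketch`, its methods, and the token pattern are all hypothetical and cover only a handful of cases (a few emoticons, @mentions, and punctuation that is not separated by spaces).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not part of the MicroText Tokenizer.
class MicroTextSketch {

    // Assumed pattern for this sketch. Alternatives are tried left to right.
    private static final Pattern TOKEN = Pattern.compile(
            "[:;=][-']?[()DPp]"  // a few common emoticons, e.g. :( or ;-)
            + "|@\\w+"           // @mentions
            + "|\\w+"            // words and numbers, including forms like b4
            + "|\\S");           // any other single non-space character

    // Baseline: split on runs of whitespace only.
    static List<String> whitespaceTokens(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // Sketch: collect successive matches of the token pattern.
    static List<String> regexTokens(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String first = "Cake?@Username K gotta get the hair,eyebrows n nails";
        String last = "CAKE wit :( sigh";
        System.out.println(whitespaceTokens(first));
        System.out.println(regexTokens(first));
        System.out.println(regexTokens(last));
    }
}
```

On the first line, the whitespace split keeps `Cake?@Username` and `hair,eyebrows` as single tokens, while the regex sketch separates them into `Cake`, `?`, `@Username` and `hair`, `,`, `eyebrows`. On the last line, the sketch keeps the emoticon `:(` together as one token. Neither approach addresses nonstandard spellings such as `b4`, `wit`, or `sum one`; those are still emitted as ordinary word tokens.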