saurabhshri/CCAligner

Find and integrate a text tokenisation library.

saurabhshri opened this issue · 2 comments

The current text tokenisation is fairly naive and handles only the simplest cases. A good tokenisation library should recognise every kind of token that appears in subtitle text, such as currency amounts, dates, numbers and symbols, and expand each one into its spoken form.

For example:

In 1996, 1996 people sent emails at someone @ example . com at 1:30 PM.

In nineteen ninety six, one thousand nine hundred and ninety six people sent emails at someone at example dot com at one thirty p m

and all the alternative readings of the same sentence.
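To make the two readings of "1996" in the example concrete, here is a minimal sketch of how a number token could be expanded both as a cardinal and as a year. The function names (`twoDigitsToWords`, `cardinalToWords`, `yearToWords`) are hypothetical and do not exist in CCAligner or in any particular library; the sketch only covers 0 to 9999 and says nothing about currency, times or email addresses.

```cpp
#include <cassert>
#include <string>

// Hypothetical helpers for illustration only; not part of CCAligner.
static const char* const kOnes[] = {
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen"};
static const char* const kTens[] = {
    "", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
    "eighty", "ninety"};

// Spell out a value in the range 0..99.
std::string twoDigitsToWords(int n) {
    assert(n >= 0 && n < 100);
    if (n < 20) return kOnes[n];
    std::string words = kTens[n / 10];
    if (n % 10 != 0) words += std::string(" ") + kOnes[n % 10];
    return words;
}

// Cardinal reading for 0..9999, e.g. "one thousand nine hundred and ninety six".
std::string cardinalToWords(int n) {
    assert(n >= 0 && n < 10000);
    if (n < 100) return twoDigitsToWords(n);
    std::string words;
    if (n >= 1000) words = std::string(kOnes[n / 1000]) + " thousand";
    const int hundreds = (n / 100) % 10;
    if (hundreds != 0) {
        if (!words.empty()) words += " ";
        words += std::string(kOnes[hundreds]) + " hundred";
    }
    const int rest = n % 100;
    if (rest != 0) words += " and " + twoDigitsToWords(rest);
    return words;
}

// Year reading for 1000..9999, e.g. "nineteen ninety six".
std::string yearToWords(int year) {
    assert(year >= 1000 && year < 10000);
    const int high = year / 100;  // first two digits
    const int low = year % 100;   // last two digits
    // Years such as 2000 or 2005 are read like ordinary numbers.
    if (high % 10 == 0) return cardinalToWords(year);
    if (low == 0) return twoDigitsToWords(high) + " hundred";
    if (low < 10) return twoDigitsToWords(high) + " oh " + twoDigitsToWords(low);
    return twoDigitsToWords(high) + " " + twoDigitsToWords(low);
}
```

With these helpers, `yearToWords(1996)` yields "nineteen ninety six" and `cardinalToWords(1996)` yields "one thousand nine hundred and ninety six", which are the two expansions shown in the example. Since a tokeniser cannot always tell from context which reading applies, it would plausibly emit both as alternatives for the aligner to choose between.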

The library needs to be integrated into the subtitle parser (srtparser.h).

@nshmyrev Thanks! That looks really nice! :)