etoken
------

*UPDATE* The following problem in the Perl module has since been resolved.

An existence-table-based tokenizer implemented with hashes of hashes. And
that's it with the fancy words from me. I am not a C programmer.

This little program is a solution to the problem posed at
https://github.com/glts/Lingua-Deva/blob/master/lib/Lingua/Deva.pm#L187

Given a list of token definitions, i.e. the existence table,

    a
    ae
    abc

a hash of hashes is constructed to serve as a model of the structure of all
valid tokens. On second thought, some kind of tree structure might have been
more appropriate.

        a*
       /  \
      e*   b
          / \
         c*  d*

This structure is then traversed repeatedly during the tokenization of some
input. Since possible endpoints are specially marked in the hash of hashes
(shown with `*` above), all and only valid tokens are recognized. For the
hash of hashes pictured above, "abc" would be a valid token, but "ab" would
not.

The Perl function linked above has the flaw that it requires all possible
prefixes of a token to be tokens too: a token "abcd" requires the tokens
"abc", "ab", and "a" to exist. Etoken does not have this limitation.

On the other hand, the Perl version does Unicode case folding. It would be
foolish to reimplement that in C, so etoken only does primitive ASCII case
folding (optional).

Check out the example run in main.c, which reads the token definitions in
"exampledef.txt" and prints the separate tokens found in "example.txt":

    make
    ./etoken

The Perl script etoken.pl does the same thing. For large data, the C version
is almost 20 times as fast on my machine.
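To make the marked-endpoint idea concrete, here is a minimal sketch in C. It is not the code from this repository: it swaps the hash of hashes for a plain trie over a fixed-size node pool, and the names `struct node`, `pool`, `MAX_NODES`, `add_token`, and `match_token` are all made up for illustration. It also assumes a longest-match policy, which the description above does not state explicitly.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 256   /* hypothetical capacity of the static node pool */

/* One trie node: a child index per byte value plus an endpoint mark. */
struct node {
    int child[256];     /* 0 means "no child"; index 0 is the root */
    int is_end;         /* nonzero if a valid token ends at this node */
};

static struct node pool[MAX_NODES];
static int used = 1;    /* node 0 is reserved for the root */

/* Insert one token definition, marking its last node as an endpoint. */
static void add_token(const char *tok)
{
    int cur = 0;
    for (; *tok; tok++) {
        unsigned char c = (unsigned char)*tok;
        if (pool[cur].child[c] == 0) {
            assert(used < MAX_NODES);   /* sketch only: no dynamic growth */
            pool[cur].child[c] = used++;
        }
        cur = pool[cur].child[c];
    }
    pool[cur].is_end = 1;
}

/* Return the length of the longest valid token at the start of s,
 * or 0 if no token matches there (longest match is an assumption). */
static size_t match_token(const char *s)
{
    int cur = 0;
    size_t best = 0, i;
    for (i = 0; s[i]; i++) {
        cur = pool[cur].child[(unsigned char)s[i]];
        if (cur == 0)
            break;              /* left the structure: stop walking */
        if (pool[cur].is_end)
            best = i + 1;       /* remember the last marked endpoint */
    }
    return best;
}
```

After adding the definitions "a", "ae", and "abc", `match_token("abc")` returns 3, while `match_token("ab")` returns 1: the walk passes through the unmarked "b" node, so only the complete token "a" is recognized. No prefix of a token needs to be a token itself, because intermediate nodes simply stay unmarked.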