synalp/jtrans

Aligning without modifying tokenization

It would be great if jtrans could align text without modifying its tokenization, only giving timing at token boundaries.

Also, what are the tokenization rules in jtrans?

Tokens are separated on whitespace and punctuation marks (including apostrophes). For more details, see https://github.com/synalp/jtrans/blob/master/src/fr/loria/synalp/jtrans/markup/in/RawTextLoader.java#L19
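
To make the rule above concrete, here is a rough sketch in Python. It is not the JTrans implementation: the regular expression and the `tokenize_with_offsets` name are assumptions made for illustration, and the authoritative pattern is the one in the linked `RawTextLoader.java`. The sketch treats whitespace and punctuation (apostrophes included) as separators and records each token's character offsets, so timings attached to tokens could in principle be mapped back to positions in the original text.

```python
import re

# Assumed rule, for illustration only: a token is a maximal run of letters
# or digits. Whitespace, punctuation, apostrophes, and underscores all act
# as separators and are discarded. The real JTrans pattern may differ.
TOKEN_RE = re.compile(r"[^\W_]+")


def tokenize_with_offsets(text):
    """Return (token, start, end) triples, with offsets into `text`."""
    return [(m.group(), m.start(), m.end()) for m in TOKEN_RE.finditer(text)]


if __name__ == "__main__":
    for token, start, end in tokenize_with_offsets("aujourd'hui, c'est bien"):
        print(token, start, end)
```

Under this assumed rule, `aujourd'hui` comes out as two tokens (`aujourd` and `hui`), which is the kind of re-tokenization the original question is about. Keeping the offsets is one possible way to relate such tokens to an externally defined tokenization.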

Since none of the input formats handled by JTrans provide a tokenization, there's nothing to keep intact, so to speak.