Aligning without modifying tokenization
Closed this issue · 2 comments
benob commented
It would be great if jtrans could align text while leaving its existing tokenization untouched, reporting timings only at token boundaries.
benob commented
Also, what are the tokenization rules in jtrans?
jorio commented
Tokens are separated on whitespace and punctuation marks (including apostrophes). For more details, see https://github.com/synalp/jtrans/blob/master/src/fr/loria/synalp/jtrans/markup/in/RawTextLoader.java#L19
Since none of the input formats handled by JTrans provide a tokenization, there's nothing to keep intact, so to speak.
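To give a rough idea of the splitting rule described above, here is a minimal Python sketch. It is not the actual `RawTextLoader` code: the `tokenize` name and the regex are assumptions made for illustration, and the sketch simply discards separators, which may differ from what JTrans does with punctuation.

```python
import re

# Assumption: a "separator" is any run of characters that are neither
# word characters nor underscores, i.e. whitespace and punctuation,
# apostrophes and hyphens included. The real rules are in RawTextLoader.java.
_SEPARATORS = re.compile(r"\W+")


def tokenize(text):
    """Split text on whitespace and punctuation, dropping empty pieces."""
    return [tok for tok in _SEPARATORS.split(text) if tok]


if __name__ == "__main__":
    print(tokenize("l'homme est là, n'est-ce pas ?"))
    # ['l', 'homme', 'est', 'là', 'n', 'est', 'ce', 'pas']
```

Note how the apostrophe splits "l'homme" into two tokens, which is the kind of retokenization the original request wanted to avoid.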