synalp/jtrans

Token lost in alignment


I have // tokens in a .trs file that mark sentence boundaries. Unfortunately, they do not appear in the alignment. Is it possible to keep them?

It should be possible to carry "silent" tokens, or other kinds of annotation, through the aligner to the output, but we may need to modify the I/O module for each format. This also raises the question of whether JTrans should preserve the full set of incoming annotations, whatever they are, along with related issues such as input and output tokens that are not the same. I wonder whether a more general solution to this kind of feature would be to string-align the text at the output of JTrans with the text at the input, and project the output timestamps onto the input text, so that in the end JTrans has only enriched the input annotations and has not altered them. But this should typically be done outside JTrans, because it depends on the format of the input files.
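To make the string-alignment idea concrete, here is a minimal sketch in Python, written as an external post-processing step. It is not part of JTrans and does not use any JTrans API: the function name `project_timestamps`, the `(token, start, end)` tuple layout, and the sample timings are all assumptions made for illustration. It uses the standard library's `difflib.SequenceMatcher` to match the input tokens against the aligner's output tokens, then copies the timestamps onto the matched input tokens. Tokens that the aligner dropped, such as //, stay in place with no timestamp.

```python
from difflib import SequenceMatcher


def project_timestamps(input_tokens, aligned):
    """Project aligner timestamps back onto the original input tokens.

    input_tokens: list of token strings as they appear in the input file.
    aligned: list of (token, start, end) tuples from the aligner output
             (hypothetical layout, not a JTrans data structure).
    Returns one (token, start, end) tuple per input token; start and end
    are None for tokens that have no counterpart in the aligner output.
    """
    output_tokens = [token for token, _, _ in aligned]
    # autojunk=False keeps frequent tokens from being ignored on long inputs.
    matcher = SequenceMatcher(a=input_tokens, b=output_tokens, autojunk=False)

    # Start with every input token untimed, then fill in the matched ones.
    projected = [(token, None, None) for token in input_tokens]
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            _, start, end = aligned[block.b + offset]
            index = block.a + offset
            projected[index] = (input_tokens[index], start, end)
    return projected


# Made-up example: the input has // boundary tokens, the aligner output does not.
input_tokens = ["bonjour", "//", "comment", "ça", "va", "//"]
aligned = [
    ("bonjour", 0.0, 0.5),
    ("comment", 0.7, 1.1),
    ("ça", 1.1, 1.3),
    ("va", 1.3, 1.6),
]

for token, start, end in project_timestamps(input_tokens, aligned):
    print(token, start, end)
```

A boundary token left without a timestamp could then be given one from its neighbours (for example, the end time of the preceding word), but that choice depends on the input format, which is the reason for keeping this step outside JTrans.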