clarinsi/Obeliks4J

Bad tokenisation of ordinal numbers


The tokeniser very often treats the period after a number as a separate token and as the end of the sentence, even though the period is part of the (ordinal) number, as the context also makes obvious. This is a serious bug.
Example (already formatted as XML):

<p>
<s>
5.      5.      Kav
12      12      Kag
<g/>
.       .       U
</s>
<s>
2000    2000    Kag
</s>
</p>
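
To make the expected behaviour concrete, here is a minimal sketch of a context check, assuming the input behind the output above was the date "5. 12. 2000". It is not Obeliks4J code: the class `OrdinalPeriodSketch` and its methods are hypothetical names made up for illustration, and the rule itself (keep the period attached when the next token starts with a digit or a lowercase letter) is only an assumed heuristic. It would still mis-split an ordinal followed by a capitalised word.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Obeliks4J.
final class OrdinalPeriodSketch {

    // True if the token is a run of digits followed by exactly one period, e.g. "12."
    static boolean looksLikeOrdinal(String token) {
        return token.matches("\\d+\\.");
    }

    // Assumed heuristic: a following token that starts with a digit or a
    // lowercase letter suggests the sentence has not ended yet.
    static boolean continuesSentence(String next) {
        if (next == null || next.isEmpty()) {
            return false;
        }
        char first = next.charAt(0);
        return Character.isDigit(first) || Character.isLowerCase(first);
    }

    // Whitespace tokeniser that detaches the period from a number only when
    // the following token does not continue the sentence.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return tokens;
        }
        String[] raw = trimmed.split("\\s+");
        for (int i = 0; i < raw.length; i++) {
            String token = raw[i];
            String next = (i + 1 < raw.length) ? raw[i + 1] : null;
            if (looksLikeOrdinal(token) && !continuesSentence(next)) {
                // Period is treated as sentence-final punctuation.
                tokens.add(token.substring(0, token.length() - 1));
                tokens.add(".");
            } else {
                // Period stays part of the (ordinal) number, or the token is ordinary.
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("5. 12. 2000"));
        System.out.println(tokenize("Leta 2000. Potem"));
    }
}
```

With this sketch, "12." in the date stays a single token because "2000" follows it, while the period in "Leta 2000. Potem" is still split off as a sentence end.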