Bad tokenisation of ordinal numbers
TomazErjavec commented
The tokeniser very often treats the period after a number as a separate token and as the end of the sentence, even though the period is part of the (ordinal) number, as the context also makes obvious. This is a serious bug.
Example (already formatted as XML):
<p>
<s>
5. 5. Kav
12 12 Kag
<g/>
. . U
</s>
<s>
2000 2000 Kag
</s>
</p>
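
The expected behaviour can be illustrated with a minimal sketch. It assumes the example above comes from a date written as "5. 12. 2000" (the raw input is not shown in this report, so that string is a guess), and it uses a hypothetical `tokenise` function with one simple context rule: a period directly attached to a number stays part of that number when the next token, after whitespace, starts with a digit or a lowercase letter. This is not the tokeniser's actual rule set, only one possible heuristic.

```python
import re

# Hypothetical sketch, not the tokeniser's real implementation.
# Split text into digit runs, letter runs, and single punctuation marks.
PIECE_RE = re.compile(r"\d+|[^\W\d_]+|[^\w\s]")


def tokenise(text):
    """Tokenise text, keeping 'N.' together when context suggests an ordinal."""
    pieces = list(PIECE_RE.finditer(text))
    tokens = []
    i = 0
    while i < len(pieces):
        cur = pieces[i]
        tok = cur.group()
        if tok.isdigit() and i + 1 < len(pieces):
            nxt = pieces[i + 1]
            # The period must be glued to the number (no space in between).
            if nxt.group() == "." and nxt.start() == cur.end():
                after = pieces[i + 2] if i + 2 < len(pieces) else None
                # Assumed rule: whitespace, then a digit or lowercase letter,
                # means the sentence continues and the number is an ordinal.
                if after is not None and after.start() > nxt.end():
                    first = after.group()[0]
                    if first.isdigit() or first.islower():
                        tokens.append(tok + ".")
                        i += 2
                        continue
        tokens.append(tok)
        i += 1
    return tokens


print(tokenise("5. 12. 2000"))
```

Under this rule "12." is kept as a single token and no sentence break is introduced before "2000". A number followed by a period and then a capitalised word is still split, which is the ambiguous case a real fix would have to handle with more context.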