Incorrect indexing in tokens
Closed · 1 comment
Kristober commented
I am getting a strange response when entering the text "Afþví að það er":
"result": [
  [
    {
      "annotations": [
        {
          "code": "C002",
          "detail": null,
          "end": 1,
          "end_char": 4,
          "references": [],
          "start": 0,
          "start_char": 0,
          "suggest": "Af því",
          "suggestlist": null,
          "text": "Orðinu 'Afþví' var skipt upp"
        },
        {
          "code": "P_NT_FsMeðFallstjórn",
          "detail": "Forsetningin 'að' stýrir þágufalli.",
          "end": 3,
          "end_char": 11,
          "references": [],
          "start": 3,
          "start_char": 8,
          "suggest": "því",
          "suggestlist": null,
          "text": "Á sennilega að vera 'því'"
        }
      ],
      "corrected": "Af því að það er",
      "nonce": "81377724",
      "original": "Afþví að það er",
      "token": "53fdf5f5a962ac05e68bc1b960703777069b8e4dceaca1fed1c1609e247eb5ea",
      "tokens": [
        {
          "i": 0,
          "k": 6,
          "o": "Afþví",
          "x": "Af"
        },
        {
          "i": 5,
          "k": 6,
          "o": "því",
          "x": "því"
        },
        {
          "i": 5,
          "k": 6,
          "o": " að",
          "x": "að"
        },
        {
          "i": 8,
          "k": 6,
          "o": " það",
          "x": "það"
        },
        {
          "i": 12,
          "k": 6,
          "o": " er",
          "x": "er"
        }
      ]
    }
  ]
],
The second annotation suggests "því" for character indexes 8 to 11, where the input is "það", which doesn't make sense.
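To make it easier to see what each annotation points at, here is a minimal Python sketch that slices the original text with each annotation's start_char and end_char. The values are copied from the response above. The helper name covered_text is hypothetical, and the sketch assumes the offsets are character positions in "original". The response does not say whether end_char is inclusive or exclusive, so both readings are shown.

```python
# Values copied from the response above; only the fields needed here are kept.
original = "Afþví að það er"
annotations = [
    {"code": "C002", "start_char": 0, "end_char": 4, "suggest": "Af því"},
    {"code": "P_NT_FsMeðFallstjórn", "start_char": 8, "end_char": 11, "suggest": "því"},
]


def covered_text(text, ann, inclusive_end):
    """Return the part of `text` that an annotation's character span covers.

    ASSUMPTION: start_char/end_char index into the original input string.
    Whether end_char is inclusive is not documented in the issue, so the
    caller chooses the reading explicitly.
    """
    end = ann["end_char"] + (1 if inclusive_end else 0)
    return text[ann["start_char"]:end]


inclusive = [covered_text(original, a, inclusive_end=True) for a in annotations]
exclusive = [covered_text(original, a, inclusive_end=False) for a in annotations]

for ann, inc, exc in zip(annotations, inclusive, exclusive):
    print(ann["code"], "inclusive:", repr(inc), "exclusive:", repr(exc),
          "suggest:", repr(ann["suggest"]))
```

Under the inclusive reading the two spans are "Afþví" and " það" (with its leading space); under the exclusive reading they are "Afþv" and " þa".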
Also, "því" is listed in tokens with i: 5, which puts it at the same position as "að", probably because the indexing doesn't account for the change in length of the word it was split from ("Afþví").
Should "því" be listed under tokens at all if "Afþví" is already listed there, or does the indexing need adjusting?
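A small Python sketch of the token offsets, using the values from the response above, shows where they stop lining up. It assumes (this is my reading, not something the response documents) that "i" is the character offset of "o" in the original text and that the "o" values are meant to concatenate back to the input. The function name offset_mismatches is hypothetical.

```python
# Values copied from the response above; "k" is omitted because it is not used.
original = "Afþví að það er"
tokens = [
    {"i": 0, "o": "Afþví", "x": "Af"},
    {"i": 5, "o": "því", "x": "því"},
    {"i": 5, "o": " að", "x": "að"},
    {"i": 8, "o": " það", "x": "það"},
    {"i": 12, "o": " er", "x": "er"},
]


def offset_mismatches(toks):
    """List tokens whose reported `i` differs from the running length of `o`.

    ASSUMPTION: `i` should equal the summed length of all preceding `o`
    strings. Each result is (token index, o, reported i, expected i).
    """
    mismatches = []
    expected = 0
    for index, tok in enumerate(toks):
        if tok["i"] != expected:
            mismatches.append((index, tok["o"], tok["i"], expected))
        expected += len(tok["o"])
    return mismatches


mismatches = offset_mismatches(tokens)
rebuilt = "".join(tok["o"] for tok in tokens)

# Hypothetical variant: the split-off token carries no original text of its own.
patched = [dict(tok) for tok in tokens]
patched[1]["o"] = ""

print(mismatches)
print(repr(rebuilt), rebuilt == original)
print(offset_mismatches(patched))
```

With the data as returned, the last three tokens are each reported 3 characters earlier than the running length of "o", and joining the "o" values gives "Afþvíþví að það er" rather than the input. If the split-off "því" token had an empty "o", every reported "i" would match, which suggests the "i" values follow the input string while the second token's "o" does not; whether that is the intended behaviour is the open question.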
sveinbjornt commented
This is a GreynirCorrect issue