mideind/Yfirlestur

Incorrect indexing in tokens

Closed · 1 comment

I am getting a strange response when entering the text "Afþví að það er":

"result": [
        [
            {
                "annotations": [
                    {
                        "code": "C002",
                        "detail": null,
                        "end": 1,
                        "end_char": 4,
                        "references": [],
                        "start": 0,
                        "start_char": 0,
                        "suggest": "Af því",
                        "suggestlist": null,
                        "text": "Orðinu 'Afþví' var skipt upp"
                    },
                    {
                        "code": "P_NT_FsMeðFallstjórn",
                        "detail": "Forsetningin 'að' stýrir þágufalli.",
                        "end": 3,
                        "end_char": 11,
                        "references": [],
                        "start": 3,
                        "start_char": 8,
                        "suggest": "því",
                        "suggestlist": null,
                        "text": "Á sennilega að vera 'því'"
                    }
                ],
                "corrected": "Af því að það er",
                "nonce": "81377724",
                "original": "Afþví að það er",
                "token": "53fdf5f5a962ac05e68bc1b960703777069b8e4dceaca1fed1c1609e247eb5ea",
                "tokens": [
                    {
                        "i": 0,
                        "k": 6,
                        "o": "Afþví",
                        "x": "Af"
                    },
                    {
                        "i": 5,
                        "k": 6,
                        "o": "því",
                        "x": "því"
                    },
                    {
                        "i": 5,
                        "k": 6,
                        "o": "",
                        "x": ""
                    },
                    {
                        "i": 8,
                        "k": 6,
                        "o": " það",
                        "x": "það"
                    },
                    {
                        "i": 12,
                        "k": 6,
                        "o": " er",
                        "x": "er"
                    }
                ]
            }
        ]
    ],

The second annotation suggests "því" for character indexes 8 to 11 (start_char 8, end_char 11), but the input at that position is "það", which doesn't make sense.
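To make the span concrete, here is a minimal Python sketch that slices the original input with the character offsets from the two annotations. It assumes that end_char is inclusive, which is only an inference from the first annotation (0 to 4 covers the five characters of "Afþví"); the helper name annotated_text is made up for illustration and is not part of any API.

```python
def annotated_text(original: str, start_char: int, end_char: int) -> str:
    """Return the part of the input that an annotation points at.

    Assumption: end_char is inclusive (inferred from the first annotation).
    """
    return original[start_char:end_char + 1]


original = "Afþví að það er"

# First annotation: start_char 0, end_char 4
first = annotated_text(original, 0, 4)

# Second annotation: start_char 8, end_char 11
second = annotated_text(original, 8, 11)

print(repr(first))   # 'Afþví'
print(repr(second))  # ' það'
```

Under that assumption the second span covers " það" including its leading space, which matches the token listed with i: 8.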

Also, "því" is listed in tokens with i: 5, which puts it at the same position as "að". This is probably because the indexing doesn't take into account that the word it was split from ("Afþví") changes in length.
Should "því" be listed under tokens at all if "Afþví" is already listed there, or does the indexing need adjusting?
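The mismatch can be checked mechanically. The sketch below assumes that "i" is the character offset of a token in the original input and that "o" is the token's original text including any leading whitespace; both are readings of the response above, not documented facts. The function find_offset_mismatches is a hypothetical helper, not part of GreynirCorrect.

```python
def find_offset_mismatches(original, tokens):
    """List tokens whose "o" text is not found at offset "i" in the input.

    Assumptions: "i" is a character offset into the original text and
    "o" is the original token text, including leading whitespace.
    Returns tuples of (token index, i, o, text actually found at i).
    """
    mismatches = []
    for n, tok in enumerate(tokens):
        start, text = tok["i"], tok["o"]
        found = original[start:start + len(text)]
        if found != text:
            mismatches.append((n, start, text, found))
    return mismatches


original = "Afþví að það er"

# Tokens copied from the response above (the "k" field is omitted).
tokens = [
    {"i": 0, "o": "Afþví", "x": "Af"},
    {"i": 5, "o": "því", "x": "því"},
    {"i": 5, "o": "", "x": ""},
    {"i": 8, "o": " það", "x": "það"},
    {"i": 12, "o": " er", "x": "er"},
]

mismatches = find_offset_mismatches(original, tokens)
for n, start, text, found in mismatches:
    print(f"token {n}: expected {text!r} at offset {start}, found {found!r}")
```

With these assumptions only the second token fails the check: offset 5 in the input holds " að", not "því". The third token, which presumably stands for "að", has empty "o" and "x" fields, so it passes trivially even though its text is missing.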

This is a GreynirCorrect issue