openlawlibrary/pygls

position_from_utf16 in workspace.py may return incorrect result.

pyscripter opened this issue ยท 4 comments

The following code that mimics the way position_from_utf16 calculates the result, demonstrates the issue:

def utf16_unit_offset(chars: str):
    """Calculate the number of characters which need two utf-16 code units.
    Arguments:
        chars (str): The string to count occurrences of utf-16 code units for.
    """
    return sum(ord(ch) > 0xFFFF for ch in chars)

s = '๐Ÿ˜‹๐Ÿ˜‹'

def position_from_utf16(line, pos):
    return pos - utf16_unit_offset(line[:pos])

print(position_from_utf16(s, 2))

The result is zero, but it should be one.

Although, this may not have a large real-world impact (you don't get many high-plane chars in python code), it would be nice to have a correct calculation.

This is valid python code(!) that may cause the above failure:

def ใ’จใ’จ่™่™๐คฏ’๐คฏ’():
    pass

if __name__ == '__main__':
    ใ’จใ’จ่™่™๐คฏ’๐คฏ’()

Unfortunately I don't know this part of the LSP spec well enough to understand why the correct result is one!

print(position_from_utf16(s, 2))

I assume this is simulating a client sending a position that is after the first emoji? Something like

๐Ÿ˜‹|๐Ÿ˜‹

where | represents the cursor. And that the correct result is 1 since in python strings this is the index after the first character?

If so... would the fix be to do something like

def position_from_utf16(line, pos):
    idx = max(0, pos-1)
    return pos - utf16_unit_offset(line[:idx])

so that the second emoji is not passed to the utf16_unit_offset function?

I assume this is simulating a client sending a position that is after the first emoji? Something like

Yes

If so... would the fix be to do something like

No that would not work

Something like:

def position_from_utf16(line, pos):
    l = len(line)
    i = p = 0
    while (i < l) and (p < pos):
        p += 2 if (ord(line[i]) > 0xFFFF) else 1
        i += 1
    return i

should work.

tombh commented

Can you have a look at #304 please?