position_from_utf16 in workspace.py may return incorrect result.

Question

position_from_utf16 in workspace.py may return incorrect result.

pyscripter opened this issue 2 years ago · 4 comments

The following code that mimics the way position_from_utf16 calculates the result, demonstrates the issue:

def utf16_unit_offset(chars: str):
    """Calculate the number of characters which need two utf-16 code units.
    Arguments:
        chars (str): The string to count occurrences of utf-16 code units for.
    """
    return sum(ord(ch) > 0xFFFF for ch in chars)

s = '😋😋'

def position_from_utf16(line, pos):
    return pos - utf16_unit_offset(line[:pos])

print(position_from_utf16(s, 2))

The result is zero, but it should be one.

Although, this may not have a large real-world impact (you don't get many high-plane chars in python code), it would be nice to have a correct calculation.

Answer 1 · 2022-12-09T09:56:14.000Z

This is valid python code(!) that may cause the above failure:

def 㒨㒨虁虁𤯒𤯒():
    pass

if __name__ == '__main__':
    㒨㒨虁虁𤯒𤯒()

Answer 2 · 2022-12-13T20:23:29.000Z

Unfortunately I don't know this part of the LSP spec well enough to understand why the correct result is one!

print(position_from_utf16(s, 2))

I assume this is simulating a client sending a position that is after the first emoji? Something like

😋|😋

where | represents the cursor. And that the correct result is 1 since in python strings this is the index after the first character?

If so... would the fix be to do something like

def position_from_utf16(line, pos):
    idx = max(0, pos-1)
    return pos - utf16_unit_offset(line[:idx])

so that the second emoji is not passed to the utf16_unit_offset function?

Answer 3 · 2022-12-13T21:29:06.000Z

I assume this is simulating a client sending a position that is after the first emoji? Something like

Yes

If so... would the fix be to do something like

No that would not work

Something like:

def position_from_utf16(line, pos):
    l = len(line)
    i = p = 0
    while (i < l) and (p < pos):
        p += 2 if (ord(line[i]) > 0xFFFF) else 1
        i += 1
    return i

should work.

Answer 4 · 2022-12-14T19:19:50.000Z

Can you have a look at #304 please?