position_from_utf16 in workspace.py may return incorrect result.
pyscripter opened this issue ยท 4 comments
The following code that mimics the way position_from_utf16 calculates the result, demonstrates the issue:
def utf16_unit_offset(chars: str):
"""Calculate the number of characters which need two utf-16 code units.
Arguments:
chars (str): The string to count occurrences of utf-16 code units for.
"""
return sum(ord(ch) > 0xFFFF for ch in chars)
s = '๐๐'
def position_from_utf16(line, pos):
return pos - utf16_unit_offset(line[:pos])
print(position_from_utf16(s, 2))
The result is zero, but it should be one.
Although, this may not have a large real-world impact (you don't get many high-plane chars in python code), it would be nice to have a correct calculation.
This is valid python code(!) that may cause the above failure:
def ใจใจ่่๐คฏ๐คฏ():
pass
if __name__ == '__main__':
ใจใจ่่๐คฏ๐คฏ()
Unfortunately I don't know this part of the LSP spec well enough to understand why the correct result is one!
print(position_from_utf16(s, 2))
I assume this is simulating a client sending a position that is after the first emoji? Something like
๐|๐
where |
represents the cursor. And that the correct result is 1
since in python strings this is the index after the first character?
If so... would the fix be to do something like
def position_from_utf16(line, pos):
idx = max(0, pos-1)
return pos - utf16_unit_offset(line[:idx])
so that the second emoji is not passed to the utf16_unit_offset
function?
I assume this is simulating a client sending a position that is after the first emoji? Something like
Yes
If so... would the fix be to do something like
No that would not work
Something like:
def position_from_utf16(line, pos):
l = len(line)
i = p = 0
while (i < l) and (p < pos):
p += 2 if (ord(line[i]) > 0xFFFF) else 1
i += 1
return i
should work.