Special symbols like © seem to mess with sourcekit results

Question

nathankot opened this issue 9 years ago · 9 comments

As per what @galeo discovered in nathankot/company-sourcekit#16, I'll post my findings here:

Completion without the copyright symbol, offset is at: CGRect(|):

$ sourcekitten complete --text '# ; import AVFoundation; CGRect()' --offset 32 | head

[{
  "sourcetext" : "origin: <#T##CGPoint#>, size: <#T##CGSize#>"
}, ... ]

Completion with the copyright symbol, offset is at CGRect(|):

$ sourcekitten complete --text '# ©; import AVFoundation; CGRect()' --offset 33 | head

[{
  "sourcetext" : "()",
}, ... ]

Completion with the copyright symbol, offset is (seemingly) incorrect at CGRect()|:

$ sourcekitten complete --text '# ©; import AVFoundation; CGRect()' --offset 34 | head

[{
  "sourcetext" : "origin: <#T##CGPoint#>, size: <#T##CGSize#>",
}, ... ]

It looks like Xcode isn't considering the © character at all.

I'm not sure if this is desired behavior on SourceKit's part, but it'd be interesting to get your input, @terhechte @seanfarley.

seanfarley commented 8 years ago

👍

Answer 1 · 2016-05-10T10:57:42.000Z

I wonder if this also applies to other characters or if this is a special case with only the © symbol.

Answer 2 · 2016-05-10T12:21:21.000Z

Most likely applies to others as well


Answer 3 · 2016-05-10T15:57:48.000Z

Does it only happen if it is in the same line? Or also if it is somewhere before the current cursor? To remedy this, we'd probably need to open the file, jump to the correct offset, and go back to make sure that none of those special characters are in there, right?

Answer 4 · 2016-05-10T18:41:52.000Z

I imagine this is due to Unicode, since some Unicode characters take up more than one byte. That's a bit hand-wavy, I realize, but if sourcekit is expecting a byte string (warning: this is just a guess), then the counting will be off with Unicode. You can see this in Python 2:

$ python2.7 -c 'print len("😈")'
4

$ python2.7 -c 'print len("©")'
2
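
The same counting can be checked against the two command lines from the original report. This is only a sketch in Python 3, where `len` on a string counts code points, so the byte count needs an explicit encode. The helper name `cursor_offsets` is made up for illustration and is not part of any tool mentioned here:

```python
def cursor_offsets(text, marker="CGRect("):
    """Return (character offset, UTF-8 byte offset) of the position just after `marker`."""
    chars = text.index(marker) + len(marker)
    # Encode only the text before the cursor to count how many UTF-8 bytes it occupies.
    return chars, len(text[:chars].encode("utf-8"))

print(cursor_offsets("# ; import AVFoundation; CGRect()"))   # (32, 32)
print(cursor_offsets("# ©; import AVFoundation; CGRect()"))  # (33, 34)
```

With the © present, the character offset is 33 but the byte offset is 34, which lines up with `--offset 34` being the value that returned the expected completion.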

Answer 5 · 2016-05-11T02:27:53.000Z

Nice :) I propose we fix this in either sourcekittendaemon or sourcekitten:

  7> "©".utf8.count
$R2: Distance = 2
  8> "©".characters.count
$R3: Distance = 1

Answer 6 · 2016-05-11T02:30:59.000Z

Actually, now that I think about it, this really has to be fixed in the editor integrations, doesn't it? Otherwise the top layers would need to do magic to translate a character offset into a UTF-8 byte offset.
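
For illustration, the translation in question could look roughly like the sketch below. It is written in Python rather than in any editor's own language, and it assumes the editor reports the cursor as a count of Unicode code points and that sourcekit wants a UTF-8 byte offset, which is what the experiments in this thread suggest but is not confirmed. The function name is hypothetical:

```python
def char_offset_to_utf8_offset(text: str, char_offset: int) -> int:
    """Translate a code-point offset into `text` to the matching UTF-8 byte offset."""
    if not 0 <= char_offset <= len(text):
        raise ValueError("char_offset is outside the text")
    # The byte offset is the encoded length of everything before the cursor.
    return len(text[:char_offset].encode("utf-8"))

source = "# ©; import AVFoundation; CGRect()"
print(char_offset_to_utf8_offset(source, 33))  # 34
```

An editor that already tracks byte positions can skip this step and pass its own value, which is why doing the conversion in the integration is attractive.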

Answer 7 · 2016-05-11T02:54:04.000Z

In emacs:

(position-bytes (point))

Answer 8 · 2016-05-11T03:20:48.000Z

This has been fixed in company-sourcekit :) I'll add a note to the readme for sourcekittendaemon and close this.