`ZkNewFromTitleSelection` does not copy last character correctly if it's non-ascii

Check if applicable

I have searched the existing issues (required)
I'm willing to help fix the problem and contribute a pull request

Describe the bug

Same problem #125 reports, in my case the character is Cyrillic.

The created title ends with � and there is extra code left after the link in original note.

How to reproduce?

Add a note with content: "Текст заголовка"
Visually select "Текст заголовка"
Run :'<,'>ZkNewFromTitleSelection
Check the title of the newly created file
Check the link inserted in the original file

zk configuration

# NOTE SETTINGS

# Defines the default options used when generating new notes.
[note]

# Language used when writing notes.
# This is used to generate slugs or with date formats.
language = "ru"

# The default title used for new note, if no `--title` flag is provided.
default-title = "Untitled"

# Template used to generate a note's filename, without extension.
# filename = "{{format-date now '%Y%m%d%H%M%S'}}"
filename = "{{id}}"

# The file extension used for the notes.
extension = "md"

# Template used to generate a note's content.
# If not an absolute path, it is relative to .zk/templates/
template = "default.md"

# Configure random ID generation.

# The charset used for random IDs.
id-charset = "alphanum"

# Length of the generated IDs.
id-length = 4

# Letter case for the random IDs.
id-case = "lower"

# Path globs ignored while indexing existing notes.
ignore = [
  ".obsidian",
	"templates"
]


# EXTRA VARIABLES
#
# A dictionary of variables you can use for any custom values when generating
# new notes. They are accessible in templates with {{extra.<key>}}
[extra]

#key = "value"


# GROUP OVERRIDES
[group.journal]
paths = ["journal"]

[group.journal.note]
filename = "{{format-date now '%Y-%m-%d'}}"
extension = "md"
template = "journal.md"


# MARKDOWN SETTINGS
[format.markdown]

# Format used to generate links between notes.
# Either "wiki", "markdown" or a custom template. Default is "markdown".
link-format = "wiki"
# Indicates whether a link's path will be percent-encoded.
# Defaults to true for "markdown" format and false for "wiki" format.
link-encode-path = true
# Indicates whether a link's path file extension will be removed.
# Defaults to true.
link-drop-extension = true

# Enable support for #hashtags.
hashtags = true
# Enable support for :colon:separated:tags:.
colon-tags = false
# Enable support for Bear's #multi-word tags#
# Hashtags must be enabled for multi-word tags to work.
multiword-tags = false


# EXTERNAL TOOLS
[tool]

# Default editor used to open notes. When not set, the EDITOR or VISUAL
# environment variables are used.
editor = "nvim"

# Pager used to scroll through long output. If you want to disable paging
# altogether, set it to an empty string "".
pager = "less -FIRX"

# Command used to preview a note during interactive fzf mode.
fzf-preview = "bat -p --color always {-1}"


# LSP
#
#   Configure basic editor integration for LSP-compatible editors.
#   See https://github.com/zk-org/zk/blob/main/docs/editors-integration.md
#
[lsp]

[lsp.diagnostics]
# Each diagnostic can have for value: none, hint, info, warning, error

# Report titles of wiki-links as hints.
#wiki-title = "hint"
# Warn for dead links between notes.
dead-link = "error"

[lsp.completion]
# Customize the completion pop-up of your LSP client.

# Show the note title in the completion pop-up, or fallback on its path if empty.
note-label = "{{title-or-path}}"
# Filter out the completion pop-up using the note title or its path.
note-filter-text = "{{title}} {{path}}"
# Show the note filename without extension as detail.
note-detail = ""

Neovim configuration

-- Bootstrap lazy.nvim
local lazypath = vim.fn.stdpath("data") .. "/lazy/lazy.nvim"
if not (vim.uv or vim.loop).fs_stat(lazypath) then
  local lazyrepo = "https://github.com/folke/lazy.nvim.git"
  local out = vim.fn.system({ "git", "clone", "--filter=blob:none", "--branch=stable", lazyrepo, lazypath })
  if vim.v.shell_error ~= 0 then
    vim.api.nvim_echo({
      { "Failed to clone lazy.nvim:\n", "ErrorMsg" },
      { out, "WarningMsg" },
      { "\nPress any key to exit..." },
    }, true, {})
    vim.fn.getchar()
    os.exit(1)
  end
end
vim.opt.rtp:prepend(lazypath)

-- Make sure to setup `mapleader` and `maplocalleader` before
-- loading lazy.nvim so that mappings are correct.
-- This is also a good place to setup other settings (vim.opt)
vim.g.mapleader = " "
vim.g.maplocalleader = "\\"

-- Setup lazy.nvim
require("lazy").setup({
  spec = {
    -- add your plugins here
    {
        "zk-org/zk-nvim",
        config = function()
          require("zk").setup({
            -- See Setup section below
	    picker = select,

      lsp = {
        -- `config` is passed to `vim.lsp.start_client(config)`
        config = {
          cmd = { "zk", "lsp" },
          name = "zk",
          -- on_attach = ...
          -- etc, see `:h vim.lsp.start_client()`
        },

        -- automatically attach buffers in a zk notebook that match the given filetypes
        auto_attach = {
          enabled = true,
          filetypes = { "markdown" },
        },
      },
          })
        end
      }
  },
  -- Configure any other settings here. See the documentation for more details.
  -- colorscheme that will be used when installing plugins.
  install = { colorscheme = { "habamax" } },
  -- automatically check for plugin updates
  checker = { enabled = true },
})

Environment

zk 0.14.1
system: Darwin 23.4.0 x86_64
NVIM v0.10.0
Build type: Release
LuaJIT 2.1.1720049189

I can confirm that when I copy and paste the text, and use it as the title of a new note, that I get the same errors as you report.

However, if I replace the 'a' with a regular typed 'a', there is no error.
I'm completely ignorant to Cryllic, and only a little learned in Unicode...
So I have to ask, is there a difference between the 'a' in "Текст заголовка" and a Latin 'a'? Your discussion title suggests that this is a Unicode 'a'? But (see below), it would seem like it is actually ascii (with g8 returning d0 b0)?..
As I see the 'k" has a different shape than a Latin 'k' from my keyboard, yet the 'a' looks identical to my eyes.

As, if I inspect the value of the letter in neovim:

The copy and pasted 'a':

:asci returns <a> 97, Hex 61, Octal 141
g8 returns d0 b0

And the 'a' typed on my keyboard (Australian layout):

:ascii returns <а> 1072, Hex 0430, Oct 2060, Digr a=
g8 returns 61

So if there is no difference between the shape / etc between Cryllic and Latin 'a', then it would seem your keyboard is outputting a different value for the same symbol. Which then makes me wonder where this bug would come from, whether it's from our side or the Cryllic language support in neovim / terminal?

It's interesting too, because the other 'a' in "Текст заголовка" is being rendered fine. So it would seem there's some conflict between the end of string data, and the ascii value your keyboard is using for 'a'...

Tested some cases:

Текст заголовка | adding whitespace after the string, and including it in the highlighted text when making a new note: no error
Текст заголовка | the same, but not including the white space in the highlighted text: error
Текст заголовiа replace Cryllic 'k' with Latin 'i': error
Текст заголовкаi add a Latin 'i' after the original 'buggy' 'a': no error

So there is something odd going on with the combination of the 'a' from your keyboard with the end of string data.
This could still be on our end. Perhaps we're chopping off some data when extracting the highlighted text. But damn... This is a pretty obscure bug and pretty beyond my paygrade... I'm not sure I can handle this in any timely manner and am hoping one of the other maintainers has some easy clarifications at hand...

For reference, the file I was working in was unicode: Unicode text, UTF-8 text

From :h unicode

Bytes which are not part of a valid UTF-8 byte sequence are handled like a
single character and displayed as , where "xx" is the hex value of the
byte.

And the byte in our case being: <b0>

While Latin a and Cyrillic а look similar, they are parts of two completely different charsets. Cyrillic is a separate writing system in the same way Korean hangul or Greek alphabet are. It is outside of regular ASCII range and can be only found in extended sets. It does fit UTF-8.

You can ignore my example above and use Фывя злдж, which are all Cyrillic letters that look nothing like Latin alphabet and they will all render fine except the last ж. #125 had exactly same issue with Japanese Hiragana and Emoji, so it's not an obscure bug with some particular character but rather an issue with the last character of a title in a non-Latin based language.

Ok cool, thanks for the extra info and context 👍

@WhyNotHugo
I just saw your blog post about non-english characters. And it was quite extensive. Would you have any immediate insights here by chance? 😇

I don't use the neovim plugin, but I do use the LSP. My own repro steps:

Type Текст заголовка
Select the whole line (shift+v)
Trigger code action New note in current directory

This is what renders:

[Текст заголовка](yfl6)<b3>оловка

The input text is 15 characters and 29 bytes:

Т 0xd0 0xa2
е 0xd0 0xb5
к 0xd0 0xba
с 0xd1 0x81
т 0xd1 0x82
  0x20
з 0xd0 0xb7
а 0xd0 0xb0
г 0xd0 0xb3
о 0xd0 0xbe
л 0xd0 0xbb
о 0xd0 0xbe
в 0xd0 0xb2
к 0xd0 0xba
а 0xd0 0xb0

There are 13 trailing characters after the link's closing parenthesis. They are the last 13 ~~characters~~ bytes from the input string. The first byte is not valid utf-8, the rest are the last six characters from the input.

Those 13 characters are as follows:

(invalid utf) 0xb3
о 0xd0 0xbe
л 0xd0 0xbb
о 0xd0 0xbe
в 0xd0 0xb2
к 0xd0 0xba
а 0xd0 0xb0

I suspect that the LSP returns an invalid result. Log messages between neovim and the LSP, and check if the response is already corrupted at that point.

Tip: vim.lsp.set_log_level("debug").

Useless trivia

The output that I expect after running the code action should have been 37 bytes, but it's 50 bytes.

Ugh, my above test seems to have been messed up by selecting the whole line. Looks like my issue was caused by the trailing newline in the selection. Totally unrelated bug.

I can't repro by calling the LSP action, so I suspect the issue is the plugin itself.

In my case, this is the response that returns from the LSP:

[DEBUG][2024-09-11 10:00:01] .../vim/lsp/rpc.lua:408	"rpc.receive"	{  id = 0,  jsonrpc = "2.0",  method = "workspace/applyEdit",  params = {    edit = {      changes = {        ["file:///home/hugo/zk/test.md"] = { {            newText = "[Текст заголовка](huuh)",            range = {              ["end"] = {                character = 16,                line = 0              },              start = {                character = 0,                line = 0              }            }          } }      }    }  }}

It indicates that the client should replace the first 16 characters in the file with the string [Текст заголовка](huuh).

If we replace the first sixteen bytes instead of the first 16 characters, the result is exactly what neovim is producing.

So the issue that I'm seeing seems to be a bug in neovim, and entirely unrelated to op's issue.

I installed the neovim plugin and reproduced op's issue with LSP logging enabled. This is the request sent to he LSP:

[DEBUG][2024-09-11 10:11:53] ...m/lsp/client.lua:676	"LSP[zk]"	"client.request"	1	"workspace/executeCommand"	{  arguments = { "/home/hugo/zk/test.md", {      insertLinkAtLocation = {        range = {          ["end"] = {            character = 28,            line = 0          },          start = {            character = 0,            line = 0          }        },        uri = "file:///home/hugo/zk/test.md"      },      title = "Текст заголовк    } },  command = "zk.new"}	<function 1>	1

EDIT: Note that title = "Текст заголовк appears to have no closing quote. This is a rendering but in the logs, because the previous byte is an invalid sequence. It's the � that op was seeing.

The message indicates that the selected range ends at character 28. This is zero-indexed, so neovim is telling the LSP "execute the action on this range with 29 characters". But the range has 15 characters, and 29 bytes, so this doesn't sound right to me.

At this point, we've given the LSP invalid input, and it returned garbled output:

[DEBUG][2024-09-11 10:10:20] .../vim/lsp/rpc.lua:408	"rpc.receive"	{  id = 0,  jsonrpc = "2.0",  method = "workspace/applyEdit",  params = {    edit = {      changes = {        ["file:///home/hugo/zk/test.md"] = { {            newText = "[Текст заголовк�](h71d)",            range = {              ["end"] = {                character = 28,                line = 0              },              start = {                character = 0,                line = 0              }            }          } }      }    }  }}

The exact meaning of these positions is negotiated between the client and server, but the default is UTF-16: https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#position

UTF-16 because LSP has origins in JavaScript world, which uses this encoding by default. Nobody else (except Java?) uses this.

This is also what neovim and the LSP agree on in this case:

[DEBUG][2024-09-11 10:17:59] .../vim/lsp/rpc.lua:286	"rpc.send"	{  id = 1,  jsonrpc = "2.0",  method = "initialize",  params = {    capabilities = {      general = {        positionEncodings = { "utf-16" }      },      textDocument = {        callHierarchy = {          dynamicRegistration = false        },        codeAction = {          codeActionLiteralSupport = {            codeActionKind = {              valueSet = { "", "quickfix", "refactor", "refactor.extract", "refactor.inline", "refactor.rewrite", "source", "source.organizeImports" }            }          },          dataSupport = true,          dynamicRegistration = true,          isPreferredSupport = true,          resolveSupport = {            properties = { "edit" }          }        },        completion = {          completionItem = {            commitCharactersSupport = false,            deprecatedSupport = false,            documentationFormat = { "markdown", "plaintext" },            preselectSupport = false,            snippetSupport = false          },          completionItemKind = {            valueSet = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25 }          },          completionList = {            itemDefaults = { "editRange", "insertTextFormat", "insertTextMode", "data" }          },          contextSupport = false,          dynamicRegistration = false        },        declaration = {          linkSupport = true        },        definition = {          dynamicRegistration = true,          linkSupport = true        },        diagnostic = {          dynamicRegistration = false        },        documentHighlight = {          dynamicRegistration = false        },        documentSymbol = {          dynamicRegistration = false,          hierarchicalDocumentSymbolSupport = true,          symbolKind = {            valueSet = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26 }          }        },        formatting = {          dynamicRegistration = true        },        hover = {          contentFormat = { "markdown", "plaintext" },          dynamicRegistration = true        },        implementation = {          linkSupport = true        },        inlayHint = {          dynamicRegistration = true,          resolveSupport = {            properties = { "textEdits", "tooltip", "location", "c

In UTF-16, the input string is 32bytes long.

Neovim says the range ends at 28. This index is zero-based, so neovim is telling the LSP "apply this action on this range of length 29".

The selected text is "of length 29" if you encode in utf-8, but in utf-16, it should be of length 32.

Neovim needs to re-encode the text into utf-16 to send the proper length or negotiate utf-8 with the server. The problem is that support for utf-8 is not mandatory per the specification, because that would make life too simple.

I inspected the value returned by:

https://github.com/zk-org/zk-nvim/blob/main/lua/zk/commands/builtin.lua#L21

And it returns:

{                                                                                                                 
  range = {                                                                                                       
    ["end"] = {                                                                                                   
      character = 28,                                                                                             
      line = 0                                                                                                    
    },                                                                                                            
    start = {                                                                                                     
      character = 0,                                                                                              
      line = 0                                                                                                    
    }                                                                                                             
  },                                                                                                              
  uri = "file:///home/hugo/zk/test.md"                                                                            
}

If I understand correctly, this function needs to return indexes in utf-16 code points, but this is the index in bytes.

I added print(selected_text) on the value that the plugin sends to the LSP:

zk-nvim/lua/zk/commands/builtin.lua

Line 22 in dbf4eea

local selected_text = util.get_text_in_range(location.range)

And it prints:

Текст заголовк<d0>

This is missing the last byte. It is not entirely clear to me why.

This patch fixes the text for the newly created note, but the link to it still has a trailing broken byte:

diff --git a/lua/zk/util.lua b/lua/zk/util.lua
index 70f9701..43c15b6 100644
--- a/lua/zk/util.lua
+++ b/lua/zk/util.lua
@@ -100,13 +100,15 @@ function M.get_text_in_range(range)
   local A = range["start"]
   local B = range["end"]
 
-  local lines = vim.api.nvim_buf_get_lines(0, A.line, B.line + 1, true)
-  if vim.tbl_isempty(lines) then
-    return nil
+  local region = vim.region(0, { A.line, A.character }, { B.line, B.character }, vim.fn.visualmode(), true)
+
+  local lines = {}
+  local maxcol = vim.v.maxcol
+  for line, cols in vim.spairs(region) do
+    local endcol = cols[2] == maxcol and -1 or cols[2]
+    local chunk = vim.api.nvim_buf_get_text(0, line, cols[1], line, endcol, {})[1]
+    table.insert(lines, chunk)
   end
-  local MAX_STRING_SUB_INDEX = 2 ^ 31 - 1 -- LuaJIT only supports 32bit integers for `string.sub` (in block selection B.character is 2^31)
-  lines[#lines] = string.sub(lines[#lines], 1, math.min(B.character, MAX_STRING_SUB_INDEX))
-  lines[1] = string.sub(lines[1], math.min(A.character + 1, MAX_STRING_SUB_INDEX))
   return table.concat(lines, "\n")
 end

See: #187

The inserted link still has a trailing invalid byte. I don't know why this happens, but my impression is that when neovim replaces the original text, it replaces up to the first byte in the last character, and the second byte in that character is left behind.

If my impression is correct, it is the same bug that I saw when calling the code action manually, and it is likely a bug in neovim itself.

Regarding the scenario where I selected the trailing newline, from the LSP spec:

If the character value is greater than the line length it defaults back to the line length.

I think this is what happened in my case, the line length seems to be calculated by counting characters, but the replacement applied counting bytes.

As per my understanding of how positions are supposed to work in the language server protocol, a lot of the indexes being exchanged are wrong. Both neovim and the LSP are counting octets instead of code units. At least one of the following is true:

Both neovim and the LSP are doing something slightly wrong, but align with one another.
My understanding of the above link is mistaken.

The latest version of the linked PR fixes all issues when using the plugin.

Outstanding!!! I've had an initial read, but will need another pass or two to get it through the remaining less thick layers of my skull. Will hit you up with any questions over in the PR if needed 😄