Does not work with (multibyte) UTF8 characters

Question

Closed this issue a year ago · 5 comments

Thanks again for autoclose, which is terrific and very straightforward code to read!

I tried adding a pair for curly quotes (“”), and I ran into a problem. autoclose can insert the pair, but escape does not work. I think that's because you use Lua's string.sub to check for the last part of the pair, and Lua's string functions operate on bytes rather than UTF-8 characters. (I also think that my problem is somewhat related to issues #39 and #23.)
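To illustrate the byte-versus-character mismatch described above, here is a minimal sketch in plain Lua (no Neovim API). The helper `last_char` is hypothetical and not part of autoclose; it only shows one way to pick out the final UTF-8 character, assuming the input is valid UTF-8.

```lua
-- Byte-oriented string functions vs. a multibyte pair.
local pair = "“”"  -- two characters, three bytes each in UTF-8

-- The length operator and string.sub count bytes, not characters.
local byte_len = #pair          -- 6
local last_byte = pair:sub(-1)  -- a single trailing byte, not a whole "”"

-- Hypothetical helper: return the last UTF-8 character of a string by
-- matching one non-continuation byte followed by any continuation
-- bytes (0x80-0xBF), anchored at the end of the string.
local function last_char(s)
  return s:match("[^\128-\191][\128-\191]*$")
end

print(byte_len, #last_byte, last_char(pair) == "”")
```

So a check such as `pair:sub(-1) == key` can never succeed when `key` is a three-byte character, which is consistent with escape failing for curly quotes.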

This may be a reasonable limitation to keep for this plugin, but one approach occurs to me. If the pairs were entered as (for example) "(,)", then you could potentially pick out the closing part of the pair by splitting the pair on the comma. That might be a way forward, but it may complicate the code and people would have to adjust their configurations for the breaking change. In any case, I'm happy to work on this or submit a PR if you are interested.
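The comma-splitting idea could look roughly like this. `split_pair` is a hypothetical helper, not something autoclose provides, and the sketch assumes the opening part never contains a comma.

```lua
-- Hypothetical helper: split an "open,close" pair string on its first comma.
-- Returns nil when the string has no comma or an empty side.
local function split_pair(pair)
  local open, close = pair:match("^([^,]+),(.+)$")
  return open, close
end

local open, close = split_pair("“,”")
print(open, close)
```

Because each side is captured as a whole substring, the number of bytes per character no longer matters. A pair whose opening part is itself a comma would need a different separator or an escape rule.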

Answer 1 · 2023-09-16T12:41:29.000Z

Just in case I did something else wrong, here's the setup code I was using.

require("autoclose").setup({
    keys = {
        ["“"] = {
            close = true,
            escape = false,
            pair = "“”",
            enabled_filetypes = { "markdown", "text", "mail" },
        },
        ["”"] = {
            close = false,
            escape = true,
            pair = "“”",
            enabled_filetypes = { "markdown", "text", "mail" },
        },
    },
})

Answer 2 · 2023-09-17T14:17:56.000Z

> I also think that my problem is somewhat related to issues #39 and #23.

Yeah you are right.

I'd prefer this new way to configure a pair:

pair = { "(", ")" }

It would be great if you can open a PR to solve this! Just remember to mark the breaking change in the commit message (see Conventional Commits) so that users know about the change when they update the plugin.
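As a sketch of how the table form of `pair` might be consumed, the hypothetical helper below returns the two parts directly. The string fallback is an assumption made for illustration: it treats a legacy string pair as exactly two single-byte characters and rejects anything else.

```lua
-- Hypothetical helper: return the opening and closing parts of a pair.
-- Accepts the table form { open, close }; the string fallback assumes the
-- legacy form is exactly two single-byte characters.
local function pair_parts(pair)
  if type(pair) == "table" then
    assert(#pair == 2, "pair table must be { open, close }")
    return pair[1], pair[2]
  end
  assert(type(pair) == "string" and #pair == 2,
    "string pairs must be two single-byte characters")
  return pair:sub(1, 1), pair:sub(2, 2)
end

local open, close = pair_parts({ "“", "”" })
print(open, close)
```

With the table form, each part is already a complete string, so no byte arithmetic is needed to find where the closing part begins.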

Answer 3 · 2023-09-23T20:15:10.000Z

> It would be great if you can open a PR to solve this!

I apologize, but I was overly optimistic. After working on this for the past week, I've realized that the problems are a lot more complicated than I thought at first. So much of the current code relies on one character being equal to one byte, LuaJIT has no built-in way to deal with multibyte characters, and Neovim doesn't offer much support either.

I was able to get entering and escaping multibyte pairs to work, but I could not get deletion to work. (The problem is that if the key entered is, e.g., <BS>, you can't know how many bytes to grab relative to the cursor position in order to get "two characters." I tried using vim.str_utf_pos, but I couldn't get that to work for all positions on a line.) I may be missing something obvious, but I'm going to stop for now.
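For the deletion case, one possible way to find "two characters" around the cursor without knowing their widths in advance is to scan backward over UTF-8 continuation bytes. This is a plain-Lua sketch, not the approach tried in this thread. The helper names are hypothetical, `col` is assumed to be a 1-based byte index of the character under the cursor, and the line is assumed to be valid UTF-8.

```lua
-- Hypothetical helper: the character that ends just before byte index `col`.
local function char_before(line, col)
  local stop = col - 1
  if stop < 1 then return "" end
  local start = stop
  -- Continuation bytes look like 10xxxxxx (0x80-0xBF); step back over them
  -- until the lead byte of the character is reached.
  while start > 1 do
    local b = line:byte(start)
    if b < 128 or b > 191 then break end
    start = start - 1
  end
  return line:sub(start, stop)
end

-- Hypothetical helper: the character that starts at byte index `col`.
local function char_at(line, col)
  return line:match("^[^\128-\191][\128-\191]*", col) or ""
end

local line = "a“”b"
local col = 5  -- cursor sits between “ and ”
print(char_before(line, col) .. char_at(line, col))
```

If the concatenation equals a configured pair, `<BS>` could remove both characters. Neovim reports cursor columns as 0-based byte offsets, so real code would need to adjust `col` accordingly.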

As for the other two issues, the more I thought about those, the harder they seemed to me. Those involve not individual characters (of whatever width), but truly multi-character groups. Since users will type those one character at a time, I can't see how to escape or delete those without changing this plugin almost entirely. (I think you'd need to track state and/or use regexes heavily in order to do that. Consider, e.g., multi-line comments inside of multi-character surroundings.)

tl;dr — this plugin is simple and works remarkably well for single-character ASCII pairs. Since that's most of what programmers need from an autoclose plugin, I'm going to try to stop worrying about multibyte or multi-character cases for now. I'll leave this open if you want, and I'll probably keep thinking about it (despite myself), but I doubt I'll work on it right now.

Answer 4 · 2023-09-25T07:16:58.000Z

Thanks for the attempts to solve this issue! It sounds complicated, and it's also not aligned with the philosophy of this plugin, as I want to keep it minimal. I'll close the related issues and set them as wontfix.

Answer 5 · 2023-09-25T13:19:52.000Z

Naturally, after I said that I would stop thinking about the problem, I thought about it nonstop. After another day working on this, I have a version that supports multibyte characters. I will use this version for a week or two while I continue to try to clean up the code (which is ugly and may have edge cases I haven't thought of yet).

There is one problem with my solution. In documents with very long lines (e.g., in a markdown document where entire paragraphs are written as one line), the code may repeatedly call vim.str_utf_pos and build a lookup table of characters by their positions in the line. This is inefficient compared with line:sub(col, col + 1), but I can't see an alternative. Edit: since first making this comment, I was able to make the function that builds the lookup table cleaner. It's still less simple and efficient than line:sub(col, col + 1), but it's not terrible anymore. On the other hand, it now only works in insert mode. I think that's a reasonable trade-off since people probably don't need autoclosing for multibyte characters on the command line!
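The lookup-table idea can be sketched in plain Lua as follows. This is not the code from the work in progress: `char_starts` is a hypothetical stand-in for the list of character start positions, and `chars_around` is a hypothetical helper. Both assume valid UTF-8 and a 1-based byte index `col`.

```lua
-- Hypothetical stand-in for the lookup table: the starting byte index of
-- every UTF-8 character in `line`.
local function char_starts(line)
  local starts = {}
  -- "()" is a position capture; each match is one whole character.
  for pos in line:gmatch("()[^\128-\191][\128-\191]*") do
    starts[#starts + 1] = pos
  end
  return starts
end

-- Hypothetical helper: the character before `col` plus the character that
-- starts at `col`, or "" if `col` is not preceded and followed by a character.
local function chars_around(line, col)
  local starts = char_starts(line)
  for i, s in ipairs(starts) do
    if s == col and i > 1 then
      local next_start = starts[i + 1] or (#line + 1)
      return line:sub(starts[i - 1], next_start - 1)
    end
  end
  return ""
end

print(chars_around("a“”b", 5))
```

The table costs one pass over the line per call, which is the overhead mentioned above for very long lines. Caching the table per line, or scanning only near the cursor, would be possible refinements.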

You're under no obligation, of course, but if you have thoughts about my current solution, please comment there or make suggestions. I will submit my solution as a PR after I've tested it further and used it for a few weeks.