henriklovhaug/md-tui

Unicode handling issues

cyqsimon opened this issue · 2 comments

Multi-char unicode characters inside code blocks seem to be improperly handled.

MRE

```sh
# 🙂
```

Running mdt file.md results in a panic:

The application panicked (crashed).
  byte index 4 is not a char boundary; it is inside '🙂' (bytes 3..7) of `
  # 🙂
  `
in src/nodes/textcomponent.rs, line 379
thread: main

The problem

for (i, c) in word.content().chars().enumerate() {
    if c == '\n' {
        end = i;
        let new_word =
            Word::new(word.content()[start..end].to_string(), word.kind());
        inner_content.push(new_word);
        start = i + 1;
        final_content.push(inner_content);
        inner_content = Vec::new();
    } else if i == word.content().len() - 1 {
        let new_word =
            Word::new(word.content()[start..].to_string(), word.kind());
        inner_content.push(new_word);
    }
}

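Here, str::chars iterates over Unicode characters, so the i from enumerate() is a character index, and that is what start and end hold. String slicing, however, indexes by byte offset and panics when an offset falls inside a multi-byte character. Every non-ASCII character takes 2 to 4 bytes in UTF-8, so word.content()[start..end] and word.content()[start..] are incorrect as soon as the content contains one. The i == word.content().len() - 1 check has the same problem: it compares a character index against a byte length.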
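To make the mismatch concrete, here is a small sketch using the MRE string from the panic message. The helper names `char_positions` and `byte_offsets` are made up for illustration and are not part of md-tui.

```rust
/// Positions as seen by `chars().enumerate()`: one step per character.
fn char_positions(s: &str) -> Vec<usize> {
    s.chars().enumerate().map(|(i, _)| i).collect()
}

/// Positions as seen by `char_indices()`: the byte offset at which
/// each character starts.
fn byte_offsets(s: &str) -> Vec<usize> {
    s.char_indices().map(|(i, _)| i).collect()
}
```

For `"\n# 🙂\n"` the first helper returns `[0, 1, 2, 3, 4]` and the second returns `[0, 1, 2, 3, 7]`. The closing newline is character 4 but byte 7, so slicing with `1..4` ends inside the emoji, which is exactly the index reported in the panic.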

I think the simplest fix is to accumulate the byte count using char::len_utf8 and use that for start and end, although there's probably a cleaner way to write the whole block.
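A rough sketch of what that could look like, with some loud assumptions: plain `String`s stand in for `Word`, the `final_content`/`inner_content` bookkeeping is left out, and both function names are hypothetical. This is not the fix that landed in md-tui, just the idea in isolation.

```rust
/// Splits on '\n' while tracking byte offsets by summing `char::len_utf8`.
/// Assumption: `String` stands in for the real `Word` type.
fn split_lines_by_bytes(content: &str) -> Vec<String> {
    let mut lines = Vec::new();
    let mut start = 0; // byte offset where the current line begins
    let mut offset = 0; // byte offset of the current character
    for c in content.chars() {
        if c == '\n' {
            lines.push(content[start..offset].to_string());
            start = offset + c.len_utf8();
        }
        offset += c.len_utf8();
    }
    // Keep any trailing text that is not terminated by a newline.
    if start < content.len() {
        lines.push(content[start..].to_string());
    }
    lines
}

/// Same behaviour, but `char_indices` supplies the byte offsets directly.
fn split_lines_char_indices(content: &str) -> Vec<String> {
    let mut lines = Vec::new();
    let mut start = 0;
    for (i, c) in content.char_indices() {
        if c == '\n' {
            lines.push(content[start..i].to_string());
            start = i + 1; // '\n' is always a single byte in UTF-8
        }
    }
    if start < content.len() {
        lines.push(content[start..].to_string());
    }
    lines
}
```

The second variant is probably the cleaner of the two, since there is no separate counter to keep in sync. Both check for trailing text after the loop instead of comparing an index against `len() - 1` inside it.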

Also I'm not sure if there are other instances of similar mistakes within the codebase. Maybe it's worth double checking.

Versions

0.7.3 (from the Arch Linux repository) and 0.7.4 (from crates.io)

Hi, thanks for the issue. I came across this yesterday myself. I naively thought tree-sitter would give back indexes on char boundaries but, as you found out as well, it doesn't. It should not be an issue elsewhere, as I don't do many (any?) operations on single chars.

It's fixed. I want to fix the list alignment issue before I push out a new version.