Unicode-normalised filenames not matched by non-normalised links

Question

cormacrelf opened this issue 4 years ago · 1 comments

In a nutshell

There's a second layer to the Unicode file name problem. Many filesystems and programs perform Unicode normalisation on file names, and the form in use can change when people open a zettelkasten on a different computer. This matters to Neuron because the [[links]] people write in markdown files (which are not normalised) have to be reconciled with file names. Both file names and zettel links should be treated as having an unknown Unicode normalisation form, so links can break in a number of ways.

What should Neuron do?

Add a call to a Unicode normalisation library like unicode-transforms to normalise to NFC when creating a Zettel ID, whether from a file name or from a [[link]]. This makes linking independent of whatever filesystem you're running on. It is probably also worth documenting that Zettel IDs are NFC-normalised.
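
As a minimal sketch of that proposal, here is the idea in Python rather than Neuron's Haskell, with the standard library's `unicodedata` standing in for unicode-transforms. The helper names `normalise_id`, `zettel_id_from_filename` and `zettel_id_from_link` are hypothetical and are not Neuron's API; the point is only that both sources of an ID pass through the same NFC step before they are compared.

```python
import unicodedata
from pathlib import PurePosixPath


def normalise_id(raw: str) -> str:
    """Normalise a raw Zettel ID to NFC so comparisons ignore the source's form."""
    return unicodedata.normalize("NFC", raw)


def zettel_id_from_filename(filename: str) -> str:
    """Hypothetical helper: drop the extension, then normalise."""
    return normalise_id(PurePosixPath(filename).stem)


def zettel_id_from_link(link: str) -> str:
    """Hypothetical helper: drop the [[ ]] wrapper if present, then normalise."""
    if link.startswith("[[") and link.endswith("]]"):
        link = link[2:-2]
    return normalise_id(link)


# File name stored in NFD (decomposed jamo), link typed in NFC (precomposed).
nfd_filename = "\u110b\u1175\u1107\u116e\u11ab.md"  # 이분.md, decomposed
nfc_link = "[[\uc774\ubd84]]"                        # [[이분]], precomposed

print(zettel_id_from_filename(nfd_filename) == zettel_id_from_link(nfc_link))  # True
```

Without the `normalise_id` call the two IDs would compare unequal, which is the broken-link case described in this issue.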

Context

Here's the lay of the land as I understand it, for file names:

- On Linux, most filesystems (ext4, XFS, Btrfs) treat file names as opaque byte strings and do not normalise them; names usually end up in NFC (canonical decompose + canonical recompose) only because that is what most input methods and tools produce.
- On Windows I think they just used UTF-16 and called it a day.
- On macOS, HFS+ used to use NFD (canonical decompose only), but now it's more complicated: many apps (e.g. TextEdit) will normalise in some way, but APFS will not normalise anything itself and can happily hold NFD or NFC text, or really most byte strings.
  - nvim filename.txt on the command line does not normalise.
  - The system 'Save File' dialogue DOES normalise to NFD.
  - If you check files into Git on APFS, it normalises all file names to NFC for better compatibility. It probably only changes them when you check out again, though!

However, for file contents, no text editors perform any Unicode normalisation, nor should they; they store whatever you type. System input methods do not normalise either. This is all correct and good; normalisation is only needed for "set semantics", where having multiple encodings of é is really confusing, e.g. you get two files in a folder both apparently called é.txt but using different Unicode representations of é.
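
To make the "set semantics" point concrete, here is a small Python illustration (standard library only; the file names are just examples). Two strings that both display as é.txt count as two distinct set members until they are normalised to the same form:

```python
import unicodedata

precomposed = "\u00e9.txt"  # é as a single code point (NFC)
decomposed = "e\u0301.txt"  # e followed by a combining acute accent (NFD)

# Both render as é.txt, yet they are different strings.
names = {precomposed, decomposed}
print(len(names))  # 2

# After NFC normalisation the set collapses to a single entry.
normalised = {unicodedata.normalize("NFC", name) for name in names}
print(len(normalised))  # 1
```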

Basically, Neuron Zettel IDs have these "set semantics". So they need to be normalised. The best form to use would be NFC, because:

- Git did it and they're probably right.
- HFS+, the only filesystem that enforces NFD, is basically dead, and NFC doesn't even break on it: if you're serving /blah.html, HFS+ normalises the path when you ask for it.
- It has fewer display issues in dumb environments like terminals that don't recompose.

Repro/example

You can repro this example by saving Korean file names on a filesystem which can store file names in NFD. I'm using a Mac with APFS, and saving the file with TextEdit.

$ ls
index.md
이분.md
neuron.dhall

That filename without extension is the byte sequence e1 84 8b e1 85 b5 e1 84 87 e1 85 ae e1 86 ab, produced by NFD normalisation. To be sure, run ls > file.txt and inspect its bytes with a hex editor. Here's index.md:

Link to the other zettel using text typed on the keyboard [[이분]]
This link target happens to be in NFC form, byte sequence `ec 9d b4 eb b6 84`
But user input may not be in NFC form. It's purely coincidental.
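
The two byte sequences quoted in this example can be checked without a hex editor. This is a small Python sketch using only the standard library; the variable names are illustrative:

```python
import unicodedata

typed = "\uc774\ubd84"  # 이분 as precomposed syllables, as typed in the link

nfc = unicodedata.normalize("NFC", typed).encode("utf-8")
nfd = unicodedata.normalize("NFD", typed).encode("utf-8")

print(nfc.hex(" "))  # ec 9d b4 eb b6 84
print(nfd.hex(" "))  # e1 84 8b e1 85 b5 e1 84 87 e1 85 ae e1 86 ab
print(nfc == nfd)    # False: a byte-wise comparison sees two different names
```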

Basically:

Viewing the generated impulse.html in a hex editor shows the problem:

If you have a Mac, you can download the Apfelstrudel utility to get a sense of how differently a filename is stored in NFD compared with how it is stored when you type it again in a [[link]].

On APFS, because normalisation is not enforced anywhere, you can also attempt to "fix" the zettels by running:

brew install convmv
convmv -r -f utf8 -t utf8 --nfc --notest .
# and break it again
convmv -r -f utf8 -t utf8 --nfd --notest .

Note, again, that this only works because index.md happened to contain an NFC-normalised link; that will not always be the case.
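
Renaming files cannot help when the link text itself is not in NFC. As a rough sketch of how such links could be spotted, the following Python function (the name `non_nfc_links` and the simplified link pattern are assumptions for illustration, not part of Neuron) lists the [[link]] targets in a piece of markdown that are not already NFC:

```python
import re
import unicodedata
from typing import List

# Simplified pattern: text between [[ and ]] that contains no brackets.
LINK_RE = re.compile(r"\[\[([^\[\]]+)\]\]")


def non_nfc_links(markdown: str) -> List[str]:
    """Return the [[link]] targets in `markdown` that are not already in NFC."""
    return [
        target
        for target in LINK_RE.findall(markdown)
        if unicodedata.normalize("NFC", target) != target
    ]


text = "NFC link [[\uc774\ubd84]] and NFD link [[\u110b\u1175\u1107\u116e\u11ab]]"
print(len(non_nfc_links(text)))  # 1
```

A check like this only reports the problem; normalising IDs inside Neuron, as proposed above, would make it unnecessary.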

Answer 1 · 2021-05-06T21:50:41.000Z

Awesome, thanks for the explanation!