This project explores the ins and outs of Unicode characters in filenames in development workflows.
Clone the project into a directory with a Unicode character in its name to start the experiment:
git clone https://github.com/meshula/m-n.git mün
On MSVC, compile and run the test program:
cl héllòmüñ.c
héllòmüñ.exe
Various results and diagnostics ensue.
From the following output, we can conclude that only this declaration
unsigned char* umlaut_a = "ü"; is utf8
can be converted by Windows to UTF-16 in a form that the Windows SDK understands.
C string literals declared with any adornments (such as the L, u, or U prefixes) are not compatible with the Windows SDK.
Only umlaut_a is convertible by MultiByteToWideChar such that Windows SDK functions succeed.
The same holds for wide-character functions such as wcslen: they expect UTF-16 encoded strings as input.
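As a portable illustration of why length functions disagree on the same text, here is a minimal sketch that counts code points in a UTF-8 string. The names utf8_code_points and hello_utf8 are hypothetical and not part of héllòmüñ.c; the string is spelled as explicit byte escapes so the result does not depend on how the compiler reads the source file. The sketch assumes the input is valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* "héllòmüñ" written as explicit UTF-8 bytes (hypothetical test string). */
const char hello_utf8[] = "h\xC3\xA9ll\xC3\xB2m\xC3\xBC\xC3\xB1";

/* Count code points in a NUL-terminated UTF-8 string by skipping
   continuation bytes, which always have the bit pattern 10xxxxxx.
   Assumes the input is valid UTF-8; this is not a validator. */
size_t utf8_code_points(const char *s)
{
    size_t count = 0;
    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            ++count; /* lead byte or ASCII byte starts a new code point */
        }
    }
    return count;
}
```

For hello_utf8, the byte count (sizeof hello_utf8 - 1) is 12, while utf8_code_points(hello_utf8) returns 8. Every character in this string fits in a single UTF-16 code unit, so 8 is also the length a correctly encoded UTF-16 copy would report through wcslen on a platform where wchar_t is 2 bytes.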
+--------------------------------------------------+
| Héllø Mün! |
+--------------------------------------------------+
+--------------------------------------------------+
| Size of wchar_t |
+--------------------------------------------------+
size of wchar_t is 2
+--------------------------------------------------+
| C umlaut U encodings |
+--------------------------------------------------+
wchar_t* umlaut_u = L"ü"; is utf8
c3 00 bc 00 00 00 wcslen is 2
unsigned char* umlaut_a = "ü"; is utf8
c3 bc 00 00 wcslen is 8
wchar_t* umlaut_w = u"ü"; is utf8
c3 00 bc 00 00 00 wcslen is 8
unsigned int* umlaut_W = U"ü"; is utf8
wchar_t* umlaut_uu = L"\uc3bc"; is utf8
c3 00 00 00 bc 00 00 00 00 00 00 00 wcslen is 8
+--------------------------------------------------+
| check wcslen(héllòmüñ), should be 8 |
+--------------------------------------------------+
utf16 a required size is 9
hello_u: 12
hello_a: 6
hello_w: 12
hello_W: 1
hello_uu: 8
hello_utf16: 8
+--------------------------------------------------+
| Check uft16 conversions |
+--------------------------------------------------+
utf16 a required size is 11
L"héllòmüñ.c": 68 C3 A9 6C 6C C3 B2 6D C3 BC C3 B1 2E 63
as U16: 68
"héllòmüñ.c": 68 C3 A9 6C 6C C3 B2 6D C3 BC C3 B1 2E 63
as U16: 68 E9 6C 6C F2 6D FC F1 2E 63
+--------------------------------------------------+
| GetFileAttributesW/A |
+--------------------------------------------------+
GetFileAttributesW(u8"héllòmüñ.c") failed: 2
GetFileAttributesA(u8"héllòmüñ.c") failed: 2
GetFileAttributesW(u16"héllòmüñ.c") succeeded
+--------------------------------------------------+
| FindFirstFileW/A |
+--------------------------------------------------+
FindFirstFileA("h*.c") succeeded h�ll�m��.c
FindFirstFileW(L"h*.c") succeeded h�ll�m��.c
GetFileAttributesW on found path succeeded
FindFirstFileW(utf16 converted) succeeded h�ll�m��.c
GetFileAttributesW on found path succeeded
+--------------------------------------------------+
| Finished |
+--------------------------------------------------+
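The "as U16" line for the unadorned literal in the output above can be reproduced without the Windows SDK. The following is a minimal sketch of a UTF-8 to UTF-16 decoder in portable C; it is not the code used in héllòmüñ.c, which relies on MultiByteToWideChar, and the function name utf8_to_utf16 and its error convention are assumptions made for illustration. It rejects structurally malformed input but does not check for overlong encodings or encoded surrogates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode a NUL-terminated UTF-8 string into UTF-16 code units.
   Writes a terminating 0 and returns the number of code units written,
   excluding the terminator. Returns (size_t)-1 on malformed input or
   if dst_cap (in code units, terminator included) is too small. */
size_t utf8_to_utf16(const char *src, uint16_t *dst, size_t dst_cap)
{
    const unsigned char *p = (const unsigned char *)src;
    size_t n = 0;
    assert(src != NULL);
    while (*p != 0) {
        uint32_t cp;
        int extra; /* number of continuation bytes expected */
        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else return (size_t)-1;       /* stray continuation or bad lead byte */
        ++p;
        while (extra-- > 0) {
            if ((*p & 0xC0) != 0x80) return (size_t)-1; /* truncated sequence */
            cp = (cp << 6) | (uint32_t)(*p & 0x3F);
            ++p;
        }
        if (cp >= 0x10000) {
            /* Outside the BMP: emit a surrogate pair. */
            if (n + 2 >= dst_cap) return (size_t)-1;
            cp -= 0x10000;
            dst[n++] = (uint16_t)(0xD800 | (cp >> 10));
            dst[n++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        } else {
            if (n + 1 >= dst_cap) return (size_t)-1;
            dst[n++] = (uint16_t)cp;
        }
    }
    if (dst_cap == 0) return (size_t)-1; /* no room for the terminator */
    dst[n] = 0;
    return n;
}
```

Feeding it the bytes 68 C3 A9 6C 6C C3 B2 6D C3 BC C3 B1 2E 63 yields the ten code units 68 E9 6C 6C F2 6D FC F1 2E 63, plus the terminator, which agrees with the required size of 11 reported above.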
Code for macOS and Linux will be added once the Windows side is fully explored.
- 🦋 all compilers can compile the c file, the embedded utf8 string prints correctly
- 🦋 msvc debugger displays utf8 correctly
- 🐛 msvc debugger displays utf16 incorrectly
- 🐛 msvc compile/link messages, compiler error windows: unicode not displayed consistently or properly
- 🦋 Unicode cmake project name is fine
- 🐛 CMake can't use unicode for target names
CMake Error at CMakeLists.txt:3 (add_executable):
The target name "héllòmün" is reserved or not valid for certain CMake
features, such as generator expressions, and may result in undefined
behavior.
- 🦋 unicode filename for c file is fine in cmake
- 🐛 boost build fails when the installation target is a path with a unicode character; every copy fails with a message showing a corrupt path. In the following example, I tried to build to a directory named üsd-install.
copy /b "C:\�sd-install\src\boost_1_78_0\boost\vmd\detail\recurse\data_equal\data_equal_9.hpp" + this-file-does-not-exist-A698EE7806899E69 "c:\�sd-install\include\boost-1_78\boost\vmd\detail\recurse\data_equal\data_equal_9.hpp"
...failed common.copy c:\�sd-install\include\boost-1_78\boost\vmd\detail\recurse\data_equal\data_equal_9.hpp...
common.copy c:\�sd-install\include\boost-1_78\boost\vmd\detail\recurse\data_equal\data_equal_headers.hpp
The system cannot find the path specified.
- 🐛 bjam creates a corrupt directory name during the installation.
ⁿsd-install
- 🐛 Can't add unicode filename to git on mac
git add "h\303\251ll\303\262m\303\274\303\261.c"
fatal: pathspec 'h\303\251ll\303\262m\303\274\303\261.c' did not match any files
solution:
git config --global core.precomposeunicode true
git add héllòmüñ.c
git status
Changes to be committed:
...
new file: "h\303\251ll\303\262m\303\274\303\261.c"
- 🐛 Git branches with characters in their names that are illegal in filenames on Windows lead to problems, because branch names are used as filenames in the .git directory (git-for-windows/git#2904). It's possible to create and push such a branch on macOS and cause errors on Windows. (Thanks @Simran)
- 🐛 github create mün - repo created as m-n
- 🦋 héllòmüñ.c appears on github as héllòmüñ.c
- 🦋 modify héllòmüñ.c locally and push
- 🦋 clone on mac
- 🦋 clone on mac without core.precomposeunicode true flag
- 🦋 clone on windows
- 🦋 touch file on windows, commit, push
- 🦋 touch file on linux, commit, push
- 🦋 add héllòmüñ.c to a changelist
- 💻 Some Utf8 -> 16 code that's easy to read: https://gist.github.com/tommai78101/3631ed1f136b78238e85582f08bdc618
- 📚 The Absolute Minimum https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
- 📚 UTF8 everywhere https://utf8everywhere.org
- 📚 OpenAssetIO string encoding https://github.com/OpenAssetIO/OpenAssetIO/blob/main/doc/decisions/DR005-String-encoding.md
- 📚 Fixing Unix/Linux/POSIX Filenames: Control Characters (such as Newline), Leading Dashes, and Other Problems, David A. Wheeler, 15 Nov 2020 https://dwheeler.com/essays/fixing-unix-linux-filenames.html
- 💻 a yikes hack to bridge utf16 and utf8, mostly intended to help with Microsoft wchar APIs: https://simonsapin.github.io/wtf-8/