A set of naive tools to convert between broken UTF-16 and WTF-8. See https://en.wikipedia.org/wiki/UTF-8#WTF-8
The only purpose of these tools is to convert to and from broken UTF-16 (that is, with unpaired surrogates), which Windows seems to happily generate.
Basically, all they do is happily read or write unpaired surrogate halves.
wtf162wtf8 reads UTF-16 code units and tries to assemble code points from them. If that succeeds, it writes the code point as UTF-8. If it does not, i.e. if the unit is a high or low surrogate without its other half, it writes the surrogate half as UTF-8 (which makes the output WTF-8). The result is WTF-8, and it is even valid UTF-8 if the input is valid UTF-16.
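The decoding rule described above can be sketched in Python as follows. This is only an illustration of the idea, not the tools' actual source: the function names `encode_generalized_utf8` and `utf16_units_to_wtf8` are hypothetical, and the input is assumed to be already split into 16-bit code units.

```python
def encode_generalized_utf8(cp):
    """Encode any code point with the UTF-8 bit layout, surrogate halves included."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def utf16_units_to_wtf8(units):
    """Hypothetical helper: turn a list of UTF-16 code units into WTF-8 bytes."""
    out = bytearray()
    i = 0
    while i < len(units):
        cp = units[i]
        i += 1
        # A high surrogate directly followed by a low surrogate is one code point.
        if (0xD800 <= cp <= 0xDBFF and i < len(units)
                and 0xDC00 <= units[i] <= 0xDFFF):
            cp = 0x10000 + ((cp - 0xD800) << 10) + (units[i] - 0xDC00)
            i += 1
        # Everything else, lone surrogate halves included, is encoded as is.
        out += encode_generalized_utf8(cp)
    return bytes(out)
```

For instance, the single unpaired unit 0xD800 comes out as the three bytes ED A0 80, which strict UTF-8 forbids but WTF-8 allows.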
wtf82utf16 does the reverse conversion: given WTF-8 input, it reconstructs the possibly broken UTF-16 data. All it actually does is write every code point below 0x10000 as a plain UTF-16 unit, even surrogate halves.
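A minimal Python sketch of this reverse direction, under the same caveats: the names `decode_generalized_utf8` and `wtf8_to_utf16_units` are hypothetical, and the handling of code points at or above 0x10000 (splitting them into a surrogate pair, as standard UTF-16 does) is an assumption about what the tool does with them.

```python
def decode_generalized_utf8(data):
    """Decode bytes using the UTF-8 bit layout, keeping surrogate halves."""
    code_points = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            cp, extra = lead, 0
        elif 0xC0 <= lead < 0xE0:
            cp, extra = lead & 0x1F, 1
        elif 0xE0 <= lead < 0xF0:
            cp, extra = lead & 0x0F, 2
        elif 0xF0 <= lead < 0xF8:
            cp, extra = lead & 0x07, 3
        else:
            raise ValueError(f"unexpected byte 0x{lead:02x} at offset {i}")
        tail = data[i + 1:i + 1 + extra]
        if len(tail) != extra or any((b & 0xC0) != 0x80 for b in tail):
            raise ValueError(f"bad continuation bytes at offset {i}")
        for b in tail:
            cp = (cp << 6) | (b & 0x3F)
        if cp > 0x10FFFF:
            raise ValueError(f"code point out of range at offset {i}")
        code_points.append(cp)
        i += 1 + extra
    return code_points


def wtf8_to_utf16_units(data):
    """Hypothetical helper: turn WTF-8 bytes into a list of UTF-16 code units."""
    units = []
    for cp in decode_generalized_utf8(data):
        if cp < 0x10000:
            # Written as is, whether it is a surrogate half or not.
            units.append(cp)
        else:
            # Assumption: larger code points become a regular surrogate pair.
            cp -= 0x10000
            units.append(0xD800 | (cp >> 10))
            units.append(0xDC00 | (cp & 0x3FF))
    return units
```

Note that the decoder does no surrogate check at all, which is exactly what lets broken UTF-16 survive a round trip.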
As a proof of concept, there is also support for broken UTF-32. Just like WTF-8 and broken UTF-16, it allows reserved code points to appear and happily encodes and decodes them. Only WTF-8/UTF-32 pairs are provided, but they can be piped together to convert directly between UTF-16 and UTF-32, using e.g. wtf162wtf8 < input | wtf82utf32 > output.
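To illustrate what "broken UTF-32" means here, the following Python sketch converts WTF-8 to UTF-32 in the machine's native byte order. It deliberately swaps in Python's built-in `surrogatepass` error handler instead of reimplementing the tools' logic, and the function name `wtf8_to_utf32` is hypothetical.

```python
import sys


def wtf8_to_utf32(data):
    """Hypothetical helper: WTF-8 bytes to native-endian UTF-32, surrogates kept."""
    # "surrogatepass" lets the codecs accept and emit surrogate code points.
    text = data.decode("utf-8", "surrogatepass")
    codec = "utf-32-le" if sys.byteorder == "little" else "utf-32-be"
    return text.encode(codec, "surrogatepass")
```

Each code point, reserved or not, ends up as one 32-bit unit, so a lone 0xD800 is simply stored as the integer 0xD800.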
These tools are naive, and don't actually do anything about endianness. As a result, on a big-endian machine they read and write UTF-16BE, and on a little-endian machine (far more common) they read and write UTF-16LE.
As these tools are typically useful with UTF-16LE, and most machines are little-endian, this should generally work fine. Hopefully.
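The effect of ignoring endianness can be seen with a small Python sketch. The helper name `read_units` is hypothetical; the `=`, `<` and `>` prefixes are the standard `struct` format prefixes for native, little-endian and big-endian byte order.

```python
import struct
import sys


def read_units(data, byte_order="="):
    """Hypothetical helper: unpack bytes into 16-bit units ("=" is native order)."""
    count = len(data) // 2
    return list(struct.unpack(f"{byte_order}{count}H", data[:count * 2]))


data = b"A\x00"                  # the letter "A" encoded as UTF-16LE
little = read_units(data, "<")   # read as little-endian
big = read_units(data, ">")      # the same bytes read as big-endian
native = read_units(data)        # whatever this machine uses

# On a little-endian machine the native reading matches the UTF-16LE one.
matches_le = (native == little) == (sys.byteorder == "little")
```

Reading UTF-16LE data with the wrong byte order turns 0x0041 into 0x4100, which is why input produced on Windows is best converted on a little-endian machine.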
To convert from (broken) UTF-16 to WTF-8, use wtf162wtf8 < input > output.
Similarly, to convert from WTF-8 to (broken) UTF-16, use wtf82utf16 < input > output.
You can control the verbosity through the VERBOSE environment variable: set it to a positive integer to get verbose/debugging output on stderr.