Reads UTF-8 on stdin and prints out the raw Unicode codepoints. Useful for seeing exactly what a string consists of.

utf8info

utf8info is a small utility that reads a UTF-8 stream and prints out the raw codepoint information. It's useful for spotting invisible control characters like U+202E RIGHT-TO-LEFT OVERRIDE, and for interrogating complex Zero-Width-Joiner sequences like 👨‍👩‍👧‍👦, which is composed of 7 codepoints: four emoji joined by three U+200D ZERO WIDTH JOINER characters!
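To make the idea of "raw codepoints" concrete, here is a minimal sketch of UTF-8 decoding in C++17. It is not utf8info's actual implementation: the function names decode_utf8 and format_codepoint are hypothetical, and the error handling is simplified (malformed bytes become U+FFFD, and overlong encodings are not rejected).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: decode a UTF-8 byte string into Unicode codepoints.
// Simplified: malformed input yields U+FFFD, overlong forms are not rejected.
std::vector<char32_t> decode_utf8(std::string_view bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const auto lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); ++i; continue; }  // invalid lead byte
        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            // Each continuation byte must look like 10xxxxxx.
            if (i >= bytes.size() ||
                (static_cast<unsigned char>(bytes[i]) & 0xC0) != 0x80) {
                ok = false;
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i]) & 0x3F);
            ++i;
        }
        // Reject values outside the Unicode range and UTF-16 surrogates.
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) ok = false;
        out.push_back(ok ? cp : char32_t{0xFFFD});
    }
    return out;
}

// Hypothetical helper: format a codepoint in the conventional "U+XXXX" form.
std::string format_codepoint(char32_t cp) {
    assert(cp <= 0x10FFFF);  // precondition: a valid Unicode scalar range
    char buf[16];
    std::snprintf(buf, sizeof buf, "U+%04lX", static_cast<unsigned long>(cp));
    return buf;
}
```

Passing the UTF-8 bytes of the family emoji above to decode_utf8 yields seven codepoints: U+1F468, U+200D, U+1F469, U+200D, U+1F467, U+200D, U+1F466. Looking up names such as RIGHT-TO-LEFT OVERRIDE additionally requires the Unicode Character Database, which this sketch leaves out.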

This tool supports codepoints from the latest published version of the Unicode Standard, sourcing data from the Unicode Character Database.

Building & Installing

On macOS and Linux, it should be as simple as running the following inside the utf8info directory:

make && make install

When a new version of the standard is released, you can fetch the latest UCD with make update, and then build as before.

Windows is not officially supported, but it'll likely work under WSL.

Note: Building utf8info requires curl, unzip, and a C++17-compatible compiler.

Options:

-v, --verbose       Enable verbose output. This prints the raw UTF-8 bytes next to the codepoint info.
-d, --definitions   Display definitions for CJK Unified Ideographs.
-a, --all           List all known codepoints and exit.
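
The --verbose option pairs each codepoint with its raw UTF-8 bytes. As a rough illustration of how a codepoint maps to those bytes, here is a minimal C++17 sketch. The names encode_utf8 and hex_bytes are hypothetical, and the space-separated hex layout is an assumption, not utf8info's actual output format.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <string_view>

// Hypothetical helper: encode one Unicode scalar value as UTF-8 bytes.
std::string encode_utf8(char32_t cp) {
    // Precondition: within the Unicode range and not a UTF-16 surrogate.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {                     // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {             // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {           // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                             // 4 bytes: 11110xxx then three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: render bytes as space-separated uppercase hex pairs.
std::string hex_bytes(std::string_view bytes) {
    std::string out;
    for (const char c : bytes) {
        char buf[4];
        std::snprintf(buf, sizeof buf, "%02X", static_cast<unsigned char>(c));
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}
```

With these helpers, hex_bytes(encode_utf8(0x202E)) gives "E2 80 AE", the three bytes that make up U+202E RIGHT-TO-LEFT OVERRIDE in a UTF-8 stream.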