This crate provides a function to compute the number of columns occupied by a single Unicode grapheme. Its distinguishing features are that it always returns the correct display width, stays backwards compatible with older Unicode versions, and is lightweight.
There are already two other options in the Rust ecosystem that provide similar functionality:
- The `unicode-width` crate
- The `grapheme_column_width` function of the `termwiz` crate
Both options have drawbacks.
The `unicode-width` crate currently doesn't account for the emoji presentation/width changes introduced in Unicode 14 that are caused by the emoji variation selectors VS15 and VS16. For example, the following emojis have text presentation by default, but because they are followed by VS16 they are switched to emoji presentation and become double width: ✔️ 🖋️. This different width definition is not supported by all terminal emulators yet. This crate works around that by allowing applications/users to configure the Unicode support level manually.
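To make the VS16 case concrete, the sketch below uses only the Rust standard library to show that such an emoji is a sequence of two code points: a base character followed by U+FE0F (VS16). The helper `variation_selector` is a hypothetical name chosen for illustration; it is not part of this crate, `unicode-width`, or `termwiz`.

```rust
/// Variation selector 15: requests text presentation.
const VS15: char = '\u{FE0E}';
/// Variation selector 16: requests emoji presentation.
const VS16: char = '\u{FE0F}';

/// Hypothetical helper: returns the variation selector that directly
/// follows the first code point of `grapheme`, if there is one.
fn variation_selector(grapheme: &str) -> Option<char> {
    let mut chars = grapheme.chars();
    chars.next()?; // skip the base character (returns None for "")
    match chars.next() {
        Some(c) if c == VS15 || c == VS16 => Some(c),
        _ => None,
    }
}
```

For instance, `variation_selector("\u{2714}\u{FE0F}")` (the ✔️ sequence above) yields `Some('\u{FE0F}')`, while the bare `"\u{2714}"` yields `None`.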
To handle the cases described above, the `termwiz` crate (part of the `wezterm` emulator) switched to a custom grapheme width calculation based on `widecharwidth`. To account for the Unicode 14 presentation changes, handling for emoji variations was also added. However, the `termwiz` crate is a very heavy dependency. Not only does it contain a LOT of functionality itself, it also has a large number of (transitive) dependencies. Furthermore, while inspecting the width calculation in that crate I noticed some inefficiencies:
- Emoji presentation is queried from a separate `ucd-trie` despite the fact that `widecharwidth` already displays all `Emoji_Presentation` emojis as double width.
- A perfect HashMap is used for looking up emoji variations, which is likely slower than a `ucd-trie`. More importantly, this introduces an extra dependency.
- For characters outside the first UTF-16 plane it falls back to multiple binary searches of uncompressed tables.
Compared to that, `unicode-width` is very lightweight: it has no extra dependencies, and the width calculation compiles to an O(1) lookup in a compressed three-level table (somewhat similar to `ucd-trie`).
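The idea of a compressed multi-level table can be sketched as follows. This is a toy illustration under stated assumptions, not the actual layout or data of `unicode-width`, `ucd-trie`, or this crate: the bit split (8 + 7 + 6 bits), the `WidthTable` type, and the `toy_width` function (which only marks three well-known wide ranges) are all hypothetical. Identical blocks are stored once, which is where the compression comes from, and a lookup is always exactly three array accesses.

```rust
use std::collections::HashMap;

/// Toy width data: a few wide ranges, everything else counts as 1.
fn toy_width(cp: u32) -> u8 {
    match cp {
        0x1100..=0x115F => 2,   // Hangul Jamo leading consonants
        0x4E00..=0x9FFF => 2,   // CJK Unified Ideographs
        0x1F600..=0x1F64F => 2, // Emoticons block
        _ => 1,
    }
}

/// Hypothetical three-level table with deduplicated blocks.
struct WidthTable {
    root: Vec<u16>,          // indexed by cp >> 13
    middle: Vec<[u16; 128]>, // indexed by (cp >> 6) & 0x7F
    leaves: Vec<[u8; 64]>,   // indexed by cp & 0x3F
}

impl WidthTable {
    /// Builds the table from any per-code-point width function.
    fn build(width_of: impl Fn(u32) -> u8) -> Self {
        let mut leaves: Vec<[u8; 64]> = Vec::new();
        let mut middle: Vec<[u16; 128]> = Vec::new();
        let mut root: Vec<u16> = Vec::new();
        let mut leaf_ids: HashMap<[u8; 64], u16> = HashMap::new();
        let mut middle_ids: HashMap<[u16; 128], u16> = HashMap::new();

        for r in 0..=(0x10FFFF_u32 >> 13) {
            let mut mid_block = [0u16; 128];
            for m in 0..128u32 {
                let mut leaf = [0u8; 64];
                for l in 0..64u32 {
                    leaf[l as usize] = width_of((r << 13) | (m << 6) | l);
                }
                // Store each distinct leaf block only once.
                mid_block[m as usize] = *leaf_ids.entry(leaf).or_insert_with(|| {
                    leaves.push(leaf);
                    (leaves.len() - 1) as u16
                });
            }
            // Store each distinct middle block only once.
            let id = *middle_ids.entry(mid_block).or_insert_with(|| {
                middle.push(mid_block);
                (middle.len() - 1) as u16
            });
            root.push(id);
        }
        WidthTable { root, middle, leaves }
    }

    /// Three array accesses, independent of the code point's value.
    fn width(&self, c: char) -> u8 {
        let cp = c as u32;
        let mid = &self.middle[self.root[(cp >> 13) as usize] as usize];
        let leaf = &self.leaves[mid[((cp >> 6) & 0x7F) as usize] as usize];
        leaf[(cp & 0x3F) as usize]
    }
}
```

A real crate would generate such a table ahead of time and embed it as static data rather than building it at runtime; the runtime `build` here only keeps the sketch self-contained.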
The goal of this crate is to combine the advantages of both. It implements the same notion of width as `termwiz` does. However, this crate generates its own compressed lookup table just like `unicode-width` (only with different content). Emoji variations are implemented using a single `ucd-trie`. As a result, this crate is very lightweight (it only depends on the tiny `ucd-trie` crate) and performant. Both crates were heavily referenced while developing `grapheme-width` and are credited here as such.
To work correctly, this crate calculates the width of each grapheme individually (just like `termwiz`). For convenience, a function that segments a string into its graphemes and sums up their widths is provided if the `segmentation` feature is enabled.
Unicode 14 is still quite new, and therefore adjusting the presentation as described above can cause compatibility problems with programs that don't support Unicode 14 yet. To allow downstream crates to retain compatibility with these programs, `grapheme-width` requires callers to specify a Unicode capability level. Ideally, this capability level should be configurable at runtime, as there is no standard way to negotiate a Unicode version.
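As a rough sketch of what a runtime-configurable capability level could look like, consider the following. Everything here is an assumption made for illustration: the `UnicodeLevel` enum, its variant names, the accepted setting strings, and `adjusted_width` are hypothetical and do not reflect the actual API of `grapheme-width`. The rule that VS16 yields width 2 and VS15 yields width 1 is a simplification of the behaviour described above.

```rust
/// Hypothetical capability level; the names are illustrative only.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum UnicodeLevel {
    /// Variation selectors do not change the width.
    Legacy,
    /// VS15/VS16 switch between narrow text and wide emoji presentation.
    Unicode14,
}

impl UnicodeLevel {
    /// Parses a user-supplied setting, e.g. read from a config file.
    fn parse(value: &str) -> Option<Self> {
        match value.trim() {
            "legacy" => Some(Self::Legacy),
            "14" | "unicode14" => Some(Self::Unicode14),
            _ => None,
        }
    }
}

/// Adjusts `base_width` (the width ignoring variation selectors)
/// according to the configured capability level.
fn adjusted_width(grapheme: &str, base_width: usize, level: UnicodeLevel) -> usize {
    if level < UnicodeLevel::Unicode14 {
        return base_width; // older programs ignore VS15/VS16
    }
    if grapheme.contains('\u{FE0F}') {
        2 // VS16: emoji presentation, double width
    } else if grapheme.contains('\u{FE0E}') {
        1 // VS15: text presentation, single width
    } else {
        base_width
    }
}
```

With this sketch, `adjusted_width("\u{2714}\u{FE0F}", 1, UnicodeLevel::Legacy)` stays at 1, while the same call with `UnicodeLevel::Unicode14` gives 2, so an application can match whatever the user's terminal actually renders.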
The MSRV required to build `grapheme-width` is 1.65.
The MSRV is increased conservatively and only when necessary (rarely).
It will never exceed the MSRV required by Firefox, so that the crate remains compatible with a wide variety of Linux distros.
Note that the MSRV policy does not apply to the `xtask` build script or `dev-dependencies`, as these are only used during development and don't affect downstream crates.