/wcwidth.awk

An implementation of wcwidth / wcswidth in pure AWK

Primary LanguageAwkMIT LicenseMIT

wcwidth.awk

The standard "length" function in AWK counts the number of characters in a string which does not necessarily correspond to the visual width of the string. This AWK library provides functions that can be used to determine the width of UTF-8 characters even on interpreters that are not multi-byte safe. In addition to reimplementations of the POSIX functions wcwidth(3) and wcswidth(3), this library provides "wcscolumns", a function with graceful degradation in the presence of characters that would cause the POSIX functions to return -1; "wcstruncate", a function for truncating a string to a specific visual column; and "wcsexpand", a function that expands tabs to spaces in a wide character-aware manner.

The library is written so as to be portable across AWK interpreters; if the interpreter does not have native support for multi-byte characters, the library will fall back to using its own UTF-8 logic. This library has been successfully tested with these AWK implementations:

The width data used by the preexisting "wcwidth.awk" comes from version 2.24 of the GNU C Library, but it can be rebuilt to tailor to other systems using instructions in the Development section at the end of this document.

Usage and Functions

The entirety of the library is contained in the "wcwidth.awk" file included with this repository. The library has no dependencies (the other files in this repository are for development purposes), so installation consists of copying "wcwidth.awk" somewhere convenient. To use the functions in the library, source the file using AWK's "-f" option or any equivalent construct like GNU Awk's "@include". The library only uses "exit" if the library is incompatible with the interpreter. In that case, "WCWIDTH_INITIALIZED" is set to -1 so failed initialization can be detected with WCWIDTH_INITIALIZED < 0 in an "END" block. The special AWK variables "RSTART" and "RLENGTH" will always have the same values after calling one of these functions as they did prior to the invocation.

wcscolumns(string)

Determine the number of columns needed to display a string. This function differs from the "wcswidth" function in its handling of non-printable characters; instead of making the function abort and immediately return -1, non-printable ASCII characters are ignored while all others are treated as as having a width of 1 because they will typically be rendered as a single-column ".notdef" glyph.

Arguments:

  • string: A string of any length. In AWK interpreters that are not multi-byte safe, this argument is interpreted as a UTF-8 encoded string.

Returns: The number of columns needed to display the string. This value will always be greater than or equal to 0.

Example:

$ cat example.awk
{
    printf "wcscolumns(\"%s\") → %s\n", $0, wcscolumns($0)
}
$ echo "A宽BデC🦀D" | awk -f wcwidth.awk -f example.awk
wcscolumns("A宽BデC🦀D") → 10

wcsexpand(string, tab_stop)

Expand tabs to spaces in a wide character-aware manner. Calculations done by this function assume the first character of the string is the first character of the line or the first character following a tab.

Arguments:

  • string: The string to expand.
  • tab_stop: The maximum width of tabs. This must be an integer greater than zero.

Returns: A string with all tabs replaced with spaces.

Example:

$ cat example.awk
{
    printf "wcsexpand(\"%s\", 8) → %s\n", $0, wcsexpand($0, 8)
}
$ printf "Rat\t鼠\t2020\n" | awk -f wcwidth.awk -f example.awk
wcsexpand("Rat	鼠	2020", 8) → Rat     鼠      2020

wcstruncate(string, columns)

Truncate a string so that it spans a limited number of columns.

Arguments:

  • string: A string of any length. In AWK interpreters that are not multi-byte safe, this argument is interpreted as a UTF-8 encoded string.
  • columns: Maximum number of columns the resulting text may span.

Returns: Input string truncated as needed.

Example:

$ cat example.awk
{
    printf "wcstruncate(\"%s\", %d) → %s\n", $1, $2, wcstruncate($1, $2)
}
$ printf "AエBビCシ 5\nxYzPqR 10\n" | awk -f wcwidth.awk -f example.awk
wcstruncate("AエBビCシ", 5) → AエB
wcstruncate("xYzPqR", 10) → xYzPqR

wcswidth(string)

A reimplementation of the POSIX function of the same name to determine the number of columns needed to display a string.

Arguments:

  • string: A string of any length. In AWK interpreters that are not multi-byte safe, this argument is interpreted as a UTF-8 encoded string.

Returns: The number of columns needed to display the string is returned if all of character are printable and -1 if any are not.

Example:

$ cat example.awk
{
    printf "wcswidth(\"%s\") → %s\n", $0, wcswidth($0)
}
$ printf "津波\n概要\t20世紀\n" | awk -f wcwidth.awk -f example.awk
wcswidth("津波") → 4
wcswidth("概要	20世紀") → -1

wcwidth(character)

A reimplementation of the POSIX function of the same name to determine the number of columns needed to display a single character.

Arguments:

  • character: A single character. In AWK interpreters that are not multi-byte safe, this argument may consist of multiple characters that together represent a single UTF-8 encoded code point.

Returns: The number of columns needed to display the character if it is printable and -1 if it is not. If the argument does not contain exactly one UTF-8 character, -1 is returned.

Example:

$ cat example.awk
{
    printf "wcwidth(\"%s\") → %s\n", $0, wcwidth($0)
}
$ printf "X\n宽\n:)\n" | awk -f wcwidth.awk -f example.awk
wcwidth("X") → 1
wcwidth("宽") → 2
wcwidth(":)") → -1

Development

Unicode data is generated using the code in "generate-width-data.c" and written to "width-data". This data is inserted into "template.awk" on the line consisting of a comment that contains "[WIDTH DATA]", and the resulting file is written to "wcwidth.awk".

Makefile Targets:

  • all / test: Verify that the "wcwidth" function works as expected. If "wcwidth.awk" does not exist, it will be created automatically. This target is the default target.
  • width-data: Enumerate all Unicode code points and write the information to a file named "width-data". The existing file provided with this repository was generated using GNU libc 2.24. If the values returned by "wcwidth.awk" do not seem to match the behavior of the host's wcwidth(3) implementation, delete "width-data" and run make to regenerate a properly tailored file.
  • clean: Delete the binary used to generate the "width-data" file.
  • wcwidth.awk: Generate the "wcwidth.awk" file.