ericpruitt/wcwidth.awk

Treat TABs as printable

Closed this issue · 11 comments

xrat commented

In the context of Awk I guess handling non-printable characters is hard (cf. https://unix.stackexchange.com/a/245021/13746) but honoring TABs might be helpful.

$ cat wcscolumns.awk 
{ printf "%d\n", wcscolumns($0) }
$ awk -f wcwidth.awk -f wcscolumns.awk <<< $'My sign is\t鼠鼠'
14
$ echo $'123456789012345678901234567890\nMy sign is\t鼠鼠'
123456789012345678901234567890
My sign is	鼠鼠
$ wc -L <<< $'My sign is\t鼠鼠'
20

(The above example is from Bash. Here at GitHub it's not shown exactly as on the terminal, at least not in all views. In my terminal the right margin of 鼠鼠 indeed is about 19.7 characters.)
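
The gap between the two numbers can be reproduced by hand. The 14 reported by wcscolumns() is consistent with the tab contributing no columns (10 for "My sign is" plus 4 for the two double-width characters), while wc -L places tab stops at every 8th column, so the tab advances from column 10 to column 16 before the 4 columns are added. A minimal sketch of that arithmetic, assuming 8-column tab stops; the next_tab_stop helper is hypothetical and not part of wcwidth.awk:

```shell
# Hypothetical helper: the column reached after a tab typed at column $1,
# with tab stops every $2 columns (default 8).
next_tab_stop() {
    awk -v col="$1" -v stop="${2:-8}" 'BEGIN {
        print col + stop - col % stop
    }'
}

prefix=10   # columns occupied by "My sign is"
wide=4      # two double-width characters
after_tab=$(next_tab_stop "$prefix")

echo "with tab stops: $((after_tab + wide))"   # 16 + 4 = 20
echo "tab ignored:    $((prefix + wide))"      # 10 + 4 = 14
```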

ericpruitt commented

I don't think there's a good way to handle tabs that doesn't add a lot of complexity for little value. One problem is differing tab widths: I use a tab width of 4 in Vim for my own code, but in non-editor applications like Less I generally use 8, since I think that's historically the most common value. Even if the tab width is fixed, a tab can appear in the middle of a line that's indented, and the functions have no way of knowing whether the string passed to them had leading whitespace that was temporarily stripped. My suggestion would be to expand tabs in the string before passing it into the function since you, as the developer and library user, know your environment better than I, the library author, do. Here's some code I use to expand tabs in an unrelated AWK script:

# Replace tabs with spaces to simplify the wrapping process.
while ((i = index($0, "\t"))) {
    sub(/\t/, sprintf("%*s", (TAB_SIZE - ((i - 1) % TAB_SIZE)), ""))
}
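
To illustrate the point about stripped indentation, here is a minimal sketch that wraps the loop above in a hypothetical shell function, expand_ascii. It assumes ASCII-only input, because index() returns a character (or byte) position rather than a display column, and it builds the padding format by concatenation, which is only a choice made for this sketch:

```shell
# Hypothetical wrapper around the loop above; ASCII-only input assumed.
expand_ascii() {
    awk -v TAB_SIZE="${1:-8}" '{
        while ((i = index($0, "\t"))) {
            # Spaces needed to reach the next multiple of TAB_SIZE.
            n = TAB_SIZE - ((i - 1) % TAB_SIZE)
            sub(/\t/, sprintf("%" n "s", ""))
        }
        print
    }'
}

# The tab after "ab" is 6 columns wide at the start of a line...
printf 'ab\tc\n' | expand_ascii 8 | tr ' ' '.'       # ab......c
# ...but only 2 columns wide behind 4 columns of indentation.
printf '    ab\tc\n' | expand_ascii 8 | tr ' ' '.'   # ....ab..c
```

If the indentation had been stripped before expansion and restored afterwards, the two lines would no longer line up the way a terminal would show them.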

xrat commented

Oh :) I naively thought it would be easy (given that the user would need to provide a value for TAB_SIZE). In fact, I erroneously thought it would be easier than the code you showed for replacing tabs. Thanks a lot for your explanation!
Now that I better understand the problem, I still think it might be feasible to add the code for replacing TABs to wcwidth.awk, but you are of course right that it can be applied "before passing it into" wcscolumns().

ericpruitt commented

My argument against that would be that expanding tabs doesn't have anything to do with wide characters. On reflection, though, a wide character-aware expansion function would make sense, because the code I pasted above won't work as-is if the string has wide characters. I'll follow up once I've implemented this.

xrat commented

You prove yourself right. It's not trivial :-| Thanks a lot!

ericpruitt commented

Please try this function and let me know whether it expands the way you'd expect:

# Expand tabs in a string to spaces.
#
# Arguments:
# - _str: The string to expand.
# - _tab_size: The maximum width of tabs.
#
# Returns: A string with all tabs replaced with spaces.
#
function wcexpand(_str, _tab_size,    _i, _prefix_width, _expanded_width)
{
    while ((_i = index(_str, "\t"))) {
        _prefix_width = wcscolumns(substr(_str, 1, _i - 1))
        _expanded_width = _tab_size - ((_prefix_width - 1) % _tab_size)
        sub(/\t/, sprintf("%*s", _expanded_width, ""), _str)
    }

    return _str
}

Example:

$ gawk -f wcwidth.awk -f <(echo 'BEGIN { print wcexpand("x\t鼠鼠\tx", 8); exit }')
x        鼠鼠    x

xrat commented

Eric, thanks a lot, it works very well apart from, apparently, a bug: _expanded_width should be calculated as

_expanded_width = _tab_size - (_prefix_width % _tab_size)
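
The off-by-one comes from mixing two conventions: index() is 1-based, so the earlier $0-based loop needed "i - 1", whereas _prefix_width is already a count of columns before the tab. A small sketch comparing the two formulas; the tab_widths helper is hypothetical and exists only for this comparison:

```shell
# Hypothetical helper: print the tab width each formula yields for a
# prefix that is $1 columns wide, with tab stops every $2 columns.
tab_widths() {
    awk -v w="$1" -v stop="${2:-8}" 'BEGIN {
        original  = stop - ((w - 1) % stop)
        corrected = stop - (w % stop)
        printf "original=%d corrected=%d\n", original, corrected
    }'
}

tab_widths 1   # original=8 corrected=7
tab_widths 8   # original=1 corrected=8
```

The first case matches the example output earlier in the thread, where "x" is followed by 8 spaces instead of the 7 needed to reach column 8.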

IMHO, wcexpand() is a great example of how to use wcscolumns().

ericpruitt commented

Thanks, I actually caught that bug when I refactored the function later on; it currently reads:

# Expand tabs in a string to spaces.
#
# Arguments:
# - _str: The string to expand.
# - _tab_stop: The maximum width of tabs. This must be an integer greater than
#   zero.
#
# Returns: A string with all tabs replaced with spaces.
#
function wcexpand(_str, _tab_stop,    _column, _mark, _tab_index, _tab_width)
{
    _column = 0

    # An alternate implementation of this function used split(..., ..., "\t"),
    # but that approach was generally slower.
    for (_mark = 0; (_tab_index = index(_str, "\t")); _mark = _tab_index - 1) {
        _column += wcscolumns(substr(_str, _mark + 1, _tab_index - _mark - 1))
        _tab_width = _tab_stop - _column % _tab_stop
        sub(/\t/, sprintf("%*s", _tab_width, ""), _str)
    }

    return _str
}
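
The comment in the function mentions a split()-based alternative without showing it. Purely as an illustration of what such a variant could look like (this is not the author's code), here is a sketch restricted to ASCII input, where length() stands in for wcscolumns(); the names expand_split and expand are hypothetical:

```shell
# Hypothetical split()-based expansion; ASCII-only, so length() stands in
# for wcscolumns(). Not taken from wcwidth.awk.
expand_split() {
    awk -v stop="${1:-8}" '
    function expand(str, tab_stop,    n, i, parts, column, out, width) {
        n = split(str, parts, "\t")
        out = parts[1]
        column = length(parts[1])
        for (i = 2; i <= n; i++) {
            # Each field boundary is one tab; pad to the next tab stop.
            width = tab_stop - column % tab_stop
            out = out sprintf("%" width "s", "") parts[i]
            column += width + length(parts[i])
        }
        return out
    }
    { print expand($0, stop) }'
}

printf 'x\tyy\tz\n' | expand_split 8 | tr ' ' '.'   # x.......yy......z
```

Like the refactored function above, this keeps a running column count, so each piece of text is measured only once.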

I'll push the function and associated documentation and tests to the repo later today.

xrat commented

Great! I am not a programmer, but I think, personally, I'd add a check or similar for _tab_stop > 0.

ericpruitt commented

I intentionally don't do any input validation on the functions provided by my library. Since AWK doesn't support exceptions, there are three* main things an AWK script can do upon encountering an error: it can exit, but the library user may not want that; it can write an error message, but that could interfere with the user's work since they might be using standard error for something; or it can do nothing. Rather than make a decision at the library level that forces everyone to adapt to one of the first two options, I document the constraints and do nothing, leaving the responsibility of validating the input to the library user since, again, they know their environment better than I do.

EDIT: I neglected another option: a global variable akin to errno in C, but I question the value of such a thing in this context.

* A potential fourth option would be to use arrays to return multiple values, but that's not something I see done very often.
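
Since validation is left to the caller, a script that takes the tab stop from user input might guard the call itself. A minimal sketch of such a caller-side check; check_tab_stop and valid_tab_stop are hypothetical names, and exiting with status 2 is just one of the reactions listed above, chosen here by the script rather than the library:

```shell
# Hypothetical caller-side guard; the library itself validates nothing.
check_tab_stop() {
    awk -v stop="$1" '
    # Returns 1 only for a positive integer written in plain digits.
    function valid_tab_stop(n) { return (n ~ /^[0-9]+$/) && (n + 0 > 0) }
    BEGIN {
        if (!valid_tab_stop(stop)) {
            print "invalid tab stop: " stop > "/dev/stderr"
            exit 2
        }
        # A real script would go on to call the expansion function here.
        print "tab stop " stop " accepted"
    }'
}

check_tab_stop 8                                    # tab stop 8 accepted
check_tab_stop 0 || echo "rejected with status $?"  # rejected with status 2
```

The check matters most for 0, since the function computes a remainder with the tab stop as the divisor.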

ericpruitt commented

The function has been committed as 5513af6, sans tests, which I'll add later after refactoring how tests are handled. For consistency with the POSIX functions, I named the function "wcsexpand".

xrat commented

I downloaded the new version and tested it with the set of strings of yesterday. All fine. Thanks a lot!