ksh93/ksh

ksh93's printf(1) builtin somehow turns "\x20" (space character in UTF-8) into garbage

gisburn opened this issue · 10 comments

ksh93's printf(1) builtin somehow turns "\x20" (space character in UTF-8) into garbage:

Testcase:

$ (LC_ALL=en_US.UTF-8 ksh93 -c "printf 'foo\x20bar\n'")
foo₺r
$ (LC_ALL=en_US.UTF-8 bash -c "printf 'foo\x20bar\n'")
foo bar

This happens with ksh93 1.0.8 and 1.0.9, on both Debian Linux 10 and Cygwin 3.5.3.

Also reproducible on macOS, on ksh 93u+m/1.0.9 as well as ksh 93u+ 2012-08-01. Thanks for the report.

Actually, this may not be exactly a bug: b and a are also hexadecimal digits, so printf takes 20ba as one hexadecimal number, which happens to be out of range for a single byte. In an ANSI C string this would be an error, so your reproducer is not actually valid.

It works fine with protective square brackets (which are a ksh extension):

$ printf 'foo\x[20]bar\n'
foo bar
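
For what it's worth, an octal escape should also sidestep the problem, and portably, since octal escapes in the printf format string stop after at most three digits (quick sketch, not ksh-specific):

$ (LC_ALL=en_US.UTF-8 ksh93 -c "printf 'foo\040bar\n'")
foo bar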

However, I noticed that the 93v- beta changed the behaviour, and your reproducer started working 'correctly' between version 2013-06-28 and 2013-07-19. So we can consider isolating and backporting that change.

AFAIK the POSIX standard somewhere says that \x takes exactly TWO hexadecimal digits, and that's why this was fixed for ksh93v- (POSIX test suite failure).

Relevant historic entry in src/lib/libast/RELEASE:

13-07-16 string/chresc.c: limit \xH \xHH to 2 hex digits (\u and \U handle wide chars now)

\x is not in POSIX at all (see https://pubs.opengroup.org/onlinepubs/9699919799/utilities/printf.html, Extended Description).

This patch backports the relevant 93v- 2013-07-19 changes. All existing regression tests pass with the patch, and your reproducer is fixed.

Now we will need to decide whether to change the behaviour for the next 93u+m/1.0.10 bugfix release or backport the change only for the future 93u+m/1.1 major update.

I think it needs to be the latter, because the change is not backward compatible. For instance:

$ echo $'\x1f600'
😀
$ echo $'\u[1f600]'
😀

The patch breaks the \x case (although \x[1f600] continues to work).
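
For reference, the bracketed forms stay unambiguous with or without the patch (quick check in a UTF-8 locale):

$ echo $'\x[1f600]'
😀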

diff --git a/src/lib/libast/string/chresc.c b/src/lib/libast/string/chresc.c
index 31e8f5865..2bdc589f4 100644
--- a/src/lib/libast/string/chresc.c
+++ b/src/lib/libast/string/chresc.c
@@ -1,8 +1,8 @@
 /***********************************************************************
 *                                                                      *
 *               This software is part of the ast package               *
-*          Copyright (c) 1985-2012 AT&T Intellectual Property          *
-*          Copyright (c) 2020-2023 Contributors to ksh 93u+m           *
+*          Copyright (c) 1985-2013 AT&T Intellectual Property          *
+*          Copyright (c) 2020-2024 Contributors to ksh 93u+m           *
 *                      and is licensed under the                       *
 *                 Eclipse Public License, Version 2.0                  *
 *                                                                      *
@@ -147,14 +147,18 @@ chrexp(const char* s, char** p, int* m, int flags)
 				c = CC_vt;
 				break;
 			case 'u':
+				q = s + 4;
+				goto wex;
 			case 'U':
+				q = s + 8;
+			wex:
+				if (!(flags & FMT_EXP_WIDE))
+					goto noexpand;
+				w = 1;
+				goto hex;
 			case 'x':
-				if (q = c == 'u' ? (s + 4) : c == 'U' ? (s + 8) : NULL)
-				{
-					if (!(flags & FMT_EXP_WIDE))
-						goto noexpand;
-					w = 1;
-				}
+				q = s + 2;
+			hex:
 				b = e = s;
 				n = 0;
 				c = 0;
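
With the patch applied, the original reproducer should print the same thing bash does (assuming a UTF-8 locale):

$ (LC_ALL=en_US.UTF-8 ksh93 -c "printf 'foo\x20bar\n'")
foo bar
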
avih commented

\x is not in POSIX at all

Correct for the shell.

So, unless documented otherwise, and ignoring backward compatibility for now, I'd think the optimal behavior would be to follow the C string specification (which builds on a single character constant spec).

To summarize the C behavior of the escape sequence \x... in C99 "6.4.4.4 Character constants":

  • The sequence includes all characters up to the first non-hex digit (or the end of the string).
  • The sequence represents one char in a normal C string, or one wide character in a wide string.
  • If the value does not fit the target size, the behavior is implementation-defined.
  • AFAICT it is not possible to write a \x... sequence that is immediately followed by literal hex digits, even if the value is bigger than the target. In such a case only an octal escape can be used (written with all three digits if the following character is itself a valid octal digit); a shell analogue is sketched below, after the C99 quote.

To quote from C99 in 6.4.4.4:

14
EXAMPLE 3
Even if eight bits are used for objects that have type char, the
construction '\x123' specifies an integer character constant
containing only one character, since a hexadecimal escape sequence
is terminated only by a non-hexadecimal character. To specify
an integer character constant containing the two characters whose
values are '\x12' and '3', the construction '\0223' may be used,
since an octal escape sequence is terminated after three octal
digits. (The value of this two-character integer character constant
is implementation-defined.)
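
As far as I can tell the same trap exists in the shell today: 93u+m's \x is just as greedy, so the only ways to follow an escaped byte with literal hex digits are an octal escape or the bracketed ksh form (quick sketch; B and C are hex digits):

$ printf 'X\101BCY\n'
XABCY
$ printf 'X\x[41]BCY\n'
XABCY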

I don't know whether a shell string should be considered wide or normal, but I'd think that at least if the encoding is UTF-8, it should be considered normal, in which case a \x... sequence represents one char/byte, and a useful value for such a sequence would be, IMHO, the unsigned value of the whole sequence (however many hex digits it has) modulo 256 (assuming an 8-bit char).

  1. I would prefer that for single bytes we get \x<hexchar><hexchar> and \x[<hexchar><hexchar>] working, but only for single bytes; otherwise we have an endianness issue.

  2. Wide characters should be covered by \u[<hexvalue>] and \w[<hexvalue>] - the SUN i18n team and I made a patch for ksh93v- for that.
    \u[<hexvalue>] takes a Unicode codepoint value and uses |iconv()| internally to convert it to a multibyte sequence in the current locale (and evaluates to an empty string if the character cannot be represented); see the short example after this list.
    \w[<hexvalue>] is basically the same, but converts a |wchar_t| codepoint to the matching multibyte sequence without an |iconv()| conversion from UTF-32 to the current locale. This was added to satisfy PRC requirements for GB18030, so you can take a GB18030 codepoint and get the corresponding character.
    I think I can dig out the patch for \u[<hexvalue>] and \w[<hexvalue>] later this month and give it to you.
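
As a quick illustration of the codepoint idea, using the \u[...] form that 93u+m already understands (in a UTF-8 locale; U+20AC is the euro sign):

$ echo $'\u[20ac]'
€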

avih commented
  1. I would prefer that for single bytes we get \x<hexchar><hexchar> and \x[<hexchar><hexchar>] working, but only for single bytes; otherwise we have an endianness issue.

As far as I can tell that's also how other shells which support \x behave:

$ shcmp -c 'printf "X\x4040Y"'
...

= bash, bash_posix, loksh, oksh, mksh, mksh_posix, pdksh, bb_ash, nbsh:
X@40Y(exit:0)

= ksh93, ksh93u_m:
X䁀Y(exit:0)

The rest (all dash versions, freebsd sh, Schilly's bosh, yash) don't support \x as a hex sequence.

(comparison output using shcmp)

@avih wrote:

[…] I'd think the optimal behavior would be to follow the C string specification (which builds on a single character constant spec).

To summarize the C behavior of the escape sequence \x... in C99 "6.4.4.4 Character constants":

  • The sequence includes all chars until non-hex-digit (or end of the string).

[…]

So, AFAICT, that would mean changing nothing and keeping the existing 93u+m behaviour.

On the other hand, your shell test results show that 93u+(m) is very much the odd one out at the moment and all the other shells limit \x to two hexadecimal digits.

I favour backporting the 93v- change for 93u+m/1.1 to fall in line with the rest of them (including, notably, the pdksh variants). As the libast RELEASE entry said, \u and \U should be used for wide/multibyte characters.