Bill-Gray/PDCursesMod

question: return of "extended/encoded" characters with CHTYPE_64

GitMensch opened this issue · 11 comments

Checking curses.h we have in general the following character range:

#ifdef CHTYPE_64
    # define PDC_CHARTEXT_BITS   21
    # define A_CHARTEXT   (chtype)( ((chtype)0x1 << PDC_CHARTEXT_BITS) - 1)
# else         /* plain ol' 32-bit chtypes */
    # define PDC_CHARTEXT_BITS      16
#ifdef PDC_WIDE
    # define A_CHARTEXT   (chtype)0x0000ffff
#else          /* with 8-bit chars,  we have bits for these attribs : */
    # define A_CHARTEXT   (chtype)0x000000ff
#endif
#endif

So: CHTYPE_32 + non-wide: 0-255; CHTYPE_32 + wide: 0-65535, default: more.

The question is: what is actually stored in there? I guess it depends on the UTF8 flag during compile?

Observation with CHTYPE_64 on wincon after getch(), which the header defines as wgetch(stdscr):


  • a ->97 / 0x61, as expected
  • ä when chcp 850 is in effect: 65412 / 0xFF84 (the hex value in codepage 850 is 0x84)
  • ä when chcp 1252 is in effect: 65508 / 0xFFE4 (the hex value in codepage 1252 is 0xE4)

The questions:

  • Would the same value be returned by CHTYPE_32 + wide + non-utf8 version?
  • Would the same value be returned by the VT and/or WinGUI port?
  • Would the same value be returned on GNU/Linux if a matching locale would be used?
  • If it had been compiled with UTF8, would the result be the UTF-8 byte pair 0xC3 0xA4 (Unicode value 0x00E4) - or does it all depend on the Windows codepage / locale here?
  • Is there a way to query/change the encoding that is used?
  • Is there a clean way to ask "is that an encoded character"? (The quick hack to return the "expected" value to the application was to check for attributes and special values first, then & 0xFF the return of getch() to ensure an unsigned char - but I guess this may be problematic in some cases.)
  • Any insights how ncurses + ncursesw handle this?
  • Any insights into how MinGW's character functions like isupper() and tolower() handle extended characters, and what they expect to be passed (I'm quite sure that 0xFFE4 would not be the correct value)?

Hmmm... First, I think I'll puzzle through what "ought" to happen. This should apply to PDCurses and PDCursesMod; I'll write PDCursesX to indicate that both flavors work this way.

For input, the CHTYPE setting should not matter at all. WIDE only matters in that if it's not defined, you have no access to get_wch() (wide-character getch()). This appears to be the case.

The behavior should not vary with platform (i.e., if you hit ä in a PDCursesX program, swap in a different shared library compiled with the same WIDE and UTF8 settings, and again hit ä in that program, you should get the same value as before.) The 'Input Test' routine in testcurs is a good way to check this out.

With ncurses (both wide and non-wide), getch() returns non-ASCII characters (on my machine, anyway) in UTF8 : ä gets you 0xc3 followed by 0xa4. I think this is correct; the man page for ncurses wgetch() says

"The get functions are described in the XSI Curses standard, Issue 4. They read single-byte characters only."

...with the idea presumably being that if you really wanted to get the multi-byte value (0xe4 for Unicode if you hit ä), you'd call get_wch(). That function will return a single wide-character value, with no UTF8 involved. (And it does so in PDCursesX and ncurses.)

I think under Windows, getch() would continue to return byte values. You'd get the above two for a UTF8 locale, or you might get 0xe4 in a CP437 locale... but in any case, there would be an iron rule that getch() returns values from 0 to 255 for "normal" characters, and values above that for special (function) keys.

PDCursesX does not do this : the distinction between getch() and get_wch() is almost negligible (the latter returns the 'is function key' boolean). I think we could revise getch() to match ncurses (and therefore the XSI Curses standard), at the cost of probably breaking a good bit of existing code.

I think we could revise getch() to match ncurses (and therefore the XSI Curses standard), at the cost of probably breaking a good bit of existing code.

I guess so; this should probably only happen either with a define set (via the Makefile) or - as a "bug fix to the portable curses" - by default, with an option to disable it. @wmcbrine What is your take on this?

Should note that this would be quite easy to do. wget_wch() would be our main function for keyboard input, as it probably should be anyway. The ncurses-compatible wgetch(), enabled perhaps with #define NCURSES_GETCH, would look like

  • if we have cached bytes, return one of them;
  • otherwise, get a wide character from wget_wch();
  • then convert that character to a multi-byte string, return the first byte of said string, and cache any remaining bytes for the next call(s) to wgetch().

And our "normal" wgetch() would simply call wget_wch() and return its value unaltered, much as wget_wch() currently just returns a value from getch().

A small detail is that if PDC_FORCE_UTF8 were defined, the wide character obtained from wget_wch() would be converted to multiple bytes with PDC_wc_to_utf8() (i.e., "ignore the locale and gimme a UTF8 string"). Otherwise, we'd use wctomb().

That sounds nice. I think the define could be moved to the application:

  • PDCurses is compiled with the current version as default (so no breakage ahead)
  • gets a new function that does the caching and returning
  • depending on NCURSES_GETCH, the application could explicitly request ncurses-style returns (because we'd call a different getch function from the start).

As we'd get 0xFF as the first byte in the case above: what does it stand for? Possibly I've missed a define for that "key"?

Meant to mention : I don't know why you're getting 0xFFnn. Seems to me that in those code pages, you ought to get single bytes when hitting ä: the values you actually got, without the 0xFF in front. That would apply on any 8-bit (non-wide) build, or on a WIDE=Y build without UTF8=Y.

Add UTF8=Y, and as it stands, hitting ä ought to cause wget_wch() or getch() to return 0xE4, the Unicode value for that character. If we followed the XSI standard, wget_wch() would continue to return that value, but getch() would UTF8-encode it, and hitting ä would cause two bytes to be returned : 0xC3 followed by 0xA4, and the application has to be bright enough to decode the UTF8 stream and realize that ä was hit. (Which is why you'd probably really want to use wget_wch() and get the correct wide character from the beginning, with no need to parse multi-byte encodings.)

  1. The idea that PDCurses' current behavior here is wrong with regard to multibyte sequences (like UTF-8), per X/Open, is incorrect. X/Open says: "These functions read a single-byte character from the terminal associated with the current or specified window. The results are unspecified if the input is not a single-byte character." (emphasis mine). Compatibility with ncurses is another matter, as is the question of what it "should" do (ambiguous, to me).

  2. One thing you might want to consider is what to do when it's time to put this input back on the screen. That is to say... getstr() will return a UTF-8 string, and addstr() will put one back. With getch(), it perhaps lacks an exactly corresponding output function, but addch() is close. And as it is, a character received via getch() can in fact be displayed via addch(). Now suppose getch() returns pieces of a UTF-8 character. What should addch() do?

  3. An "0xFF" prefix suggests an incorrect sign extension. That would indeed seem to be unintended behavior. But, I'm not getting that from getch() here under X11. I'll check wincon later.

OK, this appears to affect only wincon in narrow (i.e. not PDC_WIDE) mode. Will fix...

Thanks for the fix and the additional clarifications.

So, just for reference, testing shows that the ReadConsoleInputA() sign extension "bug" isn't in XP. Some forum comments suggest that it first appeared in Windows 8.