Bill-Gray/PDCursesMod

different return codes and display in UTF8=Y builds between wincon and wingui

GitMensch opened this issue · 20 comments

Tested with the current release compiled with UTF8=Y WIDE=Y, using a getch loop (because of the defines used, that's actually a wgetch), then checking the output afterwards with the test program
getme.c, compiled with gcc -g -o getme -DPDCURSES -DPDC_WIDE -DPDC_FORCE_UTF8 getme.c -lpdcurses, executed with chcp 1252 and LANG=de_DE.ISO-8859-15@euro, entering ä ö ß @ €.

result with the old PDCurses wincon (MSYS, possibly a non-wide/non-UTF8 build):
all characters are displayed while being entered; the returned values are unexpected, as follows
65508 65526 65503 64 65408

result with the recent release, wincon (MSYS2 mingw32):
different characters are displayed while entering: ¦ ￶ ￟ @ タ [these change depending on the active chcp]; all values match the old wincon result:
65508 65526 65503 64 65408

result with the recent release, wingui (MSYS2 mingw32):
all characters are displayed while being entered; mostly all values have the "expected" result, as follows
228 246 223 64 8364

result recent release wingui (MSYS2 mingw32, with LANG=de_DE.UTF-8 and chcp 65001) - identical

result with the recent release, wincon (MSYS2 mingw32, with LANG=de_DE.UTF-8 and chcp 65001):
different characters are displayed while entering: ᅢᄂ ᅢᄊ ᅢ゚ @ ; all values come back doubled (€ should not be pressed, see #246):
65475 + 65444 65475 + 65462 65475 + 65439 64

Issues:

  • Shouldn't wincon and wingui have the same display and the same return codes?
  • Shouldn't wincon display the correct characters during entering?
  • Why does wincon return the additional 65475 before the actual value on each keypress?
  • Is it reasonable to return the decimal UTF-16 value for all those characters - including in wingui - when doing a wide+utf8 build?

Some overlap here with issue #209. Quite a bit, actually.

As William noted in that issue, getch() will work with single-byte input; "The results are unspecified if the input is not a single-byte character."

To me, this means that if you've set a particular code page, anything you enter that fits that code page ought to get the correct byte. (So in your examples, ä ö ß @ €, they all ought to work; they're in code page 1252. If you enter, say, фтш, none of which are in CP1252, "the results are unspecified.")

Your first set of results shows the ones-padding issue noted in issue #209 and fixed (I think) in commit 83dcf79. You got FFE4 FFF6 FFDF 40 FF80; you should have gotten E4 F6 DF 40 80.

I've not checked, but I assume code pages make WinCon input a dog's breakfast. If you compile WinCon with UTF8=Y but have CP1252 set, you'll get input values from Windows that are from that code page that don't line up with Unicode. Basically, using UTF8=Y and a UTF locale should work, and using CP1252 and an 8-bit character build should work, but you can't mix and match.

Also, note that in getme.c, at line 23, you set linep = line, and then set line = malloc( 100); It ought to be the other way around. gcc warned me about this :

getme.c: In function ‘mygetline’:
getme.c:22:20: warning: ‘line’ is used uninitialized in this function [-Wuninitialized]
   22 |     char * line, * linep = line;
      |                    ^~~~~

Yes, that was a copy + paste error, thanks for pointing this out, fixed.

But this question was about the difference between wincon and wingui :-) when both are built the same way.
Notes:

  • The code is actually not using getch but wgetch, because of the defines in pdcurses.h (I've explicitly noted them in the compile command).
  • I've tried the UTF8 codepage 65001, but that crashed the terminal when inputting €; I've now added the (quite bad) result to the topic starter. To make it clear: a wincon + UTF8 build with the UTF8 codepage provides an additional "HALFWIDTH HANGUL LETTER AE" before more "HALFWIDTH HANGUL ..." values for the "extended characters"...

On another point... your final result, with LANG=de_DE.UTF-8 and chcp 65001, is actually getting UTF8 returned. For example, ä = U+00E4, and is encoded in UTF-8 as the two bytes C3 A4. We're getting the 16-bit values out of the structure when we should be getting the 8-bit values, so they get padded to FFC3 FFA4 = 65475 65444.

ö and ß (and anything with a Unicode point between 0x80 and 0x7FF) will also be expanded to two bytes each in UTF-8. @, being plain ASCII, will remain one byte. But € will be expanded to three. That may be a clue to the crashing mystery (or may not be).

your final result, with LANG=de_DE.UTF-8 and chcp 65001, is actually getting UTF8 returned. For example, ä = U+00E4, and is encoded in UTF-8 as the two bytes C3 A4. We're getting the 16-bit values out of the structure when we should be getting the 8-bit values, so they get padded to FFC3 FFA4 = 65475 65444.

Not sure: Do we agree that the current result is wrong and we should get C3 A4 -> a single 50084 ?

Do we also agree that wingui should return UTF-8 values instead of the current UTF-16 decimals when built with UTF8=Y?

The more I look into this, the less I know.

About the only thing I do know is that returning FFC3 FFA4 is absolutely Wrong. You could at least make an argument for C3 A4, and a stronger one for E4. The FFC3 FFA4 is simply a sign-extension artifact from the Windows console input.

I think the results should be as follows :

  • From getch(), you should get single-byte results. If the current code page is set to CP 1252, entering € should get you 0x80, whether it's a UTF8 or 8-bit character build. If you've set a Unicode locale and it's a multi-byte sequence, "the results are unspecified". I would never use getch() except for 7-bit ASCII input, given these various headaches... but it should at least return correct 8-bit output for characters in the current code page.
  • As noted in issue #209, if you use getch() and hit € in ncurses with a Unicode locale set, you get the three-byte UTF8 sequence. This makes a good bit of sense; if you're in a UTF8 locale, you expect UTF8 results, and there would be no other way to return the full range of Unicode in one-byte pieces. But it is entirely outside the specification, and I don't really consider it to be a good idea; in such a situation, you ought to use get_wch().
  • Interestingly, I can't tell you what ncurses does with get_wch() for non-Unicode locales (such as the default C locale). On my Linux box, using the default C locale, it locks up the moment you enter a non-ASCII key. (getch() is fine in such situations.) On FreeBSD, I get three UTF8 bytes, no matter what the locale is. Still investigating, but I think I'll have to ask about this on bug-ncurses. (Nothing in the ncurses docs, as far as I can tell, mentions code pages or locales in reference to either getch() or get_wch().)
  • get_wch(), which only exists in wide builds, ought to return the correct Unicode (not UTF8) value. Entering € should cause get_wch() to return 0x20AC. Actually, I'm not sure what it should do if you enter € with code page 1252, nor what should happen if you enter, say, ⅔ (a character not represented in CP1252).
  • I'm inclined to say that use of non-UTF8 locales is unsupported if you're using a UTF8 build of PDCursesMod, and use of UTF8 locales is unsupported if you're using a non-UTF8 build. I can't come up with a reason to mix and match them, and doing so just leads to headaches.

If the above sounds confused, it's because... I'm confused. From the lack of any mention of locales or code pages in Curses documentation (as far as I've seen), I don't think much of anything is really specified here. Obviously, the flawed sign extension in the Windows console doesn't help matters.

returning FFC3 FFA4 is absolutely Wrong. You could at least make an argument for C3 A4, and a stronger one for E4.

So "yay", we agree.
If I understand you correctly, wgetch() should give us C3 A4; I think it is not reasonable to ask getch() to get us E4 when a UTF8 locale/chcp is active (how should we know about the necessary translation) - and if we have wide mode defined (which should be the case when building wide or utf8), then we actually #define getch wgetch, so we would not commonly use getch at all.

If the current code page is set to CP 1252, entering € should get you 0x80, whether it's a UTF8 or 8-bit character build.

Once this is true, there is no need to build a non-wide version any more; people can just always build with UTF8=Y, which is actually what ncurses does (more or less). I'd be very happy when this is reached (and I know that, at least among the GnuCOBOL users who use PDCurses, they would be, too).

Should we track this in a separate issue, either both in one or one for each?

The original point of this issue was that wingui always returns the UTF-16 (I guess Windows-native) decimal value, which seems to not match anywhere.

Been distracted by day-job activity, but just got a few minutes to dig into this. I don't have a fix quite yet, but thought I'd write down what I found out, partly to make sure I don't just forget it all.

The failure on the UTF8 side struck me as particularly odd. If we got that working, we could always request Unicode input, then use WideCharToMultiByte to get the equivalent in the current locale. (Despite the misleading name, that function will get us correct results on double-byte systems.) So I focussed on Unicode input.

I found that get_wch() got me the correct Unicode value (at least for the entries I tried) when compiled with Digital Mars or MSVC. MinGW got odd results for anything not in 7-bit ASCII. With CP437 (I got other results with other code pages) :

Char   MinGW   Actual Unicode
€      203f    20ac
⅔      213f    2154
½      ab      bd
⅕      213f    2155
£      9c      a3
⅐      213f    2150
π      3e3     3c0
á      a0      e1

The high byte was always correct. The lower eight bits were reset.

I'll investigate further. It appears we should be able to get this to work, except on MinGW, where we're stuck until they get console Unicode input right. (Should note that it's only console input that's messed up in MinGW. Using MinGW to build the WinGUI port, things work correctly.)

I guess you mean the "plain old MinGW" - what about the MSYS2 environment?

Haven't used that. I expect the compiler in MSYS2 would create the same code as MinGW run from Linux. That is to say, I think it's basically the same compiler, just run on a different platform.

However, I'm installing MSYS2 on my token Windows 10 box and will see what happens there. (It's taking a while. The machine in question hasn't been booted for a while, and Microsoft's top priority is updating, and it will not be denied. That's not leaving many CPU cycles for anything else.)

Compiling the test code with gcc from MSYS2 also got wrong, though more consistent, results. (Should note that MSYS2 is using gcc-12.2.0; on my Xubuntu 20.04 box, MinGW has gcc-9.3.)

With MinGW, for any Unicode entry, bit 7 gets extended over bits 8-15. I suspect this means "cast to a char and then do sign extension." Examples and the logic behind it :

U+20AC -> AC -> FFAC
U+2036 -> 36 -> 0036
U+00C1 -> C1 -> FFC1

if( unicode_val & 0x80)   /* bit 7 set */
   return( unicode_val | 0xFF00);  /* value will be 0xFF80 - 0xFFFF */
else
   return( unicode_val & 0xff);   /* value will be 0x0 - 0x7F */

Basically, we're stuck with borked results in MinGW/MSYS2-land until this bug in MinGW is fixed. Fortunately, we do get correct results from wget_wch() on all other compilers.

That leaves us with one problem we can actually fix, at least on other compilers : getting getch() to return the correct, locale-specific result if the Unicode character can be mapped into the current locale in one byte. I think for that, the logic will be : if you take the wide character and call WideCharToMultiByte and get one character as a result, return that character. Otherwise, probably best to return nothing (though we could emulate ncurses and return all the bytes; the behavior here is unspecified.)

Thank you for looking at this. "Broken extended character input" is obviously a big issue for a lot of people.

Basically, we're stuck with borked results in MinGW/MSYS2-land until this bug in MinGW is fixed.

:-( So you see no option to bit-fiddle the result back to the "expected" one?

Are you sure that there's a MinGW bug, and is there an issue for it "upstream"? (Otherwise it would be unreasonable to expect a fix.)

The interesting part is that ncurses on MinGW/MSYS2 gets relevant and "correct" results there (always mapped against LANG [I guess, more specifically, LC_CTYPE]).

:-( So you see no option to bit-fiddle the result back to the "expected" one?

None, since the actual information has been lost (some key hits map to the same values). However...

The interesting part is that ncurses on MinGW/MSYS2 gets relevant and "correct" results there

which is a very interesting part, as it suggests to me there's some other non-mangled way to get the information. And looking at the ncurses-6.3 source, it doesn't appear they pay any attention to the uChar.UnicodeChar field. Still investigating... the ncurses code is a little hard to follow in places! (In fairness, people have probably said that about some of my code...)

Okay, I've found and fixed at least a bug, and I rather suspect it is the bug. See commit e54d03f . Without this, UNICODE would only get #defined if you used both WIDE=Y and UTF8=Y; just using the latter was insufficient. And without UNICODE being #defined, 8-bit functions got used where wide-character functions should have been used (e.g., ReadConsoleInputA instead of ReadConsoleInputW).

This seems to fix some (probably all) of the issues when compiling with MinGW on my Xubuntu box and running from within Wine. Various other checks remain before closing this issue as 'fixed', but I expect that to happen.

This was my bug, and therefore doesn't/didn't affect PDCurses or anything else. (In hindsight, I should have tried the test code with PDCurses; had I seen correct results from that, it would have saved me quite a bit of hunting about in wrong places!)

Sounds reasonable, but as the ticket was created with a test compiled with UTF8=Y WIDE=Y, I guess there's more (though it's entirely possible that this solves the "bad MinGW/MSYS2" bug you thought you'd found).

Compile with those flags, and you should get Unicode results from wget_wch(). To get results for your current code page, we'd then convert the Unicode result using WideCharToMultiByte. Or, probably a better idea, just compile without WIDE=Y UTF8=Y. (I've tried this, and if you do that, Windows does the conversion to your current code page for you.)

I think it's reasonable for us to just say that if you want wide-character input, compile with WIDE=Y UTF8=Y and use wget_wch() and related functions; if you want 8-bit character input in your locale, build without those flags and use getch().

There is at least one more "gotcha". Here, we take the (signed, 8-bit) keycode and cast it to an integer. If the high bit was set, that gets extended. So you hit, say, ½, and (with code page 437) you get -85. Cast that to a 16-bit unsigned value, and you get 65536-85 = 65451. And similarly for anything that isn't 7-bit ASCII.

The tempting solution is to change the aforementioned line to read

 (unsigned char)KEV.uChar.AsciiChar;

Possibly a required solution; I don't think getch() is supposed to return a negative value except for errors. This fix, by the way, does apply to both PDCurses and PDCursesMod.

(Warning : gotta dive into some code today before a meeting later on, so I'll be incommunicado, probably until tomorrow.)

I don't think getch() is supposed to return a negative value except for errors.

That is wrong; getch() is supposed to return "an integer value" (positive or negative doesn't matter) which represents "a single-byte character", or ERR (a define that could be anything) on error (mostly, but not only, defined to happen in nodelay mode with no data available), or - if keypad() is enabled - a KEY_ value.
So we do have 0x00-0xFF for the character (no matter whether we count that as signed or unsigned), plus ERR, plus KEY_ constants in the higher values.
½ (with code page 437) must not be 65451 - because that isn't in 0x00-0xFF. If this is solved by the cast above - then that sounds good.

I think the biggest "side issue" is: getch (in our case actually wgetch(stderr)) should still get that single byte - even when the library has wide and Unicode support - as long as the input comes from a single-byte codepage like 437.

The part that is "undefined" is what happens when the input is a multibyte character; in this case the ncurses solution of "caching" the bytes and returning them one per call is reasonable; returning all bytes in the int at once (I think this is the intended behavior of PDCurses*) would be "quite fine", too.

Just drop a note when you think there's a reasonable state and I'll redo some tests using the CI builds with the getme.c from above (with an additional run that uses wget_wch for unicode/wide builds).

I am reasonably sure that commit e86d9c3 fixed this. WinCon builds compiled with UTF8=Y will get correct results from wget_wch(); for those without, getch() will return code-page-appropriate results for keys in that code page, and undefined results for anything else. Just need time to try it all out on a for-real Windows system (instead of Wine), after which this can be closed.

Closing this as completed. Note that commit e86d9c3 implements the bit described above of casting KEV.uChar.AsciiChar to type unsigned char. That was the final piece of that puzzle.

I can confirm that:

  • using WIDE=Y UTF8=Y (and chcp 65001), I get the same result in wincon and wingui now (the correct values that wingui got before)
  • both wingui and wincon display those correctly when pressing the keys

Also verified: when using a non-wide, non-UTF8 build, the output and the expected values are correctly translated into the active chcp if it isn't 65001.