att/ast

Should we still support EBCDIC?

krader1961 opened this issue ยท 8 comments

Coverity pointed out a condition that is always true on ASCII using systems. Which lead @siteshwar to open PR #741. That transformation, however, is only valid on ASCII using systems. There are other places in the code which attempt to support both ASCII and EBCDIC. See, for example, the definition of getop() in src/cmd/ksh93/sh/streval.c. But given all the places in the code which have hardcoded chars like \033 (escape) my confidence is low that the current code will function correctly in an EBCDIC environment.

If we continue to support EBCDIC it should be done via build time tests such as #if ('a' == 97). Probably coupled with formal abstractions and preprocessor symbols. Personally I would just as soon drop whatever support there is for EBCDIC and formally state that ksh only supports ASCII and encodings like ISO-8859 and UNICODE which are compatible with ASCII. Which basically means everything but EBCDIC.

๐Ÿ‘ from me - EBCDIC appears to be mostly IBM-specific and I would rather devote any energy we'd all expend towards EBCDIC towards better UTF-8/UNICODE support where possible.

There are just a handful of places where something like grep '^#if .*\'' returns matches. And most of those are in src/lib/libast/include/ast_ccode.h. Which seems to do, more or less, what I proposed in my original comment. Assuming we want to retain support for EBCDIC systems. Note that symbols like CC_esc are used elsewhere in the code. Which suggests that the run-time test in src/cmd/ksh93/sh/string.c is a mistake that should have used the symbols from ast_ccode.h.

One option is to replace the problematic run-time test in src/cmd/ksh93/sh/string.c that caused Coverity to report a problem with the CC_esc preprocessor symbol. My preference is to drop support for IBM EBCDIC.

I do not think we should support EBCDIC. And, to be honest, I doubt that the EBCDIC support ever worked in the recent releases of ksh.

Yes, I don't care for EBCDIC support, either. How do we know that it was ever built successfully on EBCDIC machines? It may have been only partly implemented. If bash runs on mainframes, does it support EBCDIC or just ASCII?

How do we know that it was ever built successfully on EBCDIC machines? It may have been only partly implemented.

That's what I'm wondering. There are lots of tricks programmers use that work for ASCII but not EBCDIC. For example, tests like c >= 'a' && c <= 'z'. I wonder how many places in the AST/ksh code rely on those tricks.

Also, per this EE Times article there are 57 variants of EBCDIC. Most of those are the equivalent of ISO-8859-1, ISO-8859-2, etc. (or Windows code pages) to support alphabets in other countries and aren't really relevant to this issue. However, I'm aware of at least two variants that differ with respect to codepoints that might affect ksh.

I would be surprised if anyone objects to our decision of dropping it. Bad day for EBCDIC!

@kdudka's intuition is correct. Not surprisingly some judicious grep'ing does find places in the code which are ipso facto broken on an EBCDIC system. From src/lib/libast/hash/strkey.c:

if (c >= 'a' && c <= 'z')

From src/cmd/ksh93/sh/name.c:

if (*buff >= 'A' && *buff <= 'Z') *buff += 'a' - 'A';

There are others but you get the idea. I'll gin up a PR to remove the non-ASCII getop() implementation and add a clear ASCII or ASCII compatible char set only warning.

While working on #579 to remove the AST locale subsystem I stumbled upon src/lib/libast/include/ast_ccode.h which contains, among other things, this:

#define CC_ASCII 1    /* ISO-8859-1            */
#define CC_EBCDIC_E 2 /* Xopen dd(1) EBCDIC        */
#define CC_EBCDIC_I 3 /* Xopen dd(1) IBM        */
#define CC_EBCDIC_O 4 /* IBM-1047 mvs OpenEdition    */
#define CC_EBCDIC_S 5 /* Siemens posix-bc        */
#define CC_EBCDIC_H 6 /* IBM-37 AS/400        */
#define CC_EBCDIC_M 7 /* IBM mvs cobol        */
#define CC_EBCDIC_U 8 /* microfocus cobol        */

So the situation is even worse than I remembered vis-a-vis incompatible EBCDIC definitions.

My guess is that early in the life of this code there was a strong effort made to support EBCDIC but over time idioms, like those in my previous comment, which only work with ASCII, crept in. When I'm done with #579 I'll remove the ast_ccode.h and uses of the symbols it defines.