not-yet-awesome-rust/not-yet-awesome-rust

Code page 437 support for exchanging data with legacy DOS applications

Closed this issue · 7 comments

I need code page 437 encoding/decoding support for exchanging data with applications running inside DOSBox, DOSEMU, and my retro-computing hobby machine. (eg. for generating menus using .bat files containing box-drawing characters or for reading the contents of text files included with or generated by applications.)

Likewise, I'm sure there are legacy systems out there still running DOS on real hardware. (While not an example likely to result in Rust code, my local non-chain used games store still uses a DOS-based point-of-sale system.)

I researched this, but could only find encoding/decoding libraries without cp437 support and a single decode-only cp437 implementation with an uninspiring README that typos it as cp537 in the same screenful where it also spells cp437 correctly twice.

I can't remember whether it was encoding or encoding_rs but, last time I investigated this, I came away with the impression that "the most ideal-in-the-abstract place to add support will turn away pull requests for code pages beyond those listed in the WHATWG Encoding Standard".

As-is, my best option is to write a stub in some other language like Python or Free Pascal (the latter being the most typesafe thing I know aside from Rust) which is responsible for doing UTF8<->CP437 conversion and parsing/serializing whatever format is used inside the DOS environment... which really hurts the case for using Rust at all in my own projects in this sphere, when any GUI I write is likely to be PyQt-based.

I've started to implement cp437 (just decoding for now, should have encoding done by tomorrow), but I've a couple questions.

Firstly: which variant(s) of cp437 do you need? I'm using cp437_DOSLatinUS, as provided by The Unicode Consortium, but that's different from this Wikipedia table (which I'm adding as well, but the Unicode list was easier to use).

Secondly: do you like the API? does it behave like you'd expect?

I love nitpicks, so send 'em my way :P

The more I wrote, the more occurred to me, so this is a long response...

Firstly: which variant(s) of cp437 do you need? I'm using cp437_DOSLatinUS, as provided by The Unicode Consortium, but that's different from this Wikipedia table (which I'm adding as well, but the Unicode list was easier to use).

  1. Are the differences contained entirely within the "When translating to Unicode it should be noted that some codes do not have a unique, single Unicode equivalent; the correct choice may depend upon context." list in the Wikipedia article?

  2. For my own use, I'd prefer to err on the side of round-trip compatibility with my existing Python scripts, so I'll try to find time to check whether Python's cp437 codec matches cp437_DOSLatinUS.

    Given that I didn't notice that ambiguity, despite having put a fair bit of work into this sort of thing for my hobby projects, it might be a good idea to include a second argument to serve as an explicit dialect specifier, similar to how good Unicode APIs embody the reminder "there is no such thing as 'plain text'" by either clearly documenting the single encoding they standardized on (eg. String being UTF-8) or requiring you to explicitly specify the encoding of the text.
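
    To make that concrete, here's a rough sketch of the shape I have in mind for an explicit dialect argument; the names (Cp437Dialect, decode_cp437) and the handful of filled-in mappings are illustrative placeholders rather than anyone's actual API:

    // Hypothetical sketch of "dialect as an explicit argument"; `Cp437Dialect`,
    // `decode_cp437`, and the variant names are illustrative, not an existing API.

    /// Which published cp437-to-Unicode mapping to follow.
    #[derive(Clone, Copy, Debug, PartialEq, Eq)]
    pub enum Cp437Dialect {
        /// cp437_DOSLatinUS as published by the Unicode Consortium.
        DosLatinUs,
        /// The mapping given in the Wikipedia article's table.
        Wikipedia,
    }

    /// Decode cp437 bytes with the caller naming the dialect explicitly,
    /// mirroring how good Unicode APIs refuse to guess an encoding.
    pub fn decode_cp437(bytes: &[u8], dialect: Cp437Dialect) -> String {
        bytes
            .iter()
            .map(|&b| match (b, dialect) {
                // Both mappings agree on printable ASCII.
                (0x20..=0x7E, _) => b as char,
                // DOSLatinUS passes control codes through unchanged...
                (0x00..=0x1F, Cp437Dialect::DosLatinUs) => b as char,
                // ...while the Wikipedia table assigns graphics to that range.
                (0x01, Cp437Dialect::Wikipedia) => '☺',
                // Illustrative only; a real implementation covers all 256 bytes.
                _ => '\u{FFFD}',
            })
            .collect()
    }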

Secondly: do you like the API?

  1. It feels serviceable but, on an overall API design level, I'm not an expert in this area, so I won't be able to give you an answer that satisfies me until I've had a little time to poke at other, bigger Rust encoding crates to get a feel for what the more experienced people have come up with.

    For a single-encoding crate (where something like String::decode(CP437, some_bytes) is overkill), my first impulse would have been to write an extension trait that added String::from_cp437 (see the sketch after point 5 below), but I haven't thought about how I'd prefer to handle the Cow option in such a design.

  2. Could you clarify your rationale behind putting everything in a pc module and calling it PcCp437/pc_cp437 everywhere?

  3. If I'm reading the implementation correctly, the is_pc_cp437_or_ascii function is a pointless re-creation of is_ascii() from the standard library.

    https://play.rust-lang.org/?gist=47512268ec45c49b4555d792199d205e&version=stable

    (Also, the name is confusing because, intuitively, it reads as if it should be checking whether trying to encode the input to cp437 would run into a "codepoint has no mapping in cp437" error, not whether it's part of the 7-bit ASCII subset common between cp437 and utf-8.)

  4. Looking at your use of macros, my first impulse is to look into options for using conversion traits to avoid monomorphization bloat.

    (Not too long ago, over-eagerness to use macros was revealed to be the biggest reason clap was bloaty.)

    Of course, I'd also want to benchmark to get a proper idea of whether I was trading off speed for space in any significant way.

  5. Since it's an 8-bit encoding, you're relying on the comprehensiveness of your match block to ensure that the unreachable! never gets hit. I'd feel much safer if you were to include an automated test which verifies that no typos have rendered the unreachable! reachable.

    I didn't get much sleep last night, but here's my best effort to make a test that's as foolproof as possible.

    #[cfg(test)]
    mod tests {
        use super::pc_cp437_to_unicode;
    
        #[test]
        /// Test that all possible u8 inputs safely produce a `char` output
        //  ...and that ones in the ASCII range produce valid output.
        fn test_decode_coverage() {
            for c in 0 .. 0xFFu8 {
                if c.is_ascii() {
                    assert!(pc_cp437_to_unicode(c) == c as char);
                } else {
                    // Compile-time assertion that the output is of type `char`
                    let _: char = pc_cp437_to_unicode(c);
                }
            }
    
            // Test 0xFF separately, both because ..= is unstable and because
            // it's better to test 0xFF twice in the success case than to not
            // test it at all if ..= were in use and the = got deleted.
            let _: char = pc_cp437_to_unicode(0xFFu8);
        }
    }

    Also, while writing that, it occurred to me that an alternative design that would encode the comprehensiveness invariant in the type system and also be easily extensible would be to take [char; 256] as an argument and then use the u8 as an index into it.

    Not only would that be equivalent to verifying the unreachable! at compile time, it would make it easy to extend the design to support a variety of 8-bit codepages just by feeding in different arrays. (I'm pretty sure I remember seeing other Unicode-related projects having written Python scripts to auto-generate .rs files with similar arrays from Unicode data tables.)

    For example, codepage 850 is similar to 437 but, instead of mathematical symbols and box-drawing characters, it provides an alternative mapping of all the letters with diacritics from ISO 8859-1 (ie. latin1). Likewise, the ROM from the Atari ST line of microcomputers (which always run in graphics mode) replaces the box-drawing characters with Hebrew, and the GEM/FreeGEM/OpenGEM GUI for DOS replaces them with various international language extensions.
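
    To make the table-driven idea concrete (and combine it with the extension-trait thought from point 1), here's a minimal sketch; the trait name, the helper function, and the handful of filled-in high-byte entries are hypothetical stand-ins rather than a real crate's API:

    /// Build a decode table: identity for the ASCII range, high bytes filled in
    /// from a mapping file. Only a few entries are shown; a real table would be
    /// generated from the Unicode Consortium's cp437_DOSLatinUS data.
    fn cp437_table() -> [char; 256] {
        let mut table = ['\u{FFFD}'; 256];
        for b in 0u8..0x80 {
            table[b as usize] = b as char;
        }
        table[0xB0] = '░'; // LIGHT SHADE
        table[0xC9] = '╔'; // BOX DRAWINGS DOUBLE DOWN AND RIGHT
        table
    }

    /// Hypothetical extension trait adding a cp437 constructor to `String`.
    trait FromCp437 {
        fn from_cp437(bytes: &[u8], table: &[char; 256]) -> Self;
    }

    impl FromCp437 for String {
        fn from_cp437(bytes: &[u8], table: &[char; 256]) -> Self {
            // A u8 index can never exceed 255, so every byte is guaranteed a
            // mapping by the types alone and no `unreachable!()` is needed.
            bytes.iter().map(|&b| table[b as usize]).collect()
        }
    }

    fn main() {
        let table = cp437_table();
        assert_eq!(String::from_cp437(b"\xC9\xB0 DIR", &table), "╔░ DIR");
    }

    Swapping in a different [char; 256] is then all it takes to cover cp850 or the Atari ST set.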

does it behave like you'd expect?

I probably won't have time to try it out today. I'll have to get back to you on that.

Are the differences contained entirely within the list in the Wikipedia article?

No – the Unicode Consortium's list has a 100% overlap between cp437 and ASCII (up to 0x7F), while Wikipedia's cp437 is full of graphic characters in the 0x00..0x20 range.
A more accurate way to characterise the difference might be this passage:

Although the ROM provides a graphic for all 256 different possible 8-bit codes, some APIs will not print some code points, in particular the range 1-31 and the code at 127. Instead they will interpret them as control characters. For instance many methods of outputting text on the original IBM PC would interpret the codes for BEL, BS, CR and LF. Many printers were also unable to print these characters.

explicit dialect specifier

That's basically what I used the pc signifier for – as a dialect specifier outside, rather than inside, the function (⇒ different functions for different dialects). Although taking the dialect as an argument is probably a much better idea.

extension trait that added String::from_cp437

Which is pretty much what happened (although the pc is in the function name rather than the argument so it's from_pc_cp437).

rationale behind putting everything in a pc module

That's just a byproduct of how the API evolved while I was writing it, to be corrected.

is_pc_cp437_or_ascii() is a re-creation of is_ascii()

Quite right! I was unaware of is_ascii().

name is confusing

tbf I didn't quite like it either, but y'know "two hardest problems in programming" and it's getting replaced with is_ascii() anyway

macros en general

Just removing the fixed array ([u8; N]) implementations brought the unstripped debug rlib size from 2.8MB to 500k.
Maybe an optional compile feature, like diesel's "large-tables" to enable them, then? I'm definitely gonna be looking into minimising the output size now.
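
Something along these lines is what I'm picturing; the feature name and the trait here are placeholders just to show the shape (a minimal sketch, not the crate's real API):

    // Placeholder sketch of feature-gating the fixed-array impls, diesel-style;
    // the `fixed-arrays` feature name and the trait shown are illustrative only.

    pub trait ToCp437String {
        fn to_cp437_string(&self) -> String;
    }

    // The slice impl is always compiled in.
    impl ToCp437String for [u8] {
        fn to_cp437_string(&self) -> String {
            // Stand-in for the real byte-to-char mapping.
            self.iter().map(|&b| b as char).collect()
        }
    }

    // The per-length array impls cost nothing unless the feature is requested.
    #[cfg(feature = "fixed-arrays")]
    impl ToCp437String for [u8; 64] {
        fn to_cp437_string(&self) -> String {
            self[..].to_cp437_string()
        }
    }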

match block comprehensiveness
verifying the unreachable!() at compile time

This test does cover comprehensiveness, but with the is_ascii() refactor that unreachable!() will probably go away.
That test will be incorporated nonetheless.

take [char; 256] as an argument

Probably something more complex than that (to allow for leverage of as-UTF-8 borrows and straight moves), but yeah, taking in encoding would be pretty nice.

No – the Unicode Consortium's list has a 100% overlap between cp437 and ASCII (up to 0x7F), while Wikipedia's cp437 is full of graphic characters in the 0x00..0x20 range.

Ahh. I'd forgotten about that. (Given how frustrated I got as a kid when various attempts to use the wingdings in DOS batch files and QBasic failed, I can only assume that bit of forgetfulness is a sign of how tired I am.)

Definitely a need for a dialect specifier then, since some of my uses will be for APIs that interpret them as control characters, while others will do wingdings, and I want my serializer to be able to catch "cannot represent" errors as reliably as possible.

...but that also means that there are at least two independently variable details to account for. (One or more for the characters in the 0x80..=0xFF range, whose intended mapping to Unicode codepoints depends on context, and one to select which of the wingdings or the control characters should map to 0x00..0x20 and 0x7F, and which should instead produce an "unrepresentable" error.)

Perhaps the dialect selector should be a flags field with values such as "map control codes to wingdings" and "Map √ to ✓". (Then the question becomes one of finding the best way to get as much type-safety as possible when it's no longer an "enum, not bool" situation.)
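
As a rough sketch of one way to keep that type-safe, each independently variable detail could be its own two-variant enum bundled into an options struct rather than a bag of bools; all the names here are hypothetical:

    // Hypothetical sketch: each independently variable detail gets its own
    // two-variant enum, bundled into an options struct, so call sites stay
    // self-describing without a soup of bools. All names are illustrative.

    /// How to treat 0x01..=0x1F and 0x7F.
    #[derive(Clone, Copy, Debug, PartialEq, Eq)]
    pub enum ControlRange {
        /// Pass them through as the corresponding C0 control characters.
        ControlCodes,
        /// Map them to the graphic "wingding" characters of the PC ROM font.
        Graphics,
    }

    /// Which Unicode codepoint byte 0xFB maps to.
    #[derive(Clone, Copy, Debug, PartialEq, Eq)]
    pub enum SquareRootStyle {
        /// U+221A SQUARE ROOT
        SquareRoot,
        /// U+2713 CHECK MARK
        CheckMark,
    }

    #[derive(Clone, Copy, Debug, PartialEq, Eq)]
    pub struct Cp437Options {
        pub control_range: ControlRange,
        pub square_root: SquareRootStyle,
    }

    fn main() {
        // Every choice is named at the call site, unlike positional bools/flags.
        let opts = Cp437Options {
            control_range: ControlRange::Graphics,
            square_root: SquareRootStyle::CheckMark,
        };
        assert_eq!(opts.square_root, SquareRootStyle::CheckMark);
    }

A bitflags-style field would be terser, but it loses the per-axis exhaustiveness checking an enum gives you.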

Although taking the dialect as an argument is probably a much better idea.

Definitely. Namespacing it externally like that is backwards because...

  1. It's harder for you to share everything but the mapping table between them. (Given that all fully-populated 8-bit encodings use the same algorithms for encoding and decoding)

  2. My code would end up reversing things anyway. (ie. In any situation where I needed to output to more than one dialect, I'd have to hack around it by importing everything you export and then writing a big match statement to map identifiers I define to the associated functions.)

Which is pretty much what happened (although the pc is in the function name rather than the argument so it's from_pc_cp437).

Yeah. It's close... it just feels like the pc is an unnecessary source of confusion because it evokes a reaction of "Codepage 437 for the PC? What other kind is there?! Maybe the author's native language isn't English and it's a linguistic inversion of Code Page or Code Point, like SI being taken from the French for International System [of Units]."

There's a cognitive hierarchy of subsets that String::from_cp437(dialect) would reinforce. (Strings, Strings encoded in cp437, cp437 strings of this specific dialect.)

tbf I didn't quite like it either, but y'know "two hardest problems in programming" and it's getting replaced with is_ascii() anyway

True. If I were still young enough to think that 8-bit ASCII was a thing, I probably would have called it something like is_7bit.

Just removing the fixed array ([u8; N]) implementations brought the unstripped debug rlib size from 2.8MB to 500k.
Maybe an optional compile feature, like diesel's "large-tables" to enable them, then? I'm definitely gonna be looking into minimising the output size now.

Read the monomorphization bloat link I sent you. It's a good starting point.

This test does cover comprehensiveness, but with the is_ascii() refactor that unreachable!() will probably go away.

True, but relying on an external test data table just feels too fragile for ensuring that an unreachable! remains unreachable.

(Of course, I'm the guy who dreams of a day when someone with more time and expertise writes a CI-suitable static analyzer to report all panics (locally or in dependencies) that may be reachable from main() or pub functions so they can be manually audited and either whitelisted or patched.)

Probably something more complex than that (to allow for leverage of as-UTF-8 borrows and straight moves), but yeah, taking in encoding would be pretty nice.

Naturally. I was just thinking about what that specific function would need to do its job. It could easily be part of something more impressive.

Re: remapping:
I've implemented an API like this (extended usage/behaviour in tests) – is it what you had in mind/do you consider it usable?

I'm not at my most alert right now, but it looks good so far.

Hey, I just want to say thank you for implementing the CP437 crate! I needed to write a small BBS-related utility and also wanted to learn some Rust. This worked perfectly for my needs :)