Languages accepting arbitrary Unicode should not be marked as SBCS
andersk commented
This program in C (gcc), which contains 257 distinct random Unicode code points in U+00000–U+FFFFF*, is scored as “274 chars, 274 bytes (SBCS)”. It can’t possibly fit in any single-byte code page (real or fictional), since one byte can distinguish at most 256 characters. It also occupies 1028 bytes on disk and encodes at least 642 bytes of entropy at an information-theoretic minimum.
main(){puts("𝚲𪯒𦁗𢼒𱲞醀𞓭𑂌𦱷⁕𮁸沜𣖝ᄥ찇娹𤐀𩰤灑켲∖𬝇𨌎𨤾𓇈𦰝𣸌𪁟𰒗𘥅𜼕𣭄𐧙뽼㝺𬀛ú𤗲𓂰𲀃𩂵𠶂");}
(* I’ve excluded plane 16, U+100000–U+10FFFF, which doesn’t seem to decode properly in TIO URLs.)
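To make the two arguments above concrete, here is a rough sketch in C of the kind of check a scorer could run before labelling a submission as SBCS. The function names (`count_distinct_code_points`, `could_fit_sbcs`, `min_random_payload_bytes`) are hypothetical and not taken from any existing scorer. The sketch assumes the source is mostly well-formed UTF-8 and that the random code points are drawn uniformly from a range of 2^20 values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Count distinct Unicode code points in a UTF-8 byte string.
 * Hypothetical helper: it skips invalid lead bytes instead of
 * validating strictly. Returns (size_t)-1 if allocation fails. */
size_t count_distinct_code_points(const unsigned char *s, size_t len) {
    assert(s != NULL || len == 0);
    /* One bit per code point in U+0000..U+10FFFF. */
    uint8_t *seen = calloc(0x110000 / 8, 1);
    if (!seen) return (size_t)-1;

    size_t distinct = 0, i = 0;
    while (i < len) {
        unsigned char b = s[i];
        uint32_t cp;
        size_t n;
        if (b < 0x80)                { cp = b;        n = 1; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; n = 2; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; n = 3; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; n = 4; }
        else { i++; continue; }      /* stray continuation or invalid byte */
        if (i + n > len) break;      /* truncated sequence at end of input */
        for (size_t k = 1; k < n; k++)
            cp = (cp << 6) | (s[i + k] & 0x3F);
        i += n;
        if (cp >= 0x110000) continue;
        uint8_t mask = (uint8_t)(1u << (cp & 7));
        if (!(seen[cp >> 3] & mask)) {
            seen[cp >> 3] |= mask;
            distinct++;
        }
    }
    free(seen);
    return distinct;
}

/* Necessary (not sufficient) condition for a single-byte code page:
 * one byte can distinguish at most 256 characters. */
int could_fit_sbcs(size_t distinct_code_points) {
    return distinct_code_points <= 256;
}

/* Whole bytes needed to encode n symbols drawn uniformly from a set of
 * 2^bits_per_symbol values (rounded down). */
size_t min_random_payload_bytes(size_t n, unsigned bits_per_symbol) {
    return (n * bits_per_symbol) / 8;
}
```

With these definitions, `could_fit_sbcs(257)` is false, and `min_random_payload_bytes(257, 20)` gives 642, which matches the figure quoted above for 257 code points from a 20-bit range. Passing the distinct-character check does not prove a suitable code page exists; failing it proves none can.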