Languages accepting arbitrary Unicode should not be marked as SBCS

Question

Opened this issue 3 years ago · 0 comments

This program in C (gcc) with 257 distinct random Unicode code points in U+00000–U+FFFFF* is scored as “274 chars, 274 bytes (SBCS)”, even though it can’t possibly fit in any single-byte code page (real or fictional, since one byte distinguishes at most 256 characters), occupies 1028 bytes on disk, and encodes at least 642 bytes of entropy at an information-theoretical minimum (257 code points × roughly 20 bits each).

main(){puts("񙮟󂞨󌣺򞿣𝚲񯢝񊫓򝗦񘒕򦑊򃺝󗹰󴟆󘭪򢴟󘞑󒩮񮢁󤙭𽊨𿆠򉢵𪯒𦁗񞸛󳞰󥛡𕷙񍁯󝱘𢼒𱲞򅞫󀄵򐽈򱊍󒝐󓬨򩥳󥩛𸣼򳅾𙶆󿟢󱮕񝧯򦢄󜰺򇂢񻜦򇗗󖶊񰏀󄍉񢈉🷜񱞖򑩮򕾝󘤥𼞛󁞀醀񅞒󧊃񭮍𞓭𑂌󽿵𦱷𞩠򻇝𸼊󵼏󗀰񘜋𵾏򲓅𛖞󸰮񝧐󑊔򽭃򓦺⁕𮁸򰆖󃌽򖋞𔟴񙍍沜񃺒󓔱򹟜񬎣񹟝񾊂򙶬𣖝󭲀󹭵򠂚ᄥ򶐄𕓬𙎍󖢂񋅟󼚯󴛆찇򂱪󢮌򊼔񅮀󝐎󇧢򋟜򓉰󳷒񓳾񤢱󭈰򣽃򋹙󵍔򍙣񰂰娹󏦁󒷌򉃭𤐀󇷫𩰤񅏈灑򞟠켲񠜑󥝳∖򆟃𬝇𨌎󊇇𨤾򏟑𓇈񁫑󨖯󙊵񧃰򼽣󚺄𦰝񨣸𿟎򘇒󃇲񤻡񔅡𚬜𣸌𪁟𰒗𾋕𘥅򋰏𜼕񋳼񸈀򂉴𣭄򖘽󐋲򪴟򅥁󗰍򒭀񢳡񈘶󨃵󱋳򾫔𸦑򥏣򷔙򖷤୼𐧙𷯠󄒨󁼱󈛲񉻊뽼񧊃󢷍񾋥򐈪󜀪㝺񖧖𬀛𐶐򉞼󚎓񦉻򝖽󛀻񝠣񨄟ú򋛛񪅰𤗲񘼰󰻿𓂰򡝛񼚨񌝊򖿵𲀃򣫀󔓽񽌷񍣯򲁓󮍛𽄁򿒧󮞩񃰝𩂵񨴆𠶂๢񻽐󵃌󥵊񅕦󚢽𺦣񌹔󾰀񾙨򵴕򲆮񘠀񣖯󗐑񆩊");}

(* I’ve excluded plane 16, U+100000–U+10FFFF, which doesn’t seem to decode properly in TIO URLs.)
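
The pigeonhole argument above can be checked mechanically. The following is only a minimal sketch, not how TIO actually scores programs: `count_code_points` and `score_bytes` are hypothetical names, and the input is assumed to be well-formed, NUL-terminated UTF-8. It reports one byte per character only when the program uses at most 256 distinct code points, and falls back to the UTF-8 byte length otherwise.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Length in bytes of the UTF-8 sequence starting with this lead byte.
   Assumption: input is well-formed; any unexpected byte counts as 1 byte. */
static size_t utf8_seq_len(unsigned char lead) {
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Decode the k-byte sequence at s into a code point (1 <= k <= 4). */
static uint32_t utf8_decode(const unsigned char *s, size_t k) {
    static const unsigned char lead_mask[5] = {0, 0xFF, 0x1F, 0x0F, 0x07};
    uint32_t cp = s[0] & lead_mask[k];
    for (size_t j = 1; j < k; j++) cp = (cp << 6) | (s[j] & 0x3Fu);
    return cp;
}

static int cmp_u32(const void *a, const void *b) {
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Hypothetical helper: returns the number of code points in utf8 and stores
   the number of DISTINCT code points in *distinct. On allocation failure it
   returns 0 and sets *distinct to SIZE_MAX. */
size_t count_code_points(const char *utf8, size_t *distinct) {
    const unsigned char *s = (const unsigned char *)utf8;
    size_t len = strlen(utf8), n = 0;
    uint32_t *cps = malloc((len + 1) * sizeof *cps);
    if (cps == NULL) { *distinct = SIZE_MAX; return 0; }

    for (size_t i = 0; i < len; ) {
        size_t k = utf8_seq_len(s[i]);
        if (k > len - i) k = len - i;   /* truncated tail: stay in bounds */
        cps[n++] = utf8_decode(s + i, k);
        i += k;
    }

    /* Sort, then count positions where the value changes. */
    qsort(cps, n, sizeof *cps, cmp_u32);
    *distinct = 0;
    for (size_t i = 0; i < n; i++)
        if (i == 0 || cps[i] != cps[i - 1]) (*distinct)++;

    free(cps);
    return n;
}

/* Hypothetical scoring rule: one byte per character is only plausible if
   some single-byte code page could hold every character used, i.e. at most
   256 distinct code points. Otherwise fall back to the UTF-8 byte length. */
size_t score_bytes(const char *utf8) {
    size_t distinct;
    size_t chars = count_code_points(utf8, &distinct);
    return distinct <= 256 ? chars : strlen(utf8);
}
```

Under this rule, a program with 257 or more distinct code points, such as the one above, would be scored by its UTF-8 length instead of its character count. The check is only a necessary condition for a real SBCS: a program can pass it and still use characters that no code page of the language actually contains.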