jawah/charset_normalizer

[DETECTION] Can't reproduce documentation example.

ericlingit opened this issue · 1 comment

I copied the Handling Result documentation example and ran it, but got a different result:

from charset_normalizer import from_bytes

my_byte_str = '我没有埋怨,磋砣的只是一些时间。'.encode('gb18030')

# Assign the return value so we can fully exploit the result
result = from_bytes(
    my_byte_str
).best()

print(result.encoding)

I expected gb18030, but got cp949.

Saving the example text to a gb18030-encoded text file and running normalizer on it yields the same result (cp949).
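One way to see how little the detector has to go on is to check, using only Python's built-in codecs, which encodings accept these bytes without error. This is an illustrative sketch, not something charset-normalizer does in this form: `decodable_with` is a hypothetical helper, and the candidate list is hand-picked from encodings mentioned in the verbose output.

```python
from typing import Dict, Iterable


def decodable_with(payload: bytes, encodings: Iterable[str]) -> Dict[str, str]:
    """Map each encoding that strictly decodes `payload` to the decoded text."""
    readings = {}
    for name in encodings:
        try:
            # bytes.decode() uses errors="strict" by default
            readings[name] = payload.decode(name)
        except UnicodeDecodeError:
            continue  # this code page cannot represent the byte sequence
    return readings


payload = "我没有埋怨,磋砣的只是一些时间。".encode("gb18030")
# Hand-picked candidates (an assumption for illustration only)
candidates = ["gb18030", "gbk", "gb2312", "cp949", "euc_kr", "utf_8"]

for name, text in decodable_with(payload, candidates).items():
    print(f"{name:8} {text!r}")
```

Every encoding that survives this check yields a structurally valid reading of the same bytes, so the choice between them has to come from heuristics rather than from decoding errors.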

Verbose output
Using the CLI, run normalizer -v ./my-file.txt and paste the result in here.

2022-08-31 12:16:33,511 | Level 5 | override steps (5) and chunk_size (512) as content does not fit (32 byte(s) given) parameters.
2022-08-31 12:16:33,512 | Level 5 | Code page ascii does not fit given bytes sequence at ALL. 'ascii' codec can't decode byte 0xce in position 0: ordinal not in range(128)
2022-08-31 12:16:33,513 | Level 5 | Code page utf_8 does not fit given bytes sequence at ALL. 'utf-8' codec can't decode byte 0xce in position 0: invalid continuation byte
2022-08-31 12:16:33,514 | Level 5 | Code page big5 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2022-08-31 12:16:33,516 | Level 5 | big5 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 25.000000 %.
2022-08-31 12:16:33,518 | Level 5 | Code page big5hkscs is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2022-08-31 12:16:33,518 | Level 5 | big5hkscs was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 25.000000 %.
2022-08-31 12:16:33,519 | Level 5 | cp037 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 85.200000 %.
2022-08-31 12:16:33,519 | Level 5 | cp1026 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2022-08-31 12:16:33,520 | Level 5 | cp1125 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 312.500000 %.
2022-08-31 12:16:33,520 | Level 5 | cp1140 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2022-08-31 12:16:33,521 | Level 5 | cp1250 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 193.100000 %.
2022-08-31 12:16:33,521 | Level 5 | cp1251 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 109.000000 %.
2022-08-31 12:16:33,522 | Level 5 | cp1252 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 308.400000 %.
2022-08-31 12:16:33,522 | Level 5 | Code page cp1253 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xd2 in position 1: character maps to <undefined>
2022-08-31 12:16:33,523 | Level 5 | cp1254 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 308.400000 %.
2022-08-31 12:16:33,523 | Level 5 | Code page cp1255 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xca in position 20: character maps to <undefined>
2022-08-31 12:16:33,524 | Level 5 | cp1256 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 106.200000 %.
2022-08-31 12:16:33,525 | Level 5 | Code page cp1257 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xa1 in position 30: character maps to <undefined>
2022-08-31 12:16:33,525 | Level 5 | cp1258 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 270.500000 %.
2022-08-31 12:16:33,526 | Level 5 | cp273 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2022-08-31 12:16:33,526 | Level 5 | Code page cp424 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xce in position 0: character maps to <undefined>
2022-08-31 12:16:33,527 | Level 5 | cp437 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 287.500000 %.
2022-08-31 12:16:33,527 | Level 5 | cp500 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2022-08-31 12:16:33,528 | Level 5 | cp775 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 337.500000 %.
2022-08-31 12:16:33,528 | Level 5 | cp850 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2022-08-31 12:16:33,529 | Level 5 | cp852 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 419.500000 %.
2022-08-31 12:16:33,530 | Level 5 | cp855 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 318.800000 %.
2022-08-31 12:16:33,531 | Level 5 | cp857 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 546.100000 %.
2022-08-31 12:16:33,531 | Level 5 | cp858 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2022-08-31 12:16:33,531 | Level 5 | cp860 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2022-08-31 12:16:33,532 | Level 5 | cp861 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2022-08-31 12:16:33,532 | Level 5 | cp862 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2022-08-31 12:16:33,532 | Level 5 | cp863 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2022-08-31 12:16:33,533 | Level 5 | cp864 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 192.000000 %.
2022-08-31 12:16:33,534 | Level 5 | cp865 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2022-08-31 12:16:33,534 | Level 5 | cp866 is deemed too similar to code page cp1125 and was consider unsuited already. Continuing!
2022-08-31 12:16:33,535 | Level 5 | cp869 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 331.200000 %.
2022-08-31 12:16:33,535 | Level 5 | Code page cp932 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2022-08-31 12:16:33,536 | Level 5 | cp932 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 205.200000 %.
2022-08-31 12:16:33,536 | Level 5 | Code page cp949 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2022-08-31 12:16:33,537 | Level 5 | cp949 passed initial chaos probing. Mean measured chaos is 0.000000 %
2022-08-31 12:16:33,537 | Level 5 | cp949 should target any language(s) of ['Korean']
2022-08-31 12:16:33,537 | Level 5 | Code page cp950 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2022-08-31 12:16:33,537 | Level 5 | cp950 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 25.000000 %.
2022-08-31 12:16:33,537 | Level 5 | Code page euc_jis_2004 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2022-08-31 12:16:33,538 | Level 5 | euc_jis_2004 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 25.000000 %.
2022-08-31 12:16:33,538 | Level 5 | Code page euc_jisx0213 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2022-08-31 12:16:33,538 | Level 5 | euc_jisx0213 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 25.000000 %.
2022-08-31 12:16:33,538 | Level 5 | Code page euc_jp does not fit given bytes sequence at ALL. 'euc_jp' codec can't decode byte 0xa3 in position 10: illegal multibyte sequence
2022-08-31 12:16:33,539 | Level 5 | Code page euc_kr is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2022-08-31 12:16:33,539 | Level 5 | euc_kr passed initial chaos probing. Mean measured chaos is 0.000000 %
2022-08-31 12:16:33,539 | Level 5 | euc_kr should target any language(s) of ['Korean']
2022-08-31 12:16:33,539 | Level 5 | Code page gb18030 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2022-08-31 12:16:33,540 | Level 5 | gb18030 passed initial chaos probing. Mean measured chaos is 0.000000 %
2022-08-31 12:16:33,540 | Level 5 | gb18030 should target any language(s) of ['Chinese', 'Classical Chinese']
2022-08-31 12:16:33,540 | Level 5 | Code page gb2312 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2022-08-31 12:16:33,541 | Level 5 | gb2312 passed initial chaos probing. Mean measured chaos is 0.000000 %
2022-08-31 12:16:33,541 | Level 5 | gb2312 should target any language(s) of ['Chinese', 'Classical Chinese']
2022-08-31 12:16:33,541 | Level 5 | Code page gbk is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2022-08-31 12:16:33,541 | Level 5 | gbk passed initial chaos probing. Mean measured chaos is 0.000000 %
2022-08-31 12:16:33,541 | Level 5 | gbk should target any language(s) of ['Chinese', 'Classical Chinese']
2022-08-31 12:16:33,542 | Level 5 | hp_roman8 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 259.000000 %.
2022-08-31 12:16:33,543 | Level 5 | Code page hz does not fit given bytes sequence at ALL. 'hz' codec can't decode byte 0xce in position 0: illegal multibyte sequence
2022-08-31 12:16:33,543 | Level 5 | Code page iso2022_jp does not fit given bytes sequence at ALL. 'iso2022_jp' codec can't decode byte 0xce in position 0: illegal multibyte sequence
2022-08-31 12:16:33,543 | Level 5 | Code page iso2022_jp_1 does not fit given bytes sequence at ALL. 'iso2022_jp_1' codec can't decode byte 0xce in position 0: illegal multibyte sequence
2022-08-31 12:16:33,544 | Level 5 | Code page iso2022_jp_2 does not fit given bytes sequence at ALL. 'iso2022_jp_2' codec can't decode byte 0xce in position 0: illegal multibyte sequence
2022-08-31 12:16:33,544 | Level 5 | Code page iso2022_jp_2004 does not fit given bytes sequence at ALL. 'iso2022_jp_2004' codec can't decode byte 0xce in position 0: illegal multibyte sequence
2022-08-31 12:16:33,544 | Level 5 | Code page iso2022_jp_3 does not fit given bytes sequence at ALL. 'iso2022_jp_3' codec can't decode byte 0xce in position 0: illegal multibyte sequence
2022-08-31 12:16:33,545 | Level 5 | Code page iso2022_jp_ext does not fit given bytes sequence at ALL. 'iso2022_jp_ext' codec can't decode byte 0xce in position 0: illegal multibyte sequence
2022-08-31 12:16:33,545 | Level 5 | Code page iso2022_kr does not fit given bytes sequence at ALL. 'iso2022_kr' codec can't decode byte 0xce in position 0: illegal multibyte sequence
2022-08-31 12:16:33,546 | Level 5 | iso8859_10 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 183.200000 %.
2022-08-31 12:16:33,546 | Level 5 | iso8859_11 passed initial chaos probing. Mean measured chaos is 0.000000 %
2022-08-31 12:16:33,548 | Level 5 | iso8859_11 should target any language(s) of ['Thai']
2022-08-31 12:16:33,548 | Level 5 | iso8859_13 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 232.300000 %.
2022-08-31 12:16:33,549 | Level 5 | iso8859_14 is deemed too similar to code page iso8859_10 and was consider unsuited already. Continuing!
2022-08-31 12:16:33,549 | Level 5 | iso8859_15 is deemed too similar to code page cp1252 and was consider unsuited already. Continuing!
2022-08-31 12:16:33,549 | Level 5 | iso8859_16 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 201.100000 %.
2022-08-31 12:16:33,550 | Level 5 | iso8859_2 is deemed too similar to code page cp1250 and was consider unsuited already. Continuing!
2022-08-31 12:16:33,550 | Level 5 | Code page iso8859_3 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xc3 in position 2: character maps to <undefined>
2022-08-31 12:16:33,550 | Level 5 | iso8859_4 is deemed too similar to code page iso8859_10 and was consider unsuited already. Continuing!
2022-08-31 12:16:33,551 | Level 5 | iso8859_5 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 136.400000 %.
2022-08-31 12:16:33,551 | Level 5 | Code page iso8859_6 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xb9 in position 9: character maps to <undefined>
2022-08-31 12:16:33,551 | Level 5 | Code page iso8859_7 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xd2 in position 1: character maps to <undefined>
2022-08-31 12:16:33,552 | Level 5 | Code page iso8859_8 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xce in position 0: character maps to <undefined>
2022-08-31 12:16:33,552 | Level 5 | iso8859_9 is deemed too similar to code page cp1252 and was consider unsuited already. Continuing!
2022-08-31 12:16:33,552 | Level 5 | Code page johab does not fit given bytes sequence at ALL. 'johab' codec can't decode byte 0xce in position 0: illegal multibyte sequence
2022-08-31 12:16:33,553 | Level 5 | koi8_r was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 287.300000 %.
2022-08-31 12:16:33,553 | Level 5 | kz1048 is deemed too similar to code page cp1251 and was consider unsuited already. Continuing!
2022-08-31 12:16:33,553 | Level 5 | latin_1 is deemed too similar to code page cp1252 and was consider unsuited already. Continuing!
2022-08-31 12:16:33,554 | Level 5 | mac_cyrillic was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 126.700000 %.
2022-08-31 12:16:33,554 | Level 5 | mac_greek was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 102.700000 %.
2022-08-31 12:16:33,555 | Level 5 | mac_iceland was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 256.700000 %.
2022-08-31 12:16:33,555 | Level 5 | mac_latin2 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 80.000000 %.
2022-08-31 12:16:33,556 | Level 5 | mac_roman is deemed too similar to code page mac_iceland and was consider unsuited already. Continuing!
2022-08-31 12:16:33,556 | Level 5 | mac_turkish is deemed too similar to code page mac_iceland and was consider unsuited already. Continuing!
2022-08-31 12:16:33,557 | Level 5 | ptcp154 is deemed too similar to code page cp1251 and was consider unsuited already. Continuing!
2022-08-31 12:16:33,557 | Level 5 | Code page shift_jis does not fit given bytes sequence at ALL. 'shift_jis' codec can't decode byte 0xf1 in position 7: illegal multibyte sequence
2022-08-31 12:16:33,557 | Level 5 | Code page shift_jis_2004 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2022-08-31 12:16:33,558 | Level 5 | shift_jis_2004 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 172.400000 %.
2022-08-31 12:16:33,558 | Level 5 | Code page shift_jisx0213 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2022-08-31 12:16:33,558 | Level 5 | shift_jisx0213 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 172.400000 %.
2022-08-31 12:16:33,559 | Level 5 | tis_620 passed initial chaos probing. Mean measured chaos is 0.000000 %
2022-08-31 12:16:33,559 | Level 5 | tis_620 should target any language(s) of ['Thai']
2022-08-31 12:16:33,559 | Level 5 | Encoding utf_16 wont be tested as-is because it require a BOM. Will try some sub-encoder LE/BE.
2022-08-31 12:16:33,560 | Level 5 | Code page utf_16_be is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2022-08-31 12:16:33,560 | Level 5 | utf_16_be was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 47.100000 %.
2022-08-31 12:16:33,560 | Level 5 | Code page utf_16_le is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2022-08-31 12:16:33,561 | Level 5 | utf_16_le was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 171.900000 %.
2022-08-31 12:16:33,561 | Level 5 | Encoding utf_32 wont be tested as-is because it require a BOM. Will try some sub-encoder LE/BE.
2022-08-31 12:16:33,561 | Level 5 | Code page utf_32_be does not fit given bytes sequence at ALL. 'utf-32-be' codec can't decode bytes in position 0-3: code point not in range(0x110000)
2022-08-31 12:16:33,561 | Level 5 | Code page utf_32_le does not fit given bytes sequence at ALL. 'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000)
2022-08-31 12:16:33,562 | Level 5 | Code page utf_7 does not fit given bytes sequence at ALL. 'utf7' codec can't decode byte 0xce in position 0: unexpected special character
2022-08-31 12:16:33,562 | DEBUG | Encoding detection: Found cp949 as plausible (best-candidate) for content. With 2 alternatives.
{
    "path": "/home/eric/Documents/xxx-charset-normalizer/my-file.txt",
    "encoding": "cp949",
    "encoding_aliases": [
        "949",
        "ms949",
        "uhc"
    ],
    "alternative_encodings": [
        "euc_kr"
    ],
    "language": "Korean",
    "alphabets": [
        "CJK Symbols and Punctuation",
        "CJK Unified Ideographs",
        "Halfwidth and Fullwidth Forms",
        "Hangul Syllables"
    ],
    "has_sig_or_bom": false,
    "chaos": 0.0,
    "coherence": 0.0,
    "unicode_path": null,
    "is_preferred": true
}
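The `alphabets` field above lists both "CJK Unified Ideographs" and "Hangul Syllables" for the cp949 reading. A rough way to see where that mix comes from is to tally characters per Unicode block for each decoding. This is only a sketch: `script_tally` is a hypothetical helper covering two hard-coded block ranges, and it is not how charset-normalizer computes chaos or coherence.

```python
from collections import Counter


def script_tally(text: str) -> Counter:
    """Count characters per coarse Unicode block (small hand-picked subset)."""
    tally = Counter()
    for ch in text:
        cp = ord(ch)
        if 0xAC00 <= cp <= 0xD7A3:
            tally["Hangul Syllables"] += 1
        elif 0x4E00 <= cp <= 0x9FFF:
            tally["CJK Unified Ideographs"] += 1
        else:
            tally["other"] += 1
    return tally


payload = "我没有埋怨,磋砣的只是一些时间。".encode("gb18030")

# Compare the intended reading with the one the detector picked
for name in ("gb18030", "cp949"):
    print(name, dict(script_tally(payload.decode(name))))
```

The gb18030 reading contains ideographs and punctuation only, while the cp949 reading interleaves Hangul syllables with ideographs.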

Expected encoding

Expected gb18030, got cp949.

Desktop (please complete the following information):

  • OS: Ubuntu Linux 20.04, 64-bit
  • Python 3.8.10
  • charset-normalizer 2.1.1

Good catch. Unfortunately, I should not have used that example.
It contains too few bytes to make a good or reasonable guess. I have updated the documentation with a more reasonable example.

I can't spend time trying to figure out a way to guess correctly in cases like this.
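When the expected encoding family is known in advance, one possible workaround for short inputs is to narrow the search with the `cp_isolation` parameter of `from_bytes`. The sketch below assumes charset-normalizer is installed. `detect_within` is a hypothetical wrapper name, and the lazy import is there only so the snippet returns `None` instead of failing when the library is missing.

```python
from typing import Optional, Sequence


def detect_within(payload: bytes, candidates: Sequence[str]) -> Optional[str]:
    """Run detection restricted to `candidates`; return the best encoding name."""
    try:
        # Lazy import: an illustrative choice so the sketch degrades gracefully
        from charset_normalizer import from_bytes
    except ImportError:
        return None
    # cp_isolation limits the encodings that from_bytes will try
    match = from_bytes(payload, cp_isolation=list(candidates)).best()
    return match.encoding if match is not None else None


payload = "我没有埋怨,磋砣的只是一些时间。".encode("gb18030")
print(detect_within(payload, ["gb18030"]))
```

This does not make the guess itself any smarter. It only removes candidates, such as cp949, that the caller already knows to be wrong.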