Tests fail with ICU 59.1: many ASCII strings are also valid UTF-16
LukeShu opened this issue · 0 comments
All ASCII strings of even length are valid UTF-16 (LE and BE). Recent versions of ICU (at least 59.1) recognize this and detect them as `UTF-16BE` and `UTF-16LE`, in addition to the ASCII-compatible encodings that older versions returned. It does this for both even- and odd-length strings: it doesn't assume that it is given the complete text, so an odd-length string could just be truncated.
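To illustrate why even-length ASCII is always valid UTF-16, here is a minimal Python sketch. It uses Python's strict `utf-16-be`/`utf-16-le` codecs as a stand-in for "valid UTF-16"; it does not call ICU, and the helper name `is_valid_utf16` is made up for this example. Two ASCII bytes always form a code unit of at most 0x7F7F, which is below the surrogate range (0xD800 to 0xDFFF), so no pair can be ill-formed.

```python
def is_valid_utf16(data: bytes, codec: str) -> bool:
    """Return True if `data` decodes under Python's strict UTF-16 codec.

    Hypothetical helper for illustration; this is not ICU's detector.
    """
    try:
        data.decode(codec)
    except UnicodeDecodeError:
        return False
    return True


# Largest 16-bit code unit that two ASCII bytes can form.
max_ascii_unit = (0x7F << 8) | 0x7F

# Exhaustively check every pair of ASCII bytes in both byte orders.
all_pairs_valid = all(
    is_valid_utf16(bytes([hi, lo]), codec)
    for hi in range(0x80)
    for lo in range(0x80)
    for codec in ("utf-16-be", "utf-16-le")
)

print(hex(max_ascii_unit), all_pairs_valid)
print(is_valid_utf16(b"test", "utf-16-be"), is_valid_utf16(b"test", "utf-16-le"))
```

Note that Python's strict decoder rejects an odd-length input such as `b"tes"` as truncated, whereas ICU, as described above, still reports UTF-16 for it.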
Given the string `test`, ICU 59.1 detects:
name:«ISO-8859-1» (confidence: 60%); language: «en»
name:«ISO-8859-2» (confidence: 60%); language: «ro»
name:«UTF-8» (confidence: 15%); language: «»
name:«UTF-16BE» (confidence: 10%); language: «»
name:«UTF-16LE» (confidence: 10%); language: «»
This breaks the tests that check that `test` is detected as ISO-8859-1/ISO-8859-2/UTF-8.
I'm not sure if you want to modify the tests to use a test string that is unambiguously not UTF-16, or if you want to modify the tests to expect the UTF-16 results.
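For the first option, a rough way to screen candidate test strings is to check which UTF-16 byte orders accept them. The sketch below is in Python and uses its strict codecs only; the helper name `utf16_orders_accepting` and the candidate string `ØØtest` are assumptions for illustration. It shows only that a string is ill-formed UTF-16 in both byte orders, not what ICU's heuristic would actually report for it, so any candidate would still need to be run through ICU 59.1.

```python
def utf16_orders_accepting(data: bytes) -> list[str]:
    """List the UTF-16 byte orders under which `data` decodes strictly.

    Hypothetical helper; a proxy for well-formedness, not ICU's detector.
    """
    accepted = []
    for codec in ("utf-16-be", "utf-16-le"):
        try:
            data.decode(codec)
        except UnicodeDecodeError:
            continue
        accepted.append(codec)
    return accepted


# The string the current tests use: well-formed in both byte orders.
ambiguous = b"test"

# Illustrative candidate: the two 0xD8 bytes form the high surrogate 0xD8D8
# in either byte order, and the following code unit is not a low surrogate.
candidate = "ØØtest".encode("iso-8859-1")

print(utf16_orders_accepting(ambiguous))
print(utf16_orders_accepting(candidate))
```

No pure-ASCII string can pass this check, so a test string that is unambiguously not UTF-16 has to contain non-ASCII bytes.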