brianmario/charlock_holmes

Tests fail with ICU 59.1: many ASCII strings are also valid UTF-16

LukeShu opened this issue · 0 comments

All ASCII strings of even length are valid UTF-16 (both LE and BE). Recent versions of ICU (at least 59.1) recognize this and report UTF-16BE and UTF-16LE as candidates, in addition to the ASCII-compatible encodings that older versions returned. ICU does this for both even- and odd-length strings: it doesn't assume that it has been given the complete text, so an odd-length string could simply be truncated.
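The claim about even-length ASCII can be checked without ICU. The following is a minimal sketch using only Ruby's built-in `String` methods; the helper name `valid_as?` is made up for illustration and is not part of charlock_holmes.

```ruby
# Hypothetical helper: reinterpret the same bytes under another encoding
# label and ask Ruby whether they are well-formed in that encoding.
def valid_as?(bytes, encoding)
  bytes.dup.force_encoding(encoding).valid_encoding?
end

ascii = "test".b # four ASCII bytes: 0x74 0x65 0x73 0x74

# Each pair of ASCII bytes forms a 16-bit code unit no larger than 0x7F7F,
# which is below the surrogate range (0xD800..0xDFFF), so both byte orders
# are well-formed UTF-16.
utf16_ok = %w[UTF-16BE UTF-16LE].all? { |enc| valid_as?(ascii, enc) }

# The same four bytes decode to two code points in each byte order.
be_codepoints = ascii.dup.force_encoding("UTF-16BE").codepoints
le_codepoints = ascii.dup.force_encoding("UTF-16LE").codepoints

puts utf16_ok
puts be_codepoints.map { |cp| format("U+%04X", cp) }.join(" ")
puts le_codepoints.map { |cp| format("U+%04X", cp) }.join(" ")
```

This only shows that the bytes are well-formed UTF-16; how ICU scores them as candidates is a separate matter.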

Given the string "test", ICU 59.1 detects:

name:«ISO-8859-1» (confidence: 60%); language: «en»
name:«ISO-8859-2» (confidence: 60%); language: «ro»
name:«UTF-8» (confidence: 15%); language: «»
name:«UTF-16BE» (confidence: 10%); language: «»
name:«UTF-16LE» (confidence: 10%); language: «»

This breaks the tests that check that "test" is detected as ISO-8859-1/ISO-8859-2/UTF-8.

I'm not sure if you want to modify the tests to use a test string that is unambiguously not UTF-16, or if you want to modify the tests to expect the UTF-16 results.
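As a rough illustration of the second option, a test could tolerate the extra UTF-16 candidates rather than require an exact list. The sketch below is plain Ruby and does not call charlock_holmes. The candidate data is copied from the ICU 59.1 output quoted above, while the hash shape (`:encoding`, `:confidence`) and the helper name `matches_ignoring_utf16?` are assumptions made for the example, not the project's actual test code.

```ruby
# Candidates copied from the ICU 59.1 output quoted in this issue.
# The hash keys are an assumed shape, used here only for illustration.
DETECTED = [
  { encoding: "ISO-8859-1", confidence: 60 },
  { encoding: "ISO-8859-2", confidence: 60 },
  { encoding: "UTF-8",      confidence: 15 },
  { encoding: "UTF-16BE",   confidence: 10 },
  { encoding: "UTF-16LE",   confidence: 10 }
].freeze

UTF16_NAMES = %w[UTF-16BE UTF-16LE].freeze

# Hypothetical helper: true when the detected encodings, with any UTF-16
# candidates dropped, are exactly the expected ones (order ignored).
def matches_ignoring_utf16?(detected, expected)
  names = detected.map { |candidate| candidate[:encoding] }
  (names - UTF16_NAMES).sort == expected.sort
end

puts matches_ignoring_utf16?(DETECTED, %w[ISO-8859-1 ISO-8859-2 UTF-8])
```

A comparison like this would pass with both older ICU versions and 59.1, at the cost of no longer noticing if the UTF-16 candidates appear or disappear.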