string is incorrectly identified as EUC-TW; regression from upstream
coledot opened this issue · 0 comments
coledot commented
I have a string, Se̱ora (hex bytes 53 65 cc b1 6f 72 61). It is intended to read "Señora"; never mind the invalid byte sequence (cc b1) for now. CharDet.detect(s) returns EUC-TW with 99% confidence. This doesn't sound right at all; I would expect UTF-8 or at the very least an ISO-8859-*. Other chardet implementations (uchardet and pychardet) both report UTF-8.
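For anyone wanting to reproduce the input, here is a small sketch in plain Ruby (no rchardet required) that rebuilds the string from the hex bytes above and inspects it. One thing it shows: the cc b1 pair is well-formed UTF-8 and decodes to U+0331 (COMBINING MACRON BELOW), so it is "invalid" only in the sense that it is not the c3 b1 pair that encodes ñ. The variable names are illustrative.

```ruby
# Rebuild the reported string from its hex bytes.
bytes = %w[53 65 cc b1 6f 72 61].map { |h| h.to_i(16) }
s = bytes.pack("C*").force_encoding(Encoding::UTF_8)

# The byte sequence is structurally valid UTF-8.
puts s.valid_encoding? # prints "true"

# cc b1 decodes to U+0331, a combining mark attached to the preceding "e".
puts s.codepoints.map { |c| format("U+%04X", c) }.join(" ")
# prints "U+0053 U+0065 U+0331 U+006F U+0072 U+0061"

# For comparison, the intended character ñ (U+00F1) encodes as c3 b1.
puts "\u00F1".bytes.map { |b| format("%02x", b) }.join(" ") # prints "c3 b1"
```

Since the bytes are valid UTF-8, a UTF-8 result is the reasonable expectation here.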
Taking a closer look at the code, I see there's a bit of logic that differs between uchardet and rchardet:
in uchardet, CharDistribution.cpp:
if (mTotalChars <= 0 || mFreqChars <= mDataThreshold) {
    return SURE_NO;
}
in rchardet, chardistribution.rb:
if @totalChars <= 0
  return SURE_NO
end
It appears that the second part of the check involving @freqChars was dropped somehow. Adding it back in makes rchardet work consistently again (plus specs pass).
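To illustrate why the missing clause matters, here is a minimal standalone sketch of a distribution-based confidence calculation, with the minimum-data check switchable. It is not the rchardet source: the class name, the `require_minimum_data` flag, the threshold value of 3, and the ratio of 0.75 are all assumptions made for illustration, and the ratio formula follows the general shape used by chardet-style analysers. It also assumes the lone cc b1 pair is counted as one character that happens to be "frequent".

```ruby
# Hypothetical stand-in for a character distribution analyser.
class DistributionSketch
  SURE_YES = 0.99
  SURE_NO = 0.01
  MINIMUM_DATA_THRESHOLD = 3 # assumed value, for illustration only

  def initialize(total_chars, freq_chars, typical_ratio)
    @totalChars = total_chars
    @freqChars = freq_chars
    @typicalRatio = typical_ratio
  end

  def get_confidence(require_minimum_data: true)
    return SURE_NO if @totalChars <= 0
    # The clause that appears to have been dropped: too few frequent
    # characters seen means there is not enough evidence to say yes.
    return SURE_NO if require_minimum_data && @freqChars <= MINIMUM_DATA_THRESHOLD

    if @totalChars != @freqChars
      r = @freqChars.to_f / ((@totalChars - @freqChars) * @typicalRatio)
      return r if r < SURE_YES
    end
    SURE_YES
  end
end

# One multi-byte character seen, and it counted as frequent.
one_char = DistributionSketch.new(1, 1, 0.75)
puts one_char.get_confidence(require_minimum_data: false) # prints "0.99"
puts one_char.get_confidence                              # prints "0.01"
```

Without the threshold, a single matching character is enough to fall through to SURE_YES, which would be consistent with the 99% confidence reported above.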