jmhodges/rchardet

string is incorrectly identified as EUC-TW; regression from upstream

coledot opened this issue · 0 comments

I have a string, Se̱ora (hex bytes 53 65 cc b1 6f 72 61). It is intended to read "Señora"; never mind the wrong byte sequence (cc b1 instead of c3 b1 for ñ) for now. CharDet.detect(s) returns EUC-TW with 99% confidence. This doesn't sound right at all; I would expect UTF-8 or at the very least one of the ISO-8859-* encodings. Other chardet implementations (uchardet and pychardet) both report UTF-8.
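To reproduce, the string can be rebuilt from the hex bytes above. This is a minimal sketch using only core Ruby; the variable names are mine, and the CharDet.detect call is left as a comment because it needs the rchardet gem loaded. Note that Ruby itself accepts cc b1 as well-formed UTF-8 (it encodes U+0331, a combining macron below), so the bytes are only "wrong" relative to the intended ñ, which would be c3 b1.

```ruby
# Rebuild the problem string from the hex bytes quoted in this issue.
bytes = [0x53, 0x65, 0xcc, 0xb1, 0x6f, 0x72, 0x61]
s = bytes.pack('C*').force_encoding(Encoding::UTF_8)

# The byte pair cc b1 decodes to a single code point, U+0331.
well_formed = s.valid_encoding?
codepoints  = s.codepoints

# The intended character, n with tilde (U+00F1), encodes to different bytes.
intended = "\u00F1".bytes

# With the gem loaded, the detection call from this report would be:
#   CharDet.detect(s)
```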

Taking a closer look at the code, I see there's a bit of logic that differs between uchardet and rchardet:
in uchardet, CharDistribution.cpp:

  if (mTotalChars <= 0 || mFreqChars <= mDataThreshold) {
    return SURE_NO;
  }

in rchardet, chardistribution.rb:

      if @totalChars <= 0
        return SURE_NO
      end

It appears that the second part of the check involving @freqChars was dropped somehow. Adding it back in makes rchardet work consistently again (plus specs pass).
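For illustration, here is a standalone sketch of what the restored guard does. It is not the actual rchardet source: the method name, the constant values (0.99, 0.01), the default threshold of 3, the ratio of 0.75 and the `check_freq` switch are all assumptions modelled loosely on the chardet family, and only `SURE_NO` and the two conditions come from the snippets above.

```ruby
# Assumed confidence bounds; not taken from rchardet itself.
SURE_YES = 0.99
SURE_NO  = 0.01

# Hypothetical stand-in for the distribution analyser's confidence method.
# check_freq: false mimics the current rchardet guard (total only);
# check_freq: true mimics the uchardet guard (total and frequent chars).
def confidence(total_chars, freq_chars, data_threshold: 3, ratio: 0.75, check_freq: true)
  return SURE_NO if total_chars <= 0
  return SURE_NO if check_freq && freq_chars <= data_threshold

  if total_chars != freq_chars
    r = freq_chars / ((total_chars - freq_chars) * ratio)
    return r if r < SURE_YES
  end
  SURE_YES
end

# One multi-byte character seen, and it happened to count as "frequent":
without_guard = confidence(1, 1, check_freq: false)
with_guard    = confidence(1, 1, check_freq: true)
```

Under these assumptions, a single frequent character is enough to reach the maximum confidence when only the total is checked, which would be consistent with the 99% figure reported above; with the threshold check in place the same input is rejected.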