aadsm/jschardet

`utf8prober` confidence function magic number "6" breaks short UTF-8 detection.

lingsamuel opened this issue · 0 comments

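// src.utf8prober.js
this.getConfidence = function() {
    var unlike = 0.99;
    // Fewer than 6 multibyte characters seen: discount the confidence,
    // halving (by ONE_CHAR_PROB) the remaining doubt per character.
    if( this._mNumOfMBChar < 6 ) {
        for( var i = 0; i < this._mNumOfMBChar; i++ ) {
            unlike *= ONE_CHAR_PROB;
        }
        return 1 - unlike;
    } else {
        // 6 or more multibyte characters: report the full 0.99.
        return unlike;
    }
};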

Because of this magic number, UTF-8 text with fewer than 6 multibyte characters always gets a reduced confidence, so it can never beat the other probers.
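
To make the effect concrete, here is a standalone sketch of the same logic that prints the confidence for 0 to 6 multibyte characters. It assumes `ONE_CHAR_PROB` is 0.5, which the snippet above does not show, and `utf8Confidence` is a hypothetical helper name, not part of the library.

```javascript
// Assumption: ONE_CHAR_PROB is 0.5; the issue does not show its value.
var ONE_CHAR_PROB = 0.5;

// Hypothetical standalone copy of the confidence logic quoted above,
// taking the multibyte character count as a parameter.
function utf8Confidence(numOfMBChar) {
    var unlike = 0.99;
    if (numOfMBChar < 6) {
        for (var i = 0; i < numOfMBChar; i++) {
            unlike *= ONE_CHAR_PROB;
        }
        return 1 - unlike;
    }
    return unlike;
}

// Print the confidence for each count from 0 to 6.
for (var n = 0; n <= 6; n++) {
    console.log(n + " multibyte chars -> " + utf8Confidence(n));
}
```

Under that assumption, 5 multibyte characters give about 0.969, while 6 jump straight to 0.99, so any prober that reports 0.99 on the same short input wins.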

A simple fix is to add a check on the ratio of multibyte characters.
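
One possible shape for such a check, as a rough sketch only: the function name, the `totalChars` parameter, and the 0.6 threshold are all hypothetical, and `ONE_CHAR_PROB` is again assumed to be 0.5. The idea is to apply the discount only when multibyte characters are both few and a small share of the input.

```javascript
// Assumption: ONE_CHAR_PROB is 0.5; the issue does not show its value.
var ONE_CHAR_PROB = 0.5;
// Hypothetical threshold: share of multibyte characters above which
// a short input is still trusted as UTF-8.
var MIN_MB_CHAR_RATIO = 0.6;

// Hypothetical variant of getConfidence that also looks at how much of
// the input consists of multibyte characters.
function utf8ConfidenceWithRatio(numOfMBChar, totalChars) {
    var unlike = 0.99;
    var mbCharRatio = totalChars > 0 ? numOfMBChar / totalChars : 0;

    // Discount only when multibyte characters are few AND sparse.
    if (numOfMBChar < 6 && mbCharRatio < MIN_MB_CHAR_RATIO) {
        for (var i = 0; i < numOfMBChar; i++) {
            unlike *= ONE_CHAR_PROB;
        }
        return 1 - unlike;
    }
    return unlike;
}

// A 3-character input made entirely of multibyte characters.
console.log(utf8ConfidenceWithRatio(3, 3));
// 3 multibyte characters scattered in 100 characters of text.
console.log(utf8ConfidenceWithRatio(3, 100));
```

With this variant, a short string made up mostly of multibyte characters reaches the full 0.99, while a long, mostly ASCII input with a few multibyte characters keeps the original discounted value.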