
Chinese detection is not good

rococode opened this issue · 4 comments

Chinese language detection is super inaccurate.

Past issues reported two problems: Chinese detection fails when English is mixed in, and short Chinese phrases are detected as Korean.

Technically, multilingual texts and short texts are not fully supported. However, the problem goes well beyond that: even long texts are entirely misidentified.

Consider the following text:


This is classified as Korean and Japanese. I use an algorithm that also classifies chunks of the text, to supplement the result for the full text.

Here is the final classification I got:

{de=2.4541738909762207E-4, ru=2.4541738909762207E-4, ko=22.324138907516115, ja=1.6743056353889012, en=2.4541738909762207E-4, it=2.4541738909762207E-4, fr=2.4541738909762207E-4, es=2.4541738909762207E-4}

Chinese is not even listed once, after the detector considers every subphrase (basically split on punctuation) in the text and the full text itself. So it's not just a specific case, but it seems that standard Chinese is almost never even selected as a possibility, much less the most likely option.
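For anyone who wants to reproduce this, here is a rough sketch of the chunk-and-aggregate approach described above: split on punctuation, classify each chunk plus the full text, and sum the per-language scores. It is not my exact code. The `detect` callable, the punctuation set, and the `fake_detect` placeholder are all assumptions for illustration; in practice `detect` would wrap whatever detector you are testing.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, List

# Chunk boundaries: ASCII plus common full-width CJK punctuation (an assumed set).
_PUNCT = re.compile(r"[,.!?;:,。!?;:、]+")


def split_chunks(text: str) -> List[str]:
    """Split text on punctuation and drop empty pieces."""
    return [c.strip() for c in _PUNCT.split(text) if c.strip()]


def aggregate_scores(text: str,
                     detect: Callable[[str], Dict[str, float]]) -> Dict[str, float]:
    """Sum per-language scores over every chunk and the full text."""
    totals: Dict[str, float] = defaultdict(float)
    for piece in split_chunks(text) + [text]:
        for lang, prob in detect(piece).items():
            totals[lang] += prob
    return dict(totals)


def fake_detect(piece: str) -> Dict[str, float]:
    """Hypothetical stand-in for a real detector, only so the sketch runs."""
    has_han = any("\u4e00" <= ch <= "\u9fff" for ch in piece)
    return {"zh": 1.0} if has_han else {"en": 1.0}


print(aggregate_scores("你好,世界。Hello", fake_detect))
```

Because the scores are summed, a language that wins on many chunks ends up with a total above 1, which is why values in the aggregated map are not probabilities.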

@stormisover points out some weird filtering, could that be the problem? #33 (comment)

Other relevant issues:

#63 #33

Not too keen on using Unicode ranges to detect languages, so unfortunately I'm going to have to stick with the Google Translate API for now :(


When running genprofile, the Chinese text may have been split by bytes rather than by characters, so the generated profiles are wrong.
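To illustrate what that hypothesis would mean (this is not confirmed to be what genprofile actually does), here is a small sketch comparing n-grams taken over characters with n-grams taken over raw UTF-8 bytes. The function names and the sample string are made up for the example.

```python
from typing import List


def char_ngrams(text: str, n: int) -> List[str]:
    """n-grams over Unicode code points, which is what a profile should count."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def byte_ngrams(text: str, n: int) -> List[str]:
    """n-grams over raw UTF-8 bytes, decoded leniently to make the damage visible."""
    data = text.encode("utf-8")
    return [data[i:i + n].decode("utf-8", errors="replace")
            for i in range(len(data) - n + 1)]


sample = "中文检测"
# 4 characters, but 12 bytes: each of these Han characters takes 3 bytes in UTF-8.
print(len(sample), len(sample.encode("utf-8")))
print(char_ngrams(sample, 2))
# Every 2-byte window cuts through a character, so none decodes cleanly.
print(byte_ngrams(sample, 2))
```

If profiles were built from byte windows like these, the n-gram counts would not correspond to real Chinese character sequences, so matching against correctly decoded input would fail.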

I've seen poor CJK detection with Optimaize too. I'm running the text in the first post of this issue through Lingua language detector for #107.

isoCode639_1 = {IsoCode639_1@8287} "zh"
isoCode639_3 = {IsoCode639_3@8288} "zho"
alphabets = {Collections$SingletonSet@8289}  size = 1
 0 = {Alphabet@8294} "HAN"
uniqueCharacters = ""
name = "CHINESE"

Very easy to use library.
You say you're using Google Translate API, so it may be possible for you to use Lingua (you're not constrained to Optimaize).

  • Not-normalized (see #86), forced short text (#63): 0.999999 zh-CN
  • Normalized, forced short text: 1.0 zh-CN
  • Not-normalized, not forcing short text: 0.9999 zh-CN (this should be your case)
  • Normalized, not forcing short text: 0.7414 zh-CN, 0.2857 ko

@rococode I actually expect you to have 99% accuracy with defaults for this text, using a standard detector.

This text is 293 chars long. #63 saw ~150 char strings (mostly Chinese, some English) get strange results if not forcing short-text language detection.