pemistahl/lingua

Chinese and Japanese word special casing is obsolete

Marcono1234 opened this issue · 1 comment

It appears 362e4e9 made parts of the Chinese and Japanese word special casing in LanguageDetector obsolete. All Chinese and Japanese (and Korean) chars are now treated as separate words, so the following check will never succeed, since a char will be considered either Chinese or Japanese, but never both:

} else if (wordLanguageCounts.containsKey(CHINESE) && wordLanguageCounts.containsKey(JAPANESE)) {
    totalLanguageCounts.incrementCounter(JAPANESE)
}
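To make the dead branch concrete, here is a minimal, hypothetical sketch (Lang and classifyChar are stand-ins for illustration, not lingua's internals): since every CJK char now forms its own single-char word, the per-word counts hold exactly one language, so the condition above can never be true.

enum class Lang { CHINESE, JAPANESE }

// Hypothetical single-char classifier standing in for lingua's script check:
// kana counts as Japanese, Han chars as Chinese.
fun classifyChar(ch: Char): Lang = when (Character.UnicodeScript.of(ch.code)) {
    Character.UnicodeScript.HIRAGANA, Character.UnicodeScript.KATAKANA -> Lang.JAPANESE
    else -> Lang.CHINESE
}

fun main() {
    for (word in listOf("番", "き")) { // each CJK char is now its own "word"
        val wordLanguageCounts = mutableMapOf(classifyChar(word[0]) to 1)
        // A one-char word yields exactly one entry, so the condition
        // "contains CHINESE and JAPANESE" can never hold per word.
        println("$word -> $wordLanguageCounts")
    }
}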

(However, the totalLanguageCounts check a few lines below is still necessary: a whole text can still mix Han chars, counted towards Chinese, with kana, counted towards Japanese, so the total counts can contain both languages.)
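A quick way to see this at the text level, assuming lingua's public API (the exact result depends on the library version):

import com.github.pemistahl.lingua.api.Language
import com.github.pemistahl.lingua.api.LanguageDetectorBuilder

fun main() {
    val detector = LanguageDetectorBuilder
        .fromLanguages(Language.CHINESE, Language.JAPANESE)
        .build()
    // "番好き" mixes a Han char (counted towards Chinese) with kana (counted
    // towards Japanese), so the total counts contain both languages and the
    // text-level rule is what resolves the tie in favor of Japanese.
    println(detector.detectLanguageOf("番好き")) // expected: JAPANESE
}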

Also, for what it is worth, these lines were probably buggy (though that does not matter much anymore):

  • It did not check whether wordLanguageCounts.size == 2, so previously a 'word' consisting of one Chinese char, one Japanese char and 99% other chars would have been counted as Japanese:
    LanguageDetectorBuilder.fromLanguages(Language.GREEK, Language.JAPANESE, Language.CHINESE).build()
      .computeLanguageConfidenceValues("Παπασταθόπουλου番き")
    Edit: Lingua still behaves this way when there is no space between the Greek and Japanese chars, but that is probably acceptable since well-formed text would probably contain a space between them. With a space between them it would detect the different languages (Chinese, Japanese, Greek); this is currently blocked by #105, so only Greek is returned.
  • It would have incremented Japanese (and eventually returned it as the detected language) even if it was not part of the requested language set (a runnable version of both snippets follows the list):
    LanguageDetectorBuilder.fromLanguages(Language.ENGLISH, Language.GERMAN).build().detectLanguageOf("番好き")
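For convenience, a self-contained version of the two snippets above; the buggy behavior described in the bullets is historical, and the actual output depends on the lingua version:

import com.github.pemistahl.lingua.api.Language
import com.github.pemistahl.lingua.api.LanguageDetectorBuilder

fun main() {
    // Bullet 1: a 'word' that is mostly Greek with one Chinese and one
    // Japanese char; the old check would have counted it as Japanese.
    val greekDetector = LanguageDetectorBuilder
        .fromLanguages(Language.GREEK, Language.JAPANESE, Language.CHINESE)
        .build()
    println(greekDetector.computeLanguageConfidenceValues("Παπασταθόπουλου番き"))

    // Bullet 2: Japanese was never requested, yet the old code could have
    // incremented its counter and returned it.
    val detector = LanguageDetectorBuilder
        .fromLanguages(Language.ENGLISH, Language.GERMAN)
        .build()
    println(detector.detectLanguageOf("番好き"))
}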

I've removed the obsolete business logic. Thanks.