Chinese and Japanese word special casing is obsolete
Marcono1234 opened this issue · 1 comment
Marcono1234 commented
It appears 362e4e9 made parts of the Chinese and Japanese word special casing in LanguageDetector
obsolete. Now every Chinese and Japanese (and Korean) char is treated as a separate word, so the following check can never succeed, because a single char is considered either Chinese or Japanese, but never both:
lingua/src/main/kotlin/com/github/pemistahl/lingua/api/LanguageDetector.kt
Lines 258 to 259 in 7e415ae
(However, the totalLanguageCounts check a few lines below is still necessary.)
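For context, the check in question had roughly the following shape. This is a paraphrased, hypothetical sketch reconstructed from the description above; the function signature, counter types and the increment are assumptions, not the verbatim code at lines 258 to 259:

```kotlin
import com.github.pemistahl.lingua.api.Language

// Paraphrased sketch of the obsolete special casing (assumed names and types).
fun countWordLanguages(
    wordLanguageCounts: Map<Language, Int>,
    totalLanguageCounts: MutableMap<Language, Int>
) {
    if (Language.CHINESE in wordLanguageCounts && Language.JAPANESE in wordLanguageCounts) {
        // A word containing both Chinese and Japanese chars was attributed to Japanese,
        // since Japanese text mixes Kanji (Han chars) with Kana. With per-char "words"
        // this branch is unreachable: a char is classified as either Chinese or Japanese.
        totalLanguageCounts[Language.JAPANESE] = (totalLanguageCounts[Language.JAPANESE] ?: 0) + 1
    }
}
```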
Also, for what it is worth, these lines were probably buggy (though that does not matter much anymore), because:
- It did not check whether wordLanguageCounts.size == 2, so previously a 'word' consisting of one Chinese char, one Japanese char and 99% other chars would have been counted as Japanese.
  Edit: Lingua still behaves this way when there is no space between the Greek and Japanese chars, but that is probably acceptable since well-formed text would most likely contain a space between them. With a space between them it would detect the different languages (Chinese, Japanese, Greek) (currently blocked by #105, so it only returns Greek):
  LanguageDetectorBuilder.fromLanguages(Language.GREEK, Language.JAPANESE, Language.CHINESE).build()
      .computeLanguageConfidenceValues("Παπασταθόπουλου番き")
- It would have incremented Japanese (and eventually returned it as the detected language) even if it was not part of the requested languages, see the runnable sketch after this list:
  LanguageDetectorBuilder.fromLanguages(Language.ENGLISH, Language.GERMAN).build()
      .detectLanguageOf("番好き")
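For completeness, the two snippets above can be run as a self-contained sketch like this; the imports and the main wrapper are added here for illustration, and the behavior noted in the comments is the one described above, not verified output:

```kotlin
import com.github.pemistahl.lingua.api.Language
import com.github.pemistahl.lingua.api.LanguageDetectorBuilder

fun main() {
    // Greek followed directly by Japanese/Chinese chars (no space): as described above,
    // the whole token is still attributed to a single language. With a space, the
    // separate languages would be detectable (currently blocked by #105, so only Greek).
    val confidences = LanguageDetectorBuilder
        .fromLanguages(Language.GREEK, Language.JAPANESE, Language.CHINESE)
        .build()
        .computeLanguageConfidenceValues("Παπασταθόπουλου番き")
    println(confidences)

    // Japanese input, but Japanese is not among the requested languages; with the old
    // special casing this could still have been reported as JAPANESE.
    val detected = LanguageDetectorBuilder
        .fromLanguages(Language.ENGLISH, Language.GERMAN)
        .build()
        .detectLanguageOf("番好き")
    println(detected)
}
```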
pemistahl commented
I've removed the obsolete business logic. Thanks.