google/cld3

Korean is detected as many different languages when some symbols are present


[1] pry(main)> identifier =  CLD3::NNetLanguageIdentifier.new(1, 2048)
=> #<CLD3::NNetLanguageIdentifier:0x0000557ad8972f68
 @cc=#<CLD3::Unstable::NNetLanguageIdentifier::Pointer address=0x0000557ad8014870>>
[2] pry(main)> identifier.find_language('안녕하세요')
=> #<struct Struct::Result language=:ko, probability=0.9999847412109375, reliable?=true, proportion=1.0>
[3] pry(main)> identifier.find_language('A: 안녕하세요')
=> #<struct Struct::Result language=:zh, probability=0.7444548606872559, reliable?=true, proportion=1.0>
[4] pry(main)> identifier.find_language('A. 안녕하세요')
=> #<struct Struct::Result language=:zh, probability=0.7444548606872559, reliable?=true, proportion=1.0>
[5] pry(main)> identifier.find_language('Q. 안녕하세요')
=> #<struct Struct::Result language=:zh, probability=0.9469994902610779, reliable?=true, proportion=1.0>
[6] pry(main)> identifier.find_language('"안녕하세요"')
=> #<struct Struct::Result language=:ko, probability=0.9999847412109375, reliable?=true, proportion=1.0>
[7] pry(main)> identifier.find_language('Q:안녕하세요')
=> #<struct Struct::Result language=:zh, probability=0.9469994902610779, reliable?=true, proportion=1.0>
[8] pry(main)> identifier.find_language('A. 코스프레?')
=> #<struct Struct::Result language=:zh, probability=0.27146071195602417, reliable?=false, proportion=1.0>
[9] pry(main)> identifier.find_language('A. 코스프레?\n마녀 하고 싶어요')
=> #<struct Struct::Result language=:ne, probability=0.9822306632995605, reliable?=true, proportion=1.0>

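Korean uses a dedicated script called Hangul (한글), so even 1-gram based detection should identify it with close to 100% accuracy. Instead, as soon as a short Latin prefix or some punctuation is added, the same text is detected as zh, ne, hi, etc., often with reliable?=true.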
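To illustrate how strong the script signal is, here is a minimal sketch in plain Ruby that measures the share of letters belonging to the Hangul script. `hangul_ratio` is a hypothetical helper written for this issue, not part of cld3 or its Ruby binding, and it only relies on Ruby's built-in `\p{Hangul}` and `\p{L}` regexp properties. It is not a proposed fix for the model, just a check of the 1-gram intuition.

```ruby
# Hypothetical helper (not a cld3 API): fraction of letters in `text`
# that belong to the Hangul script. Punctuation, digits and whitespace
# are ignored because only \p{L} (letter) characters are counted.
def hangul_ratio(text)
  letters = text.scan(/\p{L}/)
  return 0.0 if letters.empty?

  hangul = letters.count { |ch| ch.match?(/\p{Hangul}/) }
  hangul.to_f / letters.size
end

# Inputs taken from the pry session above.
samples = ['안녕하세요', 'A: 안녕하세요', 'Q. 안녕하세요', 'A. 코스프레?']

samples.each do |s|
  puts "#{s.inspect}: #{hangul_ratio(s).round(2)}"
end
```

For every input from the session the large majority of letters are Hangul, which is why a result of zh or ne looks wrong. A caller could use such a ratio as a sanity check on the cld3 result, though the threshold to apply would be an assumption to tune per use case.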