optimaize/language-detector

Wrong text detection in "no sense" text

Opened this issue · 4 comments

Hi,
I'm getting a "de" result with a score > 0.99 for text like the following:

6LSHOJDV 5LYR 8LERXSLQ 8UPDV 5DXGVHSS 0DULQH 6\VWHPV ,QVWLWXWH DW 7DOOLQQ 8QLYHUVLW\ RI 7HFKQRORJ\ $%675$&7 7KH RWKHU LPSRUWDQW HQYLURQPHQWDO DVSHFW WKDW QHHGV (QYLURQPHQWDO FRQGLWLRQV ZHUH PRQLWRUHG XVLQJ LQ VLWX FRQWLQXRXV PRQLWRULQJ DUH RLO VSLOOV ,Q WKH *XOI RI )LQODQG PHDVXUHG LQKHUHQW RSWLFDO SURSHUWLHV DQG ZDWHU VDPSOLQJ WKH SUREDELOLW\ RI RLO VSLOOV LV KLJK GXH WR WKH LQFUHDVLQJ RLO WRJHWKHU ZLWK UHPRWH VHQVLQJ LPDJHU\ 0(5,6 DQG $6$5 WUDQVSRUWDWLRQ 6HYHUDO RLO SROOXWLRQ LQFLGHQWV KDSSHQHG LQ LQ 0XXJD %D\ %DOWLF 6HD 6LPXOWDQHRXV PRQLWRULQJ XVLQJ WKH JXOI RYHU WKH ODVW GHFDGH 'LUHFW HQYLURQPHQWDO LPSDFWV GLIIHUHQW PHWKRGRORJLHV JDYH GHWDLOHG RYHUYLHZ RI RI RLO VSLOOV DIIHFW VHDELUGV DQG FRDVWDO HFRORJ\ HVSHFLDOO\ VXVSHQGHG PDWWHU 630 ORDG LQWR WKH ZDWHU FROXPQ GXULQJ ZKHQ WKH VSLOO KLWV WKH VKRUH 7R PLQLPL]H WKH QHJDWLYH HIIHFW WKH GUHGJLQJ RSHUDWLRQV 0(5,6 )56 GDWD HQDEOHG WR RI RLO SROOXWLRQ DQG WR IDFLOLWDWH IDVW DSSOLFDWLRQ RI RLO UHFHLYH WKH GLVWULEXWLRQ RI 630 RQ ZDWHU VXUIDFH 7KH FRPEDWLQJ PHWKRGV DQ HDUO\ GHWHFWLRQ RI RLO VSLOOV DW VHD LV PHDVXUHPHQWV RI LQKHUHQW RSWLFDO SURSHUWLHV UHYLOHG WKH RI D JUHDW LPSRUWDQFH 0DQ\ VWXGLHV KDYH SURYHG WKDW UDGDU SDUWLFOH FRQFHQWUDWLRQ RQ YHUWLFDO VFDOH %DFNVFDWWHULQJ IURP LPDJHV FDQ SURYLGH LQIRUPDWLRQ RQ SRVVLEOH ORFDWLRQ DQG WKH $6$5 GDWD ZDV LQ FRUUHODWLRQ ZLWK RLO SURGXFWV H[WHQW RI RLO VSLOOV > @ GHWHUPLQHG IURP ZDWHU VDPSOHV ZKHQ EDOODVW ZDWHU GLVFKDUJH &RQWLQXRXV DQG ILQH VFDOH UHPRWH VHQVLQJ LV RQH RI WKH NH\ ZDV GHWHFWHG GXULQJ ILHOG VDPSOLQJ DVSHFWV LQ PRQLWRULQJ RI 630 DQG SRVVLEOH RLO VSLOOV QHDU WKH KDUERUV (QYLVDW 0(5,6 IXOO UHVROXWLRQ GDWD 0(5,6 )56 ,QGH[ 7HUPV 0(5,6 VXVSHQGHG PDWWHU LQKHUHQW DQG (QYLVDW $6$5 LPDJHU\ LV SURYLGHG E\ (6$ GDLO\ EDVHV RSWLFDO SURSHUWLHV DQG JLYHV JRRG EDVHV IRU FRQWLQXRXV PRQLWRULQJ 7KH VFRSH RI WKH FXUUHQW VWXG\ ZDV WR HYDOXDWH WKH XVH RI ,1752'8&7,21 0(5,6 )56 GDWD IRU PRQLWRULQJ RI VXVSHQGHG PDWWHU ORDG WR WKH FRDVWDO VHD GXULQJ WKH KDUERU GUHGJLQJ (QYLVDW $6$5 2QH 
RI WKH PDLQ FKDOOHQJHV LGHQWLILHG E\ WKH (XURSHDQ 6HD GDWD ZDV XVHG WR HYDOXDWH WKH SRVVLELOLW\ WR GHWHFW WKH RLO 3RUWV 2UJDQLVDWLRQ (632 LQ LWV HQYLURQPHQWDO FRGH VSLOOV (632 ZDV WKH VXVWDLQDEOH GHYHORSPHQW RI VHD SRUWV 0(7+2'6 $FFRUGLQJ WR WKH GRFXPHQW WKH HQYLURQPHQWDO LPSDFWV FDXVHG E\ SRUW UHODWHG DFWLYLWLHV VKRXOG EH UHGXFHG > @ 7KH 7KH ILHOG PHDVXUHPHQWV RI LQKHUHQW RSWLFDO SURSHUWLHV ILUVW VWHS LV WR SURSHUO\ PDQDJH HQYLURQPHQWDO LVVXHV ZKLFK WRJHWKHU ZLWK WDNLQJ ZDWHU VDPSOHV ZHUH SHUIRUPHG LQ UHTXLUHV FRQWLQXRXV HQYLURQPHQWDO PRQLWRULQJ 5HPRWH DQG 0XXJD %D\ RQ DQG

Citing from the front page:

This software cannot handle it well when the input text is in none of the expected (and supported) languages. For example if you only load the language profiles from English and German, but the text is written in French, the program may pick the more likely one, or say it doesn't know. (An improvement would be to clearly detect that it's unlikely one of the supported languages.)

Thanks anyway for the submission. It's a good example to demonstrate a limit of the library.
I believe that it should be detectable in the n-grams that the above is not a good match. And another idea is to cross-check with real words... it doesn't contain any.
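The real-word cross-check could look roughly like the sketch below. It is not part of the library: the `COMMON_WORDS` lists and the `known_word_ratio` helper are hypothetical, and a real check would load a proper dictionary per language instead of ten hand-picked words.

```python
import re

# Tiny illustrative word lists (assumption); a real check would use a full dictionary.
COMMON_WORDS = {
    "en": {"the", "of", "and", "to", "in", "is", "that", "was", "for", "with"},
    "de": {"der", "die", "und", "das", "ist", "nicht", "mit", "den", "von", "zu"},
}

def known_word_ratio(text, lang):
    """Return the fraction of alphabetic tokens found in the word list for lang."""
    # [^\W\d_]+ matches runs of letters only (no digits, no underscore).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in COMMON_WORDS[lang])
    return hits / len(tokens)

# A fragment of the submitted text versus an ordinary German sentence.
sample = "7KH RWKHU LPSRUWDQW HQYLURQPHQWDO DVSHFW WKDW QHHGV"
print(known_word_ratio(sample, "de"))
print(known_word_ratio("Das ist nicht der Hund von der Frau", "de"))
```

A ratio near zero for the winning language would be a hint that the detection result should not be trusted, whatever score the detector reports.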

My concern is the high probability reported, which is over 99%. I would expect a low probability for text like this.
Thanks

+1. Today the probabilities of all detectable languages always add up to 100%, and the library doesn't even attempt to check if the text makes sense.
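To illustrate why normalising to 100% produces such a confident result, here is a minimal sketch. It is not the library's scoring code: the `normalize` function and the log-likelihood numbers are made up for the example. The point is only that when every language fits badly, the least bad one still ends up with nearly all of the probability mass.

```python
import math

def normalize(log_likelihoods):
    """Turn per-language log-likelihoods into probabilities that sum to 1."""
    # Subtract the maximum before exponentiating to avoid underflow.
    top = max(log_likelihoods.values())
    weights = {lang: math.exp(ll - top) for lang, ll in log_likelihoods.items()}
    total = sum(weights.values())
    return {lang: weight / total for lang, weight in weights.items()}

# Hypothetical scores: every language fits badly, German slightly less badly.
bad_fit = {"de": -950.0, "en": -960.0, "fr": -965.0}
probs = normalize(bad_fit)
print(probs)
```

Only the differences between the scores survive the normalisation, so the absolute quality of the fit is lost. A separate check on the raw score (or on something like real-word coverage) would be needed to say "none of the above".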

It would be interesting to see what's detected for a lower-case version of this text. German nouns are all capitalised, so it's plausible that German is the best match, as there are decent scores for capital-letter unigrams (more so than for any other language, at least!). All-caps bigrams and trigrams are probably quite rare in the models.
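A quick way to prepare that experiment, as a sketch only: the `uppercase_share` helper is hypothetical and just measures how unusual the casing of the input is before a lower-cased copy is fed to the detector.

```python
def uppercase_share(text):
    """Return the fraction of letters in text that are uppercase."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum(1 for c in letters if c.isupper()) / len(letters)

# A fragment of the submitted text and its lower-cased variant.
sample = "7KH RWKHU LPSRUWDQW HQYLURQPHQWDO DVSHFW"
lowered = sample.lower()

print(uppercase_share(sample))
print(uppercase_share(lowered))
print(uppercase_share("Der Hund"))
```

Every letter in the fragment is uppercase, while ordinary German text has only a small share of capitals, so comparing the detector's output for `sample` and `lowered` should show how much the capital-letter unigrams contribute.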

@andrea-bologna, @djelinski may I ask where you got texts like this from? It seems like strange input (though dictionary validation would be a nice feature).
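One observation on the sample itself: it appears to be English text in which every character code has been shifted down by a constant 29. The sketch below reverses that shift; the `shift_chars` helper is hypothetical. Where the shift came from is a guess, since the thread does not say (a text extractor that mishandles a custom font encoding would produce this kind of constant offset).

```python
def shift_chars(text, offset=29):
    """Add offset to the code point of every character except spaces and newlines."""
    return "".join(c if c in " \n" else chr(ord(c) + offset) for c in text)

# Two fragments copied from the submitted text.
print(shift_chars("$%675$&7"))
print(shift_chars("7KH RWKHU LPSRUWDQW HQYLURQPHQWDO DVSHFW"))
```

The two fragments decode to "ABSTRACT" and "The other important environmental aspect", which would explain why the input has word-like spacing but contains no real words in any supported language.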