JohnSnowLabs/spark-nlp-workshop

Wrong interpretation for Language Detection

veilupt opened this issue · 3 comments

I ran language detection on a sample text that mixes English and French.

Steps to Reproduce

Run the example Jupyter notebook: https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/annotation/english/language-detection/Language_Detection_and_Indentification.ipynb

Sample text:

Today is the anniversary of the publication of Robert Frost’s iconic poem “Stopping by Woods on a Snowy Evening,” a fact that spurred the Literary Hub office into a long conversation about their favorite poems, the most iconic poems written in English, and which poems we should all have already read (or at least be reading next). Turns out, despite frequent (false) claims that poetry is dead and/or irrelevant and/or boring, there are plenty of poems that have sunk deep into our collective consciousness as cultural icons.Demain, dès l’aube, à l’heure où blanchit la campagne,Je partirai. Vois-tu, je sais que tu m’attends.J’irai par la forêt, j’irai par la montagne.Je ne puis demeurer loin de toi plus longtemps

Result:

```
+------+
|result|
+------+
|  [en]|
+------+
```

Environment

  • Spark-NLP version: 2.5.4
  • Apache Spark version: 2.4.4
  • Operating System and version: Ubuntu 18.04 (Google VM)
  • Deployment (Docker, Jupyter, Scala, pip, conda, etc.): Jupyter

I'm sorry, the question is not clear to me. What do you expect the language result to be for the entire document?

@maziyarpanahi I would expect it to return both `en` and `fr` for mixed-language text.

It won't do that. If the input is a whole document, it takes the first 250 words and decides on a single language. If the inputs are sentences but merging is enabled, it merges the results from all sentences and detects one language for the whole document.
You need a SentenceDetector feeding into LanguageDetectorDL with merging turned off, but even then it will give you a sentence-by-sentence prediction, not just two languages.
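To make the merge-vs-per-sentence distinction concrete, here is a minimal stdlib-only sketch of the behavior described above. The stopword heuristic is a crude stand-in for LanguageDetectorDL's trained model, and all function names and stopword sets here are illustrative assumptions, not Spark NLP's actual API:

```python
# Toy illustration of "per-sentence prediction" vs. "merged" language
# detection, as described above. This is NOT Spark NLP; the stopword
# counts merely stand in for a real language-identification model.
import re
from collections import Counter

# Tiny, hypothetical stopword sets standing in for a trained model.
STOPWORDS = {
    "en": {"the", "is", "and", "of", "a", "that", "in", "we", "there"},
    "fr": {"le", "la", "je", "par", "tu", "de", "que", "ne", "plus"},
}

def detect(text: str) -> str:
    """Guess one language for `text` by counting stopword hits."""
    words = re.findall(r"[a-zàâçéèêëîïôûù']+", text.lower())
    scores = {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}
    return max(scores, key=scores.get)

def detect_per_sentence(doc: str) -> list[str]:
    """Merge OFF: one prediction per sentence."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", doc) if s.strip()]
    return [detect(s) for s in sentences]

def detect_merged(doc: str) -> str:
    """Merge ON: sentence results are collapsed into a single label."""
    return Counter(detect_per_sentence(doc)).most_common(1)[0][0]

doc = ("We know that the poem is iconic and that we read it. "
       "Je sais que tu ne viens plus par la montagne.")
print(detect_per_sentence(doc))  # one label per sentence
print(detect_merged(doc))        # a single label for the whole document
```

With merging on, the mixed document collapses to one label, which is why the notebook reports only `[en]`; only the per-sentence mode exposes both languages, and then only sentence by sentence.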