CLD2Owners/cld2

Language Detection with CLD2 on Mixed Inputs in Long Documents


Internals recap: CLD2 is a Naïve Bayes classifier trained on a corpus of about 100M web pages, both scraped and selected by human experts, using documents with a mean size of about 200 characters.

When working on long documents, for example

~3,000-4,000 words,
~40,000-50,000 characters

of mixed input text (at least 2-3 languages in the same document), I see that CLD2 fails to recognize all of the languages and reports only the most common one, as if the detection were polarized around that language. Take this document excerpt:

Only come and treat me right
And you'll never guilty
Sekarang kamu sudah ada di depanku
Aku pun berdebar menanti kata-katamu
Honey Bunny Sweety
Let's take a chance

This is recognized as English, so I get

{
  "results": [
    {
      "reliable": true,
      "detection": {
        "name": "ENGLISH",
        "code": "en",
        "percent": 54,
        "score": 930
      }
    }
  ]
}

while I would expect at least 2 languages here. Internally, CLD2 uses an n-gram decomposition of the input text, which is known to perform very well for language detection: the feature space is very compact when using bigrams, since with the 26 letters of the Latin alphabet there are at most 26^2 = 676 possible bigram features in the training set. See here for more details.
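The bigram arithmetic can be sketched in a few lines of Python. This is only an illustration, assuming ASCII letters a-z and bigrams taken within words; `char_bigrams` is a hypothetical helper and does not reproduce CLD2's actual feature extraction.

```python
from collections import Counter
import string


def char_bigrams(text):
    """Count letter bigrams inside each word, keeping ASCII a-z only."""
    counts = Counter()
    for word in text.lower().split():
        # Drop punctuation and any character outside a-z (an assumption
        # made for this sketch, to match the 26-letter estimate).
        letters = "".join(ch for ch in word if ch in string.ascii_lowercase)
        for i in range(len(letters) - 1):
            counts[letters[i:i + 2]] += 1
    return counts


# Upper bound on distinct bigrams over a 26-letter alphabet.
MAX_BIGRAMS = len(string.ascii_lowercase) ** 2

sample = "Sekarang kamu sudah ada di depanku"
bigrams = char_bigrams(sample)
print(MAX_BIGRAMS, len(bigrams), bigrams.most_common(3))
```

However long the input is, the number of distinct keys in `bigrams` can never exceed `MAX_BIGRAMS`, which is what makes the feature space compact.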

If I generate n-grams of a given size (N=2 in this case) from this document, this time I get

  {
    "count": 86,
    "code": "na",
    "name": "na",
    "mean": 0.45989304812834225
  },
  {
    "count": 50,
    "mean": 16.503352692086242,
    "code": "id",
    "name": "INDONESIAN"
  },
  {
    "count": 38,
    "mean": 12.779225483523962,
    "code": "en",
    "name": "ENGLISH"
  },
  {
    "count": 13,
    "mean": 1.5371176291771826,
    "code": "ms",
    "name": "MALAY"
  }

i.e. a more detailed detection of the mixed-input languages in this document. Of course, this detection depends on N, the size of the n-grams, so for some values of N it may produce false positives (for example, a detected language that is not actually among the mixed inputs).
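A rough sketch of this kind of per-n-gram aggregation, in Python, could look like the following. It rests on several assumptions that the post does not confirm: the n-grams are word n-grams, each fragment is classified on its own, and `mean` is the average of the per-fragment scores. The `detect` callable and `toy_detect` are hypothetical stand-ins for a real detector, not CLD2's API.

```python
from collections import defaultdict


def word_ngrams(text, n):
    """Return all runs of n consecutive whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def aggregate_by_language(text, n, detect):
    """Classify each word n-gram and summarise the results per language.

    `detect` is a hypothetical callable: fragment -> (code, name, score).
    """
    totals = defaultdict(lambda: {"name": "na", "count": 0, "score_sum": 0.0})
    for gram in word_ngrams(text, n):
        code, name, score = detect(gram)
        entry = totals[code]
        entry["name"] = name
        entry["count"] += 1
        entry["score_sum"] += score
    summary = [
        {"code": code, "name": e["name"], "count": e["count"],
         "mean": e["score_sum"] / e["count"]}
        for code, e in totals.items()
    ]
    # Most frequent language first.
    return sorted(summary, key=lambda row: row["count"], reverse=True)


def toy_detect(fragment):
    """Keyword-based stand-in used only for this demo; it is not CLD2."""
    indonesian = {"sekarang", "kamu", "sudah", "aku", "pun"}
    english = {"only", "come", "and", "treat", "me", "right"}
    words = set(fragment.lower().split())
    if words & indonesian:
        return ("id", "INDONESIAN", 1.0)
    if words & english:
        return ("en", "ENGLISH", 1.0)
    return ("na", "na", 0.0)


text = "Only come and treat me right Sekarang kamu sudah ada Honey Bunny"
summary = aggregate_by_language(text, 2, toy_detect)
for row in summary:
    print(row)
```

To try this against a real detector, `toy_detect` would be replaced by a thin wrapper around whichever CLD2 binding is in use, returning the same `(code, name, score)` tuple. Changing `n` changes how many fragments each language collects, which is where the false positives mentioned above can creep in.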

Given that CLD2 uses n-grams internally and that the Bayes classifier was trained on text ~200 characters wide (~2-3 sentences), it seems to be polarized in some way, and to give better results on mixed inputs when it is fed short fragments like this. The question is whether this reasoning is sound, and whether there is a different approach that could lead to the same results obtained here.

Originally posted here.

[UPDATE]
A good explanation of the mixed-input results was given by Dick Sites here.