landrok/language-detector

wrong language detection

FLasH3r opened this issue · 3 comments

Below are some texts, all of them English, each followed by the language this package detected for it.
Only the entries detected as en are correct (they were shown in bold in the original post).

  • Announcing the GitHub Education Classroom Report 2020 - en
  • Highlights from Game Off 2020 - en
  • How to launch a tech career in 2021 - it
  • Let’s talk about securing open source projects - tl
  • Git clone: a data-driven study on cloning behaviors - tl
  • Get up to speed with partial clone and shallow clone - it
  • GitHub joins amicus brief warning of systemic risk from private sector offensive actors - af
  • Visualizing GitHub’s global community - tl
  • How we built the GitHub globe - en
  • How to make DevOps your competitive advantage - pt

Besides running composer install ... I haven't done anything.

The text here is just an example; it comes from the GitHub blog (the titles of the last 10 posts).

If I do new \LanguageDetector\LanguageDetector(null, ['en']); it works, but that is not the goal.

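The code looks like this:

// $titles holds the ten blog post titles listed above
$languageDetector = new \LanguageDetector\LanguageDetector();

foreach ($titles as $title) {
    // getLanguage() returns the best match; casting it to string gives the language code
    $language = $languageDetector->evaluate($title)->getLanguage();

    echo $title . ' - ' . (string) $language . PHP_EOL;
}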
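
To put a number on the misdetections, the loop above could be wrapped in a small tally. This is only a sketch: tallyDetections is a hypothetical helper, not part of landrok/language-detector, and it takes the detector as a callable so it can be tried with any library. The wiring at the bottom assumes Composer's autoloader has already been loaded.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of landrok/language-detector):
 * runs $detect on every title and counts how often each language code comes back.
 */
function tallyDetections(array $titles, callable $detect): array
{
    $counts = [];
    foreach ($titles as $title) {
        // Cast to string so detectors that return an object with __toString() also work
        $code = (string) $detect($title);
        $counts[$code] = ($counts[$code] ?? 0) + 1;
    }
    arsort($counts); // most frequent language code first

    return $counts;
}

// Wiring in the detector from this issue; skipped if the package is not autoloadable.
if (class_exists(\LanguageDetector\LanguageDetector::class)) {
    $detector = new \LanguageDetector\LanguageDetector();
    $titles = [
        'How to launch a tech career in 2021',
        'How we built the GitHub globe',
    ];
    print_r(tallyDetections(
        $titles,
        fn (string $title) => $detector->evaluate($title)->getLanguage()
    ));
}
```

For the ten titles listed above, the tally would show how many were detected as en versus it, tl, af and pt.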

Looks like this suffers from the same thing as the more popular https://github.com/patrickschur/language-detection

It does a good job with long texts but is borderline useless for short sentences, getting them wrong at an alarmingly high rate.

Still looking for a reliable language detector that works well with short sentences. If anyone finds one, please share.

@vesper8 https://github.com/fntlnz/cld2-php-ext works well for my use cases, even with rather short texts. It detects all of the above cases as English.