wrong language detection

Question

FLasH3r opened this issue 4 years ago · 3 comments

I have the following titles, each followed by the language this package detected for it. All of the titles are English, so only the ones detected as en are correct.

Announcing the GitHub Education Classroom Report 2020 - en
Highlights from Game Off 2020 - en
How to launch a tech career in 2021 - it
Let’s talk about securing open source projects - tl
Git clone: a data-driven study on cloning behaviors - tl
Get up to speed with partial clone and shallow clone - it
GitHub joins amicus brief warning of systemic risk from private sector offensive actors - af
Visualizing GitHub’s global community - tl
How we built the GitHub globe - en
How to make DevOps your competitive advantage - pt

Besides running composer install ... I haven't done anything.

The text here is just an example; it's from the GitHub blog (the titles of the last 10 posts).

If I do new \LanguageDetector\LanguageDetector(null, ['en']); it works, but that is not the goal.

The code looks like this:

$languageDetector = new \LanguageDetector\LanguageDetector();

foreach ($titles as $title) {
    $language = $languageDetector->evaluate($title)->getLanguage();
    echo $title . ' - ' . (string) $language . PHP_EOL;
}
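
The restriction mentioned above does not have to be a single language. Here is a minimal sketch, assuming the constructor's second argument accepts a list of several language codes in the same way it accepts ['en']. The helper name detectTitles and the candidate list ['en', 'it', 'pt'] are hypothetical, and whether a smaller candidate set actually fixes the misdetections has not been verified here.

```php
<?php
// Sketch only: assumes the package was installed with Composer and that
// the autoloader is loaded, e.g. require __DIR__ . '/vendor/autoload.php';

/**
 * Hypothetical helper (not part of the package): maps each title to the
 * language code returned by the $detect callable.
 */
function detectTitles(array $titles, callable $detect): array
{
    $result = [];
    foreach ($titles as $title) {
        $result[$title] = (string) $detect($title);
    }
    return $result;
}

$titles = [
    'How to launch a tech career in 2021',
    'How to make DevOps your competitive advantage',
];

// Guarded so the script still runs when the package is not installed.
if (class_exists(\LanguageDetector\LanguageDetector::class)) {
    // Assumption: list only the languages you actually expect to see.
    $detector = new \LanguageDetector\LanguageDetector(null, ['en', 'it', 'pt']);
    $detect = fn (string $text) => $detector->evaluate($text)->getLanguage();

    foreach (detectTitles($titles, $detect) as $title => $language) {
        echo $title . ' - ' . $language . PHP_EOL;
    }
}
```

Passing the detector in as a callable keeps the loop independent of the package, so a different detector can be swapped in without touching the rest of the code.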

FabianoLothor commented 3 years ago

ward

Answer 1 · 2021-05-02T20:00:41.000Z

Looks like this suffers from the same problem as the more popular https://github.com/patrickschur/language-detection

It does a good job with long texts but is borderline useless for short sentences, getting them wrong at an alarmingly high rate.

Still looking for a reliable language detector that works well with short sentences. In case anyone finds one, please share.

Answer 2 · 2021-12-02T12:10:44.000Z

Still looking for a reliable language detector that works well with short sentences in case anyone finds one please share

@vesper8 https://github.com/fntlnz/cld2-php-ext works well for my use cases, even with rather short texts. It detects all of the titles above as English.