wrong language detection
FLasH3r opened this issue · 3 comments
I have the following texts (all English), each shown with the language this package detected.
Only the lines detected as en (bold in the original) are correct.
- Announcing the GitHub Education Classroom Report 2020 - en
- Highlights from Game Off 2020 - en
- How to launch a tech career in 2021 - it
- Let’s talk about securing open source projects - tl
- Git clone: a data-driven study on cloning behaviors - tl
- Get up to speed with partial clone and shallow clone - it
- GitHub joins amicus brief warning of systemic risk from private sector offensive actors - af
- Visualizing GitHub’s global community - tl
- How we built the GitHub globe - en
- How to make DevOps your competitive advantage - pt
Besides running composer install ..., I haven't done anything else.
The text here is just an example; these are the titles of the last 10 posts on the GitHub blog.
If I do new \LanguageDetector\LanguageDetector(null, ['en']);
it works, but that is not the goal.
The code looks like this:
    $languageDetector = new \LanguageDetector\LanguageDetector();
    foreach ($titles as $title) {
        $languages = $languageDetector->evaluate($title)->getLanguage();
        echo $title . ' - ' . (string) $languages . PHP_EOL;
    }
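For anyone reproducing this, here is a minimal sketch of a middle ground between the unrestricted detector and forcing ['en']: pass a short list of languages you actually expect. It reuses only the constructor form and calls shown above. The vendor/autoload.php path, the sample titles array, and the candidate list ('en', 'de', 'fr', 'es') are assumptions for illustration, and I have not verified that this fixes every title.

```php
<?php
// Assumption: the package was installed with Composer in this directory.
require __DIR__ . '/vendor/autoload.php';

// Two of the titles from the list above that were misdetected.
$titles = [
    'How to launch a tech career in 2021',
    'Visualizing GitHub’s global community',
];

// Assumption: the content is known to be in one of these languages.
// The second constructor argument is the same one used with ['en'] above.
$candidates = ['en', 'de', 'fr', 'es'];
$languageDetector = new \LanguageDetector\LanguageDetector(null, $candidates);

foreach ($titles as $title) {
    $language = $languageDetector->evaluate($title)->getLanguage();
    echo $title . ' - ' . (string) $language . PHP_EOL;
}
```

A smaller candidate list leaves fewer look-alike languages (such as tl or af) for a short title to be confused with, but it only helps when you know the set of languages in advance.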
Looks like this suffers from the same problem as the more popular https://github.com/patrickschur/language-detection
It does a good job with long texts but is borderline useless for short sentences, getting them wrong at an alarmingly high rate.
I'm still looking for a reliable language detector that works well with short sentences. If anyone finds one, please share.
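A rough sketch of why short sentences are hard, assuming (as far as I know) that both libraries compare character n-grams against per-language profiles: a title simply yields very few n-grams to compare. The trigrams() helper below is hypothetical and written for illustration only; it is not part of either library, and it needs the mbstring extension.

```php
<?php
// Hypothetical helper, not part of any library:
// split a string into overlapping character trigrams.
function trigrams(string $text): array
{
    $text = mb_strtolower(trim($text));
    $grams = [];
    // A string of length n has n - 2 trigrams (none if n < 3).
    for ($i = 0, $n = mb_strlen($text) - 2; $i < $n; $i++) {
        $grams[] = mb_substr($text, $i, 3);
    }
    return $grams;
}

$short = 'How we built the GitHub globe';
$long  = str_repeat($short . '. ', 20); // stand-in for a longer text

// Compare how much evidence each input gives a detector to work with.
echo 'short: ' . count(trigrams($short)) . ' trigrams' . PHP_EOL;
echo 'long:  ' . count(trigrams($long)) . ' trigrams' . PHP_EOL;
```

With only a couple of dozen trigrams, a few that happen to be common in another language can be enough to tip the result.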
> Still looking for a reliable language detector that works well with short sentences in case anyone finds one please share
@vesper8 https://github.com/fntlnz/cld2-php-ext works well for my use cases, even with rather short texts. It detects all of the above cases as English.