optimaize/language-detector

Not able to detect English and Chinese?

Closed this issue · 2 comments

I happened to find this powerful tool and gave it a quick try.
Unfortunately, it isn't working out well.
Below is the relevant code snippet; can anyone tell me why?

    <dependency>
        <groupId>com.optimaize.languagedetector</groupId>
        <artifactId>language-detector</artifactId>
        <version>0.5</version>
    </dependency>

    import com.google.common.base.Optional;
    import com.optimaize.langdetect.LanguageDetector;
    import com.optimaize.langdetect.LanguageDetectorBuilder;
    import com.optimaize.langdetect.i18n.LdLocale;
    import com.optimaize.langdetect.ngram.NgramExtractors;
    import com.optimaize.langdetect.profiles.LanguageProfile;
    import com.optimaize.langdetect.profiles.LanguageProfileReader;
    import com.optimaize.langdetect.text.CommonTextObjectFactories;
    import com.optimaize.langdetect.text.TextObject;
    import com.optimaize.langdetect.text.TextObjectFactory;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    import java.io.IOException;
    import java.util.List;

    public class LanuageDetector {

        private static final Logger log = LoggerFactory.getLogger(LanuageDetector.class);

        // load all built-in language profiles:
        static List<LanguageProfile> languageProfiles;
        static {
            try {
                languageProfiles = new LanguageProfileReader().readAllBuiltIn();
            } catch (IOException e) {
                log.error("Exception when loading language profiles", e);
            }
        }

        // build the language detector:
        static LanguageDetector languageDetector = LanguageDetectorBuilder
                .create(NgramExtractors.standard())
                .withProfiles(languageProfiles)
                .build();

        // create a text object factory:
        static TextObjectFactory textObjectFactory = CommonTextObjectFactories.forDetectingOnLargeText();

        public static String detectLang(String text) {
            TextObject textObject = textObjectFactory.forText(text);
            Optional<LdLocale> lang = languageDetector.detect(textObject);
            LdLocale locale = lang.orNull();
            return locale == null ? null : locale.getLanguage();
        }

        public static void main(String[] args) {
            String english = "I am English";
            String chinese = "我是简体中文";
            String hindi = "मैं हिन्दी हूं";
            System.out.println(detectLang(english));
            System.out.println(detectLang(chinese));
            System.out.println(detectLang(hindi));
        }
    }

Hi, has the problem been solved?

These are very short text samples. The detector works best with texts longer than about 200 characters. You can sometimes get good results with short snippets, but that isn't guaranteed at scale.
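As a sketch of how to get more out of short inputs (assuming the 0.5 API as used above; `ShortTextDemo` is just an illustrative class name): the library also ships a factory tuned for short, clean text, `CommonTextObjectFactories.forDetectingShortCleanText()`, and you can call `getProbabilities` to see all candidate languages with their confidences instead of relying on `detect`'s single, threshold-gated best guess:

```java
import com.optimaize.langdetect.DetectedLanguage;
import com.optimaize.langdetect.LanguageDetector;
import com.optimaize.langdetect.LanguageDetectorBuilder;
import com.optimaize.langdetect.ngram.NgramExtractors;
import com.optimaize.langdetect.profiles.LanguageProfile;
import com.optimaize.langdetect.profiles.LanguageProfileReader;
import com.optimaize.langdetect.text.CommonTextObjectFactories;
import com.optimaize.langdetect.text.TextObject;
import com.optimaize.langdetect.text.TextObjectFactory;

import java.io.IOException;
import java.util.List;

public class ShortTextDemo {
    public static void main(String[] args) throws IOException {
        List<LanguageProfile> profiles = new LanguageProfileReader().readAllBuiltIn();
        LanguageDetector detector = LanguageDetectorBuilder
                .create(NgramExtractors.standard())
                .withProfiles(profiles)
                .build();

        // Factory tuned for short, clean input instead of large text:
        TextObjectFactory factory = CommonTextObjectFactories.forDetectingShortCleanText();

        TextObject text = factory.forText("我是简体中文");
        // Inspect every candidate with its probability rather than
        // taking only the single best (possibly absent) answer:
        for (DetectedLanguage candidate : detector.getProbabilities(text)) {
            System.out.println(candidate.getLocale() + " " + candidate.getProbability());
        }
    }
}
```

With short strings like the ones in the question, `detect` may return absent even when `getProbabilities` still ranks the right language first, just below the confidence threshold.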