optimaize/language-detector

getLanguages() and getShortTextLanguages() need documentation

Closed · 2 comments

Hi!

This file contains two methods, getLanguages() and getShortTextLanguages():
https://github.com/optimaize/language-detector/blob/master/src/main/java/com/optimaize/langdetect/profiles/BuiltInLanguages.java

What's the difference between a language and a short text language?

Updated Javadocs would be nice, plus an answer right here of course :).

Regards /Johan

The "short" part refers to the length of the text being analyzed. @shuyo generated those profiles using Twitter text as training data, where tweets are limited to 140 characters, so they reflect the style of text used on Twitter, with many abbreviations and a minimal writing style. The regular profiles, on the other hand, were generated using text from Wikipedia abstracts.

Added Javadoc:

/**
 * Returns the languages for which the library provides full profiles.
 * Full profiles are generated from regular text, usually Wikipedia abstracts.
 * @return an immutable list of the languages that have full profiles
 */
public static List<LdLocale> getLanguages() {
    return languages;
}

/**
 * Returns the languages for which the library provides profiles created from short text.
 * Twitter was used as source by @shuyo.
 * Far fewer languages have short text profiles as of now.
 * @return an immutable list of the languages that have short text profiles
 */
public static List<String> getShortTextLanguages() {
    return shortTextLanguages;
}
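
To make the distinction concrete, here is a minimal sketch of how a caller might decide which kind of profile to load based on input length. It does not use the library's API: `ProfileChoice`, `ProfileSet`, `choose`, and the `SHORT_TEXT_MAX_CHARS` cutoff are all hypothetical names made up for illustration, and the 140-character threshold is only an assumption borrowed from the tweet limit mentioned above, not a value the library prescribes.

```java
public class ProfileChoice {

    /** Hypothetical cutoff, borrowed from the 140-character tweet limit (an assumption). */
    static final int SHORT_TEXT_MAX_CHARS = 140;

    /** The two kinds of built-in profiles discussed in this issue. */
    enum ProfileSet { FULL, SHORT_TEXT }

    /**
     * Picks a profile set from the input length alone.
     * Counts Unicode code points so that surrogate pairs are not counted twice.
     */
    static ProfileSet choose(String text) {
        int length = text.codePointCount(0, text.length());
        return length <= SHORT_TEXT_MAX_CHARS ? ProfileSet.SHORT_TEXT : ProfileSet.FULL;
    }

    public static void main(String[] args) {
        // A tweet-like input falls under the cutoff.
        System.out.println(choose("omg c u l8r 2nite"));

        // Build an input longer than the cutoff to exercise the other branch.
        StringBuilder longText = new StringBuilder();
        for (int i = 0; i < 30; i++) {
            longText.append("word ");
        }
        System.out.println(choose(longText.toString()));
    }
}
```

Since far fewer languages have short text profiles, a real caller would probably also want to fall back to the full profiles when the languages it cares about are missing from getShortTextLanguages().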