LangChecker is a Java implementation of a well-known approach to detecting text typed in the wrong keyboard layout. It supports Russian and English. The approach is based on n-grams: it relies on vocabularies of letter combinations that do not occur in a language. The algorithm is only as accurate as the vocabularies it is built on; measured accuracy is reported in the Tests section below.
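To make the n-gram idea concrete, here is a minimal sketch. It is not LangChecker's actual code: the class `NgramSketch`, the method `looksWrong`, and the four-word toy vocabulary are hypothetical and exist only for illustration. The sketch also assumes that any n-gram absent from the vocabulary is a nonexistent combination, whereas the real project ships prebuilt vocabularies.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of n-gram checking; not part of LangChecker's API.
final class NgramSketch {
    private final int n;
    private final Set<String> seen = new HashSet<>();

    // Collects every n-gram that occurs in the given vocabulary.
    NgramSketch(int n, List<String> vocabulary) {
        this.n = n;
        for (String word : vocabulary) {
            for (int i = 0; i + n <= word.length(); i++) {
                seen.add(word.substring(i, i + n));
            }
        }
    }

    // A word looks wrong if it contains at least one n-gram never seen in the vocabulary.
    boolean looksWrong(String word) {
        for (int i = 0; i + n <= word.length(); i++) {
            if (!seen.contains(word.substring(i, i + n))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Toy vocabulary (an assumption); real vocabularies hold tens of thousands of words.
        NgramSketch en = new NgramSketch(2, Arrays.asList("hello", "world", "help", "word"));
        System.out.println(en.looksWrong("held"));   // false: "he", "el", "ld" all occur in the vocabulary
        System.out.println(en.looksWrong("ghbdtn")); // true: "gh" never occurs in the vocabulary
    }
}
```

With such a small vocabulary most real English words would be flagged, which illustrates why the quality of the vocabularies determines the accuracy of the check.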
LangChecker is implemented as a tokenizer because some letters in the Russian layout share keys with separators in the English layout. For example, ыендубьгышс --> style,music and cj,snbt --> событие: the comma key produces б in the Russian layout. LangChecker can check whole phrases, not only single words.
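The key-sharing problem can be reproduced with a small layout converter. This is a sketch under stated assumptions, not the project's implementation: the class `LayoutSketch` and its methods are hypothetical, it assumes the standard QWERTY and ЙЦУКЕН layouts, and it handles lowercase characters only.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of layout conversion; not part of LangChecker's API.
final class LayoutSketch {
    // The same physical keys, in the same order, in both layouts (lowercase only).
    private static final String EN = "qwertyuiop[]asdfghjkl;'zxcvbnm,.";
    private static final String RU = "йцукенгшщзхъфывапролджэячсмитьбю";

    private static final Map<Character, Character> EN_TO_RU = new HashMap<>();
    private static final Map<Character, Character> RU_TO_EN = new HashMap<>();

    static {
        for (int i = 0; i < EN.length(); i++) {
            EN_TO_RU.put(EN.charAt(i), RU.charAt(i));
            RU_TO_EN.put(RU.charAt(i), EN.charAt(i));
        }
    }

    static String enToRu(String text) {
        return convert(text, EN_TO_RU);
    }

    static String ruToEn(String text) {
        return convert(text, RU_TO_EN);
    }

    // Characters without a mapping (spaces, digits, uppercase) pass through unchanged.
    private static String convert(String text, Map<Character, Character> table) {
        StringBuilder out = new StringBuilder(text.length());
        for (char c : text.toCharArray()) {
            out.append(table.getOrDefault(c, c));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(enToRu("cj,snbt"));     // событие: the comma is a letter here
        System.out.println(ruToEn("ыендубьгышс")); // style,music: б becomes a separator
    }
}
```

Because a comma can turn out to be either a separator or the letter б, splitting the input on punctuation before choosing the layout would cut событие in two, which is the motivation for doing the check inside the tokenizer.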
The implementation depends on Immutables.org.
Tokenizer tokenizer = LangSwitcherTokenizer.create();
System.out.println(tokenizer.tokenize("hello word руддщ цщкв"));
System.out.println(tokenizer.tokenize("примет мир ghbdtn vbh"));
The `tokenize(String input)` method returns a `TokenizerResponse` instance. It contains the original phrase, the corrected phrase, and a list of tokens (the parts of the phrase that were recognized as words).
This test shows how well the algorithm distinguishes correct words from words typed in the wrong layout. Vocabularies of 109582 English and 92453 Russian words were used for the tests.
Result | EN | RU | Meaning
---|---|---|---
positive | 99.97% | 99.99% | share of correct words that were recognized as correct
false negative | 0.03% | 0.01% | share of correct words that were recognized as wrong
negative | 98.26% | 97.81% | share of wrong words that were recognized as wrong
false positive | 1.74% | 2.19% | share of wrong words that were recognized as correct
Correct words are words from the vocabulary; wrong words are words from the vocabulary typed in the wrong keyboard layout.