LangChecker

LangChecker is a Java implementation of a well-known approach to detecting text typed in the wrong keyboard layout. Russian and English are supported. The approach is based on n-grams: it relies on vocabularies of letter combinations that do not occur in a language. The algorithm is only as accurate as the vocabularies it is built on; accuracy results are given in the Tests section below.
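The idea of checking letter combinations can be illustrated with a minimal sketch. This is not LangChecker's actual code: the class BigramSketch, its method looksWrong, the choice of bigrams (n = 2), and the toy word list are all assumptions made for illustration. Instead of storing a list of nonexistent combinations, the sketch collects the bigrams seen in a small word list and treats every other bigram as nonexistent.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration only; not part of LangChecker. */
final class BigramSketch {
    // Bigrams observed in the word list; any other bigram counts as "nonexistent".
    private final Set<String> known = new HashSet<>();

    BigramSketch(List<String> vocabulary) {
        for (String word : vocabulary) {
            for (int i = 0; i + 2 <= word.length(); i++) {
                known.add(word.substring(i, i + 2));
            }
        }
    }

    /** Returns true if the word contains at least one bigram never seen in the word list. */
    boolean looksWrong(String word) {
        for (int i = 0; i + 2 <= word.length(); i++) {
            if (!known.contains(word.substring(i, i + 2))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Toy vocabulary (assumption); a real one would hold many thousands of words.
        BigramSketch english = new BigramSketch(
                Arrays.asList("hello", "world", "word", "style", "music"));
        System.out.println(english.looksWrong("hello"));  // false
        System.out.println(english.looksWrong("ghbdtn")); // true: "gh" was never seen
    }
}
```

With this toy list, a word that is not in the list but uses only familiar bigrams (for example lord) also passes, while ghbdtn (привет typed in the English layout) is flagged. This is why the quality of the vocabulary matters so much for accuracy.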

LangChecker is implemented as a tokenizer because some letters in the Russian layout share keys with separator characters in the English layout (for example: ыендубьгышс --> style,music; cj,snbt --> событие). LangChecker can check whole phrases, not only single words.
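The separator problem becomes visible once the key mapping is written down. The following is a minimal sketch, assuming the standard QWERTY and ЙЦУКЕН layouts and lowercase input only; the class LayoutMapSketch and its methods toRussian and toEnglish are hypothetical names and not part of LangChecker's API.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration only; not part of LangChecker. */
final class LayoutMapSketch {
    // Characters on the same physical keys, lowercase only (assumed standard layouts).
    private static final String EN = "qwertyuiop[]asdfghjkl;'zxcvbnm,.`";
    private static final String RU = "йцукенгшщзхъфывапролджэячсмитьбюё";

    private static final Map<Character, Character> EN_TO_RU = new HashMap<>();
    private static final Map<Character, Character> RU_TO_EN = new HashMap<>();

    static {
        for (int i = 0; i < EN.length(); i++) {
            EN_TO_RU.put(EN.charAt(i), RU.charAt(i));
            RU_TO_EN.put(RU.charAt(i), EN.charAt(i));
        }
    }

    /** Re-reads text typed in the English layout as if the Russian layout had been active. */
    static String toRussian(String text) {
        return remap(text, EN_TO_RU);
    }

    /** Re-reads text typed in the Russian layout as if the English layout had been active. */
    static String toEnglish(String text) {
        return remap(text, RU_TO_EN);
    }

    // Characters without a mapping (spaces, digits, uppercase) are copied unchanged.
    private static String remap(String text, Map<Character, Character> table) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            char mapped = table.getOrDefault(c, c);
            out.append(mapped);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toRussian("cj,snbt"));     // событие
        System.out.println(toEnglish("ыендубьгышс")); // style,music
    }
}
```

In cj,snbt the comma stands for the letter б, so splitting the input on punctuation before deciding on the layout would cut the word событие in two. In the other direction, the single Russian-layout string ыендубьгышс turns into two English words separated by a comma.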

The implementation depends on Immutables.org.

Usage

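// Create a tokenizer and check two phrases that mix correct and wrong-layout words.
Tokenizer tokenizer = LangSwitcherTokenizer.create();
// "руддщ цщкв" is "hello word" typed in the Russian layout.
System.out.println(tokenizer.tokenize("hello word руддщ цщкв"));
// "ghbdtn vbh" is "привет мир" typed in the English layout.
System.out.println(tokenizer.tokenize("примет мир ghbdtn vbh"));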

The tokenize(String input) method returns an instance of TokenizerResponse. It contains the original phrase, the corrected phrase, and a list of tokens (the parts of the phrase that were recognized as words).

Tests

This test shows how well the algorithm distinguishes correct words from wrong ones. Vocabularies of 109582 English and 92453 Russian words were used for the tests.

Result	EN	RU	Meaning
positive	99.97%	99.99%	share of correct words that were recognized as correct
false negative	0.03%	0.01%	share of correct words that were recognized as wrong
negative	98.26%	97.81%	share of wrong words that were recognized as wrong
false positive	1.74%	2.19%	share of wrong words that were recognized as correct

Correct words are words from the vocabulary; wrong words are vocabulary words typed in the wrong keyboard layout.
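The four rates can be computed with a few lines of code. This is a minimal sketch and not the project's actual test harness: the class RateSketch, the method percent, and the predicate standing in for the checker are assumptions made for illustration.

```java
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical illustration only; not part of LangChecker. */
final class RateSketch {
    /** Percentage of words for which the predicate holds (0.0 for an empty list). */
    static double percent(List<String> words, Predicate<String> predicate) {
        if (words.isEmpty()) {
            return 0.0;
        }
        long hits = words.stream().filter(predicate).count();
        return 100.0 * hits / words.size();
    }
}
```

Given a predicate isCorrect that wraps the checker, the positive rate would be percent(correctWords, isCorrect) and the negative rate would be percent(wrongWords, isCorrect.negate()). The false negative and false positive rates are the remainders to 100%, which matches the table: 99.97% + 0.03% and 98.26% + 1.74% each add up to 100%.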

Licence

Apache License, Version 2.0
