word-checker

🇨🇳🇬🇧 Chinese and English word spelling corrector. (Detects commonly miswritten Chinese characters, checks and corrects Chinese spelling, and validates English word spelling.)

Primary language: Java. License: Apache License 2.0 (Apache-2.0).

Project Description

Chinese documentation

This project is a spell checker for words.

It supports spelling detection for both English words and Chinese text.


Feature description

Support English word correction

  • 1000x faster spelling correction and fuzzy search, based on the Symmetric Delete spelling correction algorithm

  • Quickly determines whether a word is misspelled

  • Returns the best matching correction

  • Returns a list of matching corrections, with support for limiting the size of the list

  • Error messages support i18n

  • Handles uppercase and lowercase, as well as full-width and half-width characters

  • Supports a custom dictionary
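The Symmetric Delete idea named in the list above can be illustrated with a minimal sketch. This is not the library's implementation: the class name `SymmetricDeleteSketch` and its methods are hypothetical, and the sketch only generates single-character deletions. The point is that both the dictionary words and the input are reduced to their delete variants, so a lookup is a few hash-map reads rather than a scan over the whole dictionary.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative Symmetric Delete index using single deletions only (hypothetical, not the library's code). */
final class SymmetricDeleteSketch {
    // Maps every dictionary word and each of its one-character deletions to the source words.
    private final Map<String, Set<String>> index = new HashMap<>();

    SymmetricDeleteSketch(List<String> dictionary) {
        for (String word : dictionary) {
            for (String key : deletes(word)) {
                Set<String> words = index.get(key);
                if (words == null) {
                    words = new LinkedHashSet<>();
                    index.put(key, words);
                }
                words.add(word);
            }
        }
    }

    /** Returns the term itself plus every string obtained by deleting exactly one character. */
    static Set<String> deletes(String term) {
        Set<String> result = new LinkedHashSet<>();
        result.add(term);
        for (int i = 0; i < term.length(); i++) {
            result.add(term.substring(0, i) + term.substring(i + 1));
        }
        return result;
    }

    /** Collects dictionary words that share at least one delete variant with the input. */
    Set<String> candidates(String input) {
        Set<String> found = new LinkedHashSet<>();
        for (String key : deletes(input)) {
            Set<String> words = index.get(key);
            if (words != null) {
                found.addAll(words);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        SymmetricDeleteSketch sketch =
                new SymmetricDeleteSketch(Arrays.asList("good", "goon", "hello"));
        System.out.println(sketch.candidates("goo")); // prints [good, goon]
    }
}
```

Because deletions are applied on both sides, the candidates also cover substitutions and transpositions (for example, "gaod" and "good" share the variant "god"). A fuller implementation would typically verify each candidate with a real edit-distance check and rank the survivors by word frequency.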

Support basic Chinese spelling check

Change log

See the project's change log.

Quick start

JDK version

JDK 1.7+

Maven dependency

<dependency>
    <groupId>com.github.houbb</groupId>
    <artifactId>word-checker</artifactId>
    <version>0.1.0</version>
</dependency>

Test Case

Given an input word, the best correction is returned automatically.

final String speling = "speling";
Assert.assertEquals("selling", EnWordCheckers.correct(speling));

Core API introduction

The core API is provided by the EnWordCheckers utility class.

| Function | Method | Parameters | Return value | Remarks |
|:---|:---|:---|:---|:---|
| Determine whether the word is spelled correctly | isCorrect(string) | The word to check | boolean | |
| Return the best correction | correct(string) | The word to check | String | Returns the input itself if no correction is found |
| Return the list of corrections | correctList(string) | The word to check | List<String> | Returns all matching corrections |
| Return a list of corrections with a size limit | correctList(string, int limit) | The word to check, the maximum list size | List<String> | List size <= limit |
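To make the contract in the table concrete, here is a minimal self-contained sketch of a checker that follows the same rules: `correct` falls back to the input when nothing matches, and the limited list never exceeds `limit`. The class `TinyChecker`, its frequency map, and the edit-distance-1 matching rule are assumptions made for illustration; EnWordCheckers itself may select and rank candidates differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical checker that mirrors the API contract above; not the library's implementation. */
final class TinyChecker {
    private final Map<String, Integer> frequencies; // word -> occurrence count

    TinyChecker(Map<String, Integer> frequencies) {
        this.frequencies = frequencies;
    }

    boolean isCorrect(String word) {
        return frequencies.containsKey(word);
    }

    /** Returns the best correction, or the input itself when no correction is found. */
    String correct(String word) {
        if (isCorrect(word)) {
            return word;
        }
        List<String> list = correctList(word, 1);
        return list.isEmpty() ? word : list.get(0);
    }

    /** Returns at most limit known words within edit distance 1, most frequent first. */
    List<String> correctList(String word, int limit) {
        List<String> matches = new ArrayList<>();
        for (String known : frequencies.keySet()) {
            if (distance(word, known) <= 1) {
                matches.add(known);
            }
        }
        Collections.sort(matches, new Comparator<String>() {
            @Override
            public int compare(String a, String b) {
                return Integer.compare(frequencies.get(b), frequencies.get(a));
            }
        });
        return matches.subList(0, Math.max(0, Math.min(limit, matches.size())));
    }

    /** Classic Levenshtein distance computed with two rolling rows. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = cur;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        Map<String, Integer> words = new LinkedHashMap<>();
        words.put("good", 50);
        words.put("goo", 5);
        words.put("goon", 3);
        words.put("hello", 40);
        TinyChecker checker = new TinyChecker(words);
        System.out.println(checker.correct("helo"));        // prints hello
        System.out.println(checker.correct("zzzz"));        // prints zzzz (no correction found)
        System.out.println(checker.correctList("goo", 2));  // prints [good, goo]
    }
}
```

Ranking by frequency is why a known word such as "goo" can appear behind a more common neighbor in the list, while `correct` still returns a known word unchanged.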

Test example

See EnWordCheckerTest.java

Is the spelling correct?

final String hello = "hello";
final String speling = "speling";
Assert.assertTrue(EnWordCheckers.isCorrect(hello));
Assert.assertFalse(EnWordCheckers.isCorrect(speling));

Return the best match result

final String hello = "hello";
final String speling = "speling";
Assert.assertEquals("hello", EnWordCheckers.correct(hello));
Assert.assertEquals("selling", EnWordCheckers.correct(speling));

Return the correction list (default size)

final String word = "goo";
List<String> stringList = EnWordCheckers.correctList(word);
Assert.assertEquals("[good, goo, goon, goof, gobo, gook, goop]", stringList.toString());

Limit the size of the correction list

final String word = "goo";
final int limit = 2;
List<String> stringList = EnWordCheckers.correctList(word, limit);
Assert.assertEquals("[good, goo]", stringList.toString());

Chinese spelling correction

Core API

To keep the learning cost low, the core API of ZhWordCheckers is consistent with the API for English spelling detection.

Is the spelling correct?

final String right = "正确";
final String error = "万变不离其中";

Assert.assertTrue(ZhWordCheckers.isCorrect(right));
Assert.assertFalse(ZhWordCheckers.isCorrect(error));

Return the best match result

final String right = "正确";
final String error = "万变不离其中";

Assert.assertEquals("正确", ZhWordCheckers.correct(right));
Assert.assertEquals("万变不离其宗", ZhWordCheckers.correct(error));

Return the correction list (default size)

final String word = "万变不离其中";

List<String> stringList = ZhWordCheckers.correctList(word);
Assert.assertEquals("[万变不离其宗]", stringList.toString());

Limit the size of the correction list

final String word = "万变不离其中";
final int limit = 1;

List<String> stringList = ZhWordCheckers.correctList(word, limit);
Assert.assertEquals("[万变不离其宗]", stringList.toString());

Formatting

User input comes in many forms, so this tool normalizes the formatting of the input before checking it.

Case

Uppercase letters are converted to lowercase.

final String word = "stRing";

Assert.assertTrue(EnWordCheckers.isCorrect(word));

Full-width and half-width

Full-width characters are converted to half-width.

final String word = "ｓｔｒｉｎｇ";

Assert.assertTrue(EnWordCheckers.isCorrect(word));
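The two formatting rules above can be sketched as a single normalization pass. This is an illustrative sketch, not the library's internal code; the class name `FormatSketch` is hypothetical. It relies on the fact that the full-width forms U+FF01 to U+FF5E sit at a fixed offset of 0xFEE0 from their ASCII counterparts, and that the ideographic space U+3000 corresponds to the ASCII space.

```java
/** Illustrative input normalization (hypothetical; the library's formatting code may differ). */
final class FormatSketch {
    /** Converts full-width ASCII variants to half-width and lowercases the result. */
    static String normalize(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '\u3000') {
                c = ' '; // ideographic space becomes an ASCII space
            } else if (c >= '\uFF01' && c <= '\uFF5E') {
                c = (char) (c - 0xFEE0); // full-width form becomes its ASCII counterpart
            }
            sb.append(Character.toLowerCase(c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("stRing")); // prints string
        // The next literal spells "stRing" in full-width characters.
        System.out.println(normalize("\uFF53\uFF54\uFF32\uFF49\uFF4E\uFF47")); // prints string
    }
}
```

Normalizing once, before the dictionary lookup, means the dictionary only needs to store lowercase half-width entries.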

Custom English dictionary

File configuration

You can create the file resources/data/define_word_checker_en.txt in the project's resource directory.

The content is as follows:

my-long-long-define-word,2
my-long-long-define-word-two

Each word goes on its own line.

The first column of each line is the word and the second column is its number of occurrences, separated by a comma (,).

The higher the count, the higher the word's priority among the returned corrections. If the count is omitted, it defaults to 1.

The user-defined dictionary takes priority over the built-in dictionary.
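The line format described above can be read with a few lines of code. The sketch below is only an illustration of the format, under the assumption that the count follows the last comma on the line; the class `EnDictLineSketch` is hypothetical and is not how the library loads the file.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative parser for "word,count" lines (hypothetical; not the library's loader). */
final class EnDictLineSketch {
    /** Parses each non-blank line into a word and its count; a missing count defaults to 1. */
    static Map<String, Integer> parse(List<String> lines) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // ignore blank lines
            }
            int comma = line.lastIndexOf(',');
            String word = comma < 0 ? line : line.substring(0, comma).trim();
            String countText = comma < 0 ? "" : line.substring(comma + 1).trim();
            int count = countText.isEmpty() ? 1 : Integer.parseInt(countText);
            result.put(word, count);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "my-long-long-define-word,2",
                "my-long-long-define-word-two");
        // prints {my-long-long-define-word=2, my-long-long-define-word-two=1}
        System.out.println(parse(lines));
    }
}
```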

Test code

Once the words are defined in the file, the spell checker treats them as correct.

final String word = "my-long-long-define-word";
final String word2 = "my-long-long-define-word-two";

Assert.assertTrue(EnWordCheckers.isCorrect(word));
Assert.assertTrue(EnWordCheckers.isCorrect(word2));

Custom Chinese dictionary

File configuration

You can create the file resources/data/define_word_checker_zh.txt in the project's resource directory.

The content is as follows:

默守成规 墨守成规

Each line contains the misspelled phrase followed by the correct phrase, separated by a half-width (ASCII) space.
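A minimal sketch of how such "wrong correct" pairs could be parsed and applied is shown below. The class `ZhPairSketch` and its plain string replacement are assumptions made for illustration; ZhWordCheckers may match and correct phrases differently. The Chinese phrases appear as Unicode escapes so that the source stays ASCII-only.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative handling of "wrong correct" pairs (hypothetical; not the library's code). */
final class ZhPairSketch {
    /** Parses lines of the form "wrong correct", split at the first ASCII space. */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> pairs = new LinkedHashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            int space = line.indexOf(' ');
            if (space <= 0) {
                continue; // skip blank or malformed lines
            }
            pairs.put(line.substring(0, space), line.substring(space + 1).trim());
        }
        return pairs;
    }

    /** Replaces every known wrong phrase in the text with its correction. */
    static String correct(String text, Map<String, String> pairs) {
        String result = text;
        for (Map.Entry<String, String> pair : pairs.entrySet()) {
            result = result.replace(pair.getKey(), pair.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // The escapes spell the example line from the file: wrong phrase, space, correct phrase.
        Map<String, String> pairs = parse(Arrays.asList(
                "\u9ED8\u5B88\u6210\u89C4 \u58A8\u5B88\u6210\u89C4"));
        System.out.println(correct("\u9ED8\u5B88\u6210\u89C4", pairs));
    }
}
```

With the example file content, `correct` maps 默守成规 to 墨守成规 and leaves any text without a known wrong phrase unchanged.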

Long text mixing Chinese and English

Scenario

In real spelling-correction scenarios, the best user experience is to let users enter long text, which may mix Chinese and English.

The functions described above are then applied to that text.

Core method

The WordCheckers utility class automatically handles long text that mixes Chinese and English.

| Function | Method | Parameters | Return value | Remarks |
|:---|:---|:---|:---|:---|
| Determine whether the text is spelled correctly | isCorrect(string) | The text to check | boolean | |
| Return the best correction | correct(string) | The text to check | String | Returns the input itself if no correction is found |
| Return the corrections for each word in the text | correctMap(string) | The text to check | Map<String, List<String>> | Returns all matching corrections for each word |
| Return the corrections for each word, with a size limit | correctMap(string, int limit) | The text to check, the maximum list size | Map<String, List<String>> | List size <= limit |
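Before mixed text can be checked, it has to be cut into English and Chinese pieces so that each piece can go to the matching checker. The sketch below shows one simple way to do that by grouping consecutive characters of the same kind. It is an assumption-laden illustration: the class `MixedTextSketch` is hypothetical, and WordCheckers segments Chinese text more finely than this (its correctMap output lists individual characters and dictionary phrases).

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative splitter for mixed Chinese and English text (hypothetical; not the library's segmenter). */
final class MixedTextSketch {
    enum Kind { ENGLISH, CHINESE, OTHER }

    /** Classifies a code point as an ASCII letter, a Han character, or anything else. */
    static Kind kindOf(int codePoint) {
        if ((codePoint >= 'a' && codePoint <= 'z') || (codePoint >= 'A' && codePoint <= 'Z')) {
            return Kind.ENGLISH;
        }
        if (Character.UnicodeScript.of(codePoint) == Character.UnicodeScript.HAN) {
            return Kind.CHINESE;
        }
        return Kind.OTHER;
    }

    /** Splits the text into maximal runs of characters that share the same kind. */
    static List<String> split(String text) {
        List<String> runs = new ArrayList<>();
        int start = 0;
        int i = 0;
        while (i < text.length()) {
            int next = i + Character.charCount(text.codePointAt(i));
            boolean boundary = next >= text.length()
                    || kindOf(text.codePointAt(next)) != kindOf(text.codePointAt(i));
            if (boundary) {
                runs.add(text.substring(start, next));
                start = next;
            }
            i = next;
        }
        return runs;
    }

    public static void main(String[] args) {
        // "hello" followed by a space and two Han characters yields three runs.
        System.out.println(split("hello \u4F60\u597D").size()); // prints 3
    }
}
```

Each ENGLISH run could then be passed to an English checker and each CHINESE run to a Chinese checker, while OTHER runs such as spaces are kept as they are. Note that this simple rule would also split hyphenated custom words, so it is only a starting point.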

Is the spelling correct?

final String hello = "hello 你好";
final String speling = "speling 你好 以毒功毒";
Assert.assertTrue(WordCheckers.isCorrect(hello));
Assert.assertFalse(WordCheckers.isCorrect(speling));

Return the best corrected result

final String hello = "hello 你好";
final String speling = "speling 你好以毒功毒";
Assert.assertEquals("hello 你好", WordCheckers.correct(hello));
Assert.assertEquals("selling 你好以毒攻毒", WordCheckers.correct(speling));

Return the corrections for each word

Each word in the text is mapped to its list of corrections.

final String hello = "hello 你好";
final String speling = "speling 你好以毒功毒";
Assert.assertEquals("{hello=[hello],  =[ ], 你=[你], 好=[好]}", WordCheckers.correctMap(hello).toString());
Assert.assertEquals("{ =[ ], speling=[selling, spewing, sperling, seeling, spieling, spiling, speeling, speiling, spelding], 你=[你], 好=[好], 以毒功毒=[以毒攻毒]}", WordCheckers.correctMap(speling).toString());

Return the corrections for each word, with a size limit

Same as above, but with a maximum number of corrections returned per word.

final String hello = "hello 你好";
final String speling = "speling 你好以毒功毒";

Assert.assertEquals("{hello=[hello],  =[ ], 你=[你], 好=[好]}", WordCheckers.correctMap(hello, 2).toString());
Assert.assertEquals("{ =[ ], speling=[selling, spewing], 你=[你], 好=[好], 以毒功毒=[以毒攻毒]}", WordCheckers.correctMap(speling, 2).toString());

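NLP open-source matrix

pinyin: converts Chinese characters to pinyin

pinyin2hanzi: converts pinyin to Chinese characters

segment: high-performance Chinese word segmentation

opencc4j: conversion between Traditional and Simplified Chinese

nlp-hanzi-similar: Chinese character similarity

word-checker: spelling detection

sensitive-word: sensitive words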

Road map

  • Support English word segmentation and process the entire English sentence

  • Support Chinese word segmentation spelling detection

  • Introduce a Chinese error-correction algorithm that handles homophones and visually similar characters

  • Support Chinese and English mixed spelling detection

Technical Acknowledgements

Words provides raw English word data.