/moman

Forever incomplete suite of tools for an orthographic/grammatical checker

Primary LanguagePythonMIT LicenseMIT

Moman

Description

This was supposed to be a suite of tools to be used by an orthographic/grammatical checker and the checker itself. However, the project is mainly dead right now. But I encourage you to look through the code and use it as inspiration/reference. The tools are currently coded in Python, but I started a while back to rewrite it in Lisp (which will never be finished). Moman, the suite itself, consist of the following tools:

  • FineNight is the FSA library.
  • A FST library. (Not yet implemented)
  • ZSpell is the orthographic checker.

Mostly, the only part of the tools suite which is worthwhile mentioning is the "Fast String Correction" which is used by Lucene's FuzzyQuery. You can read about the inclusion of this project in Lucene by reading Michael McCandless's article.

FineNight

The FineNight library contains many algorithms for Finite State Automatons. That includes:

  • Union of two FSAs
  • Intersection of two FSAs
  • Complement of a FSAs
  • Difference of two FSAs
  • Reversal of a FSA
  • Closure of a FSA
  • Concatenation of two FSAs
  • Determination of a NFA
  • Equivalence test
  • Minimization algorithm
  • Construction of an IADFA from a sorted dictionary
  • Graphviz support
  • Error-Tolerant IADFA (starred in Michael McCandless's article

Almost all algorithms were taken from the book Introduction to Automata Theory, Languages, and Computation. The minimization algorithm is an implementation of Brzozowski's method. In this method, the (possibly non-deterministic) automaton is reversed, determinized, reversed and determinized. I'll eventually add the Hopcroft's nlog(n) minimization algorithm.

ZSpell

ZSpell is meant to be a concurrent of aspell, made by Kevin Atkinson. At this time, ZSpell can suggest words with a Levenshtein-distance of one. Before we were using Kemal Oflazer's algorithm. This algorithm is very slow, but now we use a faster algorithm (Schulz's and Mihov's algorithm). However, only substitution, removal and insertion are used for the faster algorithm. It means that transpositions errors, like "ehllo" -> "hello", are considered as two operations.

TODOs includes:

  • Add transposition errors for Levenshtein-distance algorithm.
  • Add phonetic errors (spelling by sound).
  • Add derivation errors.

References