Multi-language, Multi word, utf-8 spelling correction server for e.g. search engines using a levenshtein automaton and a Trie. Written in typescript.
- Is capable of splitting words, joining words, correcting phrases, etc
- Calculates distances according to damerau-levenshtein and scores multi character transliterations (german umlauts) with a distance of one.
- Applies a max edit distance per word.
- Uses several optimizations like re-using trie nodes when correcting phrases to achieve single digit millisecond response times most of the time.
- Applies multiple rules for choosing the best correction, including a user supplied score.
You need to install node (>= 14) and yarn and run:
yarn install
Afterwards you can build it via:
yarn build
Please run yarn serve --help
to get an overview of available command line
options. To start the server, which listens on port 8080 by default, run:
yarn serve --files /path/to/*.dic
The .dic
files are user-supplied tab separated text files:
en some phrase 3942
en keyword 3491
...
containing three columns per line. First column is an arbitrary language identifier, the second column is the keyword or phrase and the third column is a frequency or score value. If there are multiple matches with the same distance, the one with a higher score wins. Correction is done greedily, i.e. phrases are generally preferred.
Please note, spella lowercases the phrases from the dictionary but does not normalize them in any other way, as there is hardly any normalization that fits every possible use case. Therefore, you better normalize the phrases in the dictionary files yourself beforehand according to your needs.
The server listens on port 8080 and uses a simple JSON based protocol.
Request:
curl -X GET http://127.0.0.1:8080/corrections?language=en&text=some+phrse+and+keword
Response:
{
"text": "some phrase and keyword",
"distance": 2,
"score": 7433,
"took": 6,
"corrections": [
{
"original": "some phrse",
"text": "some phrase",
"distance": 1,
"score": 3942,
"found": true
},
{
"original": "and",
"text": "and",
"distance": 0,
"score": 0,
"found": false
},
{
"original": "keword",
"text": "keyword",
"distance": 1,
"score": 3491,
"found": true
}
]
}
where distance
is the damerau levenshtein distance value and took
tells you
how long the response took. As you see, spella returns the full correction and
details about every single correction.
Additionally, spella provides an /info
endpoint which provides version info
and which can e.g. be used for docker health checks:
curl -X GET http://localhost:8080/info
The criteria for choosing the best correction are:
- the number of words (higher is better)
- distance (smaller is better)
- whether or not a correction matches the original when transliterated
- the user supplied score (higher is better)
Currently, the default max allowed edit distances are:
- token length < 4 characters: won't be corrected
- token length < 9 characters: a maximum edit distance of 1 is used
- else: a maximum edit distance of 2 is used
You can change those using the --distances
command line option and pass a
comma separated list of string lenghts. For instance, --distance 3,6,9
means
- token length < 3 characters: won't be corrected
- token length < 6 characters: a maximum edit distance of 1 is used
- token length < 9 characters: a maximum edit distance of 2 is used
- else a maximum edit distance of 3 is used
It is strongly recommended to have an overall maximum edit distance of 2 for performance reasons.
Spella currently does not split words if the resulting phrase is not present in any dictionary file. For instance, the query "tabletennis", will only be corrected to "table tennis" if a dictionary file contains "table tennis". It is not enough to have "table" and "tennis" individually present in dictionary files. This was decided because the respective corrections can be low quality due to missing context.