/spellwise

🚀 Extremely fast fuzzy spelling checker and suggester in Python!

Primary LanguagePythonMIT LicenseMIT

Spellwise

🚀 Extremely fast spelling checker and suggester in Python!

PyPI - Python Version PyPI version Downloads PyPI - Wheel License: MIT Code style: black

The following algorithms are supported currently,

All the above algorithms use an underlying Trie based dictionary for efficient storage and fast computation! Implementations of all the algorithms are inspired by the amazing article Fast and Easy Levenshtein distance using a Trie, by Steve Hanov.

📦 Installation

The easiest way to install spellwise is through pip.

pip install spellwise

🧑‍💻 Usage

Currently, there are five algorithms available for use with the following classnames,

  • Levenshtein
  • Editex
  • Soundex
  • CaverphoneOne
  • CaverphoneTwo
  • Typox

Please check the examples/ folder for specific usage of each algorithm. But in a general sense, each algorithm has three parts,

  • Initialization (initialize the class object for the algorithm to use)
  • Index correct words/names (add correct words or names to the dictionary)
  • Fetch suggestions (inference)
from spellwise import (CaverphoneOne, CaverphoneTwo, Editex, Levenshtein,
                       Soundex, Typox)

# (1) Initialize the desired algorithm
algorithm = Editex() # this can be CaverphoneOne, CaverphoneTwo, Levenshtein or Typox as well

# (2) Index the words/names to the algorithm
# Indexing can be done by adding words from a file
algorithm.add_from_path("<path-to-the-dictionary-file>")
# or by adding them manually
algorithm.add_words(["spell", "spelling", "check"])

# (3) Fetch the suggestions for the word
suggestions = algorithm.get_suggestions("spellin")
# The `suggestions` is a list of dict with fields `word` and `distance`
# [{"word": ..., "distance": ...}, ...]
print(suggestions)

# Output would be similar to the following,
# [{'word': 'spelling', 'distance': 2}]

The default maximum distance considered varies for different algorithms. It can be changed while fetching the suggestions,

# Fetch suggestions with maximum distance 4
suggestions = algorithm.get_suggestions("spellin", max_distance=4)
# Print the suggestions
print(suggestions)

# Output would be similar to the following,
# [{'word': 'spelling', 'distance': 2}, {'word': 'spell', 'distance': 4}]

💡 Analysis of each algorithm

There are many algorithms currently available in the package, each suitable for different purposes. We will discuss each algorithm in specific in the following sections.

(1) Levenshtein

The Levenshtein algorithm is the baseline and most popular method to identify the closest correct words given the misspelt word, based on the edit-distance (number of insertions, deletions and replacements) between the given word and the correctly spelt word.

from spellwise import Levenshtein

# Initialise the algorithm
levenshtein = Levenshtein()
# Index the words from a dictionary
levenshtein.add_from_path("examples/data/american-english")

# Fetch suggestions
suggestions = levenshtein.get_suggestions("run")
# Print the top 10 suggestions
print("Word \t Distance")
print("=================")
for suggestion in suggestions[0:10]:
    print("{} \t {}".format(suggestion.get("word"), suggestion.get("distance")))

Levenshtein provides the following,

Word 	 Distance
=================
run 	 0
bun 	 1
dun 	 1
fun 	 1
gun 	 1
hun 	 1
jun 	 1
jun 	 1
mun 	 1
nun 	 1

(2) Editex

The Editex algorithm provides suggestions of words which are phonetically closed to the given word. It also uses the edit-distance but has a different replacement or deletion costs depending on whether the two letters belong to the same phonetic group or not.

from spellwise import Editex

# Initialise the algorithm
editex = Editex()
# Index the words from a dictionary
editex.add_from_path("examples/data/american-english")

# Fetch suggestions
suggestions = editex.get_suggestions("run")
# Print the top 10 suggestions
print("Word \t Distance")
print("=================")
for suggestion in suggestions[0:10]:
    print("{} \t {}".format(suggestion.get("word"), suggestion.get("distance")))

Editex suggests the following,

Word 	 Distance
=================
run 	 0
ran 	 1
ron 	 1
ruin 	 1
rum 	 1
bun 	 2
dun 	 2
dunn 	 2
fun 	 2
gun 	 2

Notice that the Levenshtein algorithm computes the distance between run and bun as 1 (since there is only one replacement necessary). On the other hand, Editex algorithm computes this distance as 2 since phonetically, the words are farther apart.

As mentioned above, the Editex algorithm uses different costs for replacement and deletion. These values can be modified for fetching different results.

from spellwise import Editex

# Initialise the algorithm
editex = Editex(group_cost=0.5, non_group_cost=3) # configure the group cost and non-group cost
# Index the words from a dictionary
editex.add_from_path("examples/data/american-english")

# Fetch suggestions
suggestions = editex.get_suggestions("run")
# Print the top 10 suggestions
print("Word \t Distance")
print("=================")
for suggestion in suggestions[0:10]:
    print("{} \t {}".format(suggestion.get("word"), suggestion.get("distance")))

Configuring group_cost=0.5 and non_group_cost=3 in the above example results in the following suggestions,

Word 	 Distance
=================
run 	 0
ran 	 0.5
ron 	 0.5
ruin 	 0.5
rum 	 0.5
lan 	 1.0
len 	 1.0
lin 	 1.0
lon 	 1.0
loon 	 1.0

(3) Soundex

The Soundex algorithm, similar to Editex aims to provide phonetically similar words for the give word. It is one of the initial phonetic matching algorithms and many variations exists.

from spellwise import Soundex

# Initialise the algorithm
soundex = Soundex()
# Index the words from a dictionary
soundex.add_from_path("examples/data/american-english")

# Fetch suggestions
suggestions = soundex.get_suggestions("run")
# Print the top 10 suggestions
print("Word \t Distance")
print("=================")
for suggestion in suggestions[0:10]:
    print("{} \t {}".format(suggestion.get("word"), suggestion.get("distance")))

Soundex suggests the following,

Word 	 Distance
=================
rain 	 0
rainy 	 0
ram 	 0
ram 	 0
rama 	 0
ramie 	 0
ran 	 0
ranee 	 0
rayon 	 0
ream 	 0

(4) Caverphone 1.0 and Caverphone 2.0

The Caverphone algorithm was developed as a part of the Caversham project to phonetically identify the names of different instances of the same person from various sources. In other words, it is used for phonetically identifying duplicate entries of an entity or a word. The difference between the v1 and v2 of the algorithm is in the pre-processing of words during indexing.

from spellwise import CaverphoneTwo # or CaverphoneOne

# Initialise the algorithm
caverphone = CaverphoneTwo()
# Index the words from a dictionary
caverphone.add_from_path("examples/data/american-english")

# Fetch suggestions
suggestions = caverphone.get_suggestions("run")
# Print the top 10 suggestions
print("Word \t Distance")
print("=================")
for suggestion in suggestions[0:10]:
    print("{} \t {}".format(suggestion.get("word"), suggestion.get("distance")))

Caverphone v2 provides the following suggestions,

Word 	 Distance
=================
rain 	 0
ran 	 0
rein 	 0
rene 	 0
roan 	 0
ron 	 0
ruin 	 0
run 	 0
rune 	 0
wren 	 0

(5) Typox

The Typox is a Typographic based correction algorithm optimised for correcting typos in QWERTY keyboard. This is similar to the Editex algorithm, except that the letters are grouped based on their locations on the keyboard, instead of grouping them phonetically. The original paper is not available to read for free, and hence this might not be its exact implementation.

from spellwise import Typox

# Initialise the algorithm
typox = Typox()
# Index the words from a dictionary
typox.add_from_path("examples/data/american-english")

# Fetch suggestions
suggestions = typox.get_suggestions("ohomr")
# Print the top 10 suggestions
print("Word \t Distance")
print("=================")
for suggestion in suggestions[0:10]:
    print("{} \t {}".format(suggestion.get("word"), suggestion.get("distance")))

Typox provides the following words,

Word 	 Distance
=================
home 	 2
phone 	 2

Notice that Typox did not suggest words like choke, come, chore, chose etc., (which Levenshtein would suggest) even though they are of edit-distance 2 with the word ohome. But it rather suggests closest words based on the QWERTY keyboard layout which are phone and home.

As mentioned above, the Typox algorithm is similar to Editex and uses different costs for replacement and deletion. These values can be modified for fetching different results.

from spellwise import Typox

# Initialise the algorithm
typox = Typox(group_cost=0.5, non_group_cost=3) # configure the group cost and non-group cost
# Index the words from a dictionary
typox.add_from_path("examples/data/american-english")

# Fetch suggestions
suggestions = typox.get_suggestions("ohomr")
# Print the top 10 suggestions
print("Word \t Distance")
print("=================")
for suggestion in suggestions[0:10]:
    print("{} \t {}".format(suggestion.get("word"), suggestion.get("distance")))

Typox provides the following suggestion for the word ohomr after setting the group_cost=0.5 and non_group_cost=3.

Word 	 Distance
=================
phone 	 1.5
phoned 	 2.0
phones 	 2.0

⚡️ Memory and Time profiling

The following are the usage statistics on a MacBook Pro, 2.4 GHz Quad-Core Intel Core i5 with 16 GB RAM.

Algorithm No. of words Corpus size on disk Memory used Time to index Time to inference Remarks
Levenshtein 119,095 1.1 MB ~ 127 MB ~ 1160 milliseconds ~ 36 milliseconds
  • For word "hallo"
  • With max distance 2
Editex 119,095 1.1 MB ~ 127 MB ~ 1200 milliseconds ~ 90 milliseconds
  • For word "hallo"
  • With max distance 2
Soundex 119,095 1.1 MB ~ 16 MB ~ 1130 milliseconds ~ 0.18 milliseconds (yes right!)
  • For word "hallo"
Caverphone 1.0 119,095 1.1 MB ~ 36.7 MB ~ 1700 milliseconds ~ 0.2 milliseconds (yes right!)
  • For word "hallo"
Caverphone 2.0 119,095 1.1 MB ~ 99 MB ~ 2400 milliseconds ~ 0.4 milliseconds (yes right!)
  • For word "hallo"
Typox 119,095 1.1 MB ~ 127 MB ~ 1360 milliseconds ~ 200 milliseconds
  • For word "hallo"
  • With max distance 2

🙌 Contributing

Please feel free to raise PRs! 😃

There are so many algorithms to be added and improvements to be made to this package. This package is still in an early version and would love to have your contributions!

📝 References