stringdist

The stringdist package contains several string metrics for calculating the edit distance between two strings. This includes Levenshtein distance, Damerau-Levenshtein (both Optimal String Alignment (OSA) and the true Damerau-Levenshtein), Jaro, Jaro-Winkler, and additionally a BK-tree that can be used for autocorrect.

Algorithms

  • Levenshtein: A string metric for measuring the difference between two sequences, computed as the minimum number of single-character edits (insertions, substitutions and deletions) required to change one word into the other (a dynamic-programming sketch follows this list).
  • Damerau-Levenshtein: Similar to Levenshtein, but also allows transposition of two adjacent characters. Can be computed with two different algorithms: Optimal String Alignment (OSA) and the true Damerau-Levenshtein. OSA assumes that no substring is edited more than once.
  • Jaro: The Jaro distance between two words is based on the number of matching characters and the minimum number of single-character transpositions required to change one word into the other.
  • Jaro-Winkler: Similar to Jaro, but uses a prefix scale which gives more favourable ratings to strings that match from the beginning for a set prefix length.
  • BK-Tree: A tree data structure specialized to index data in a metric space. Can be used for approximate string matching in a dictionary.
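
As a rough illustration of the Levenshtein metric, a minimal dynamic-programming sketch in Go might look like the following (illustrative only, not the package's actual implementation):

```go
package main

import "fmt"

// levenshtein returns the minimum number of single-character insertions,
// deletions and substitutions needed to turn a into b.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1) // previous row of the DP matrix
	curr := make([]int, len(rb)+1) // current row
	for j := range prev {
		prev[j] = j // cost of building b[:j] from an empty string
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i // cost of deleting all of a[:i]
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0 // characters match, no substitution needed
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

func min3(a, b, c int) int {
	m := a
	if b < m {
		m = b
	}
	if c < m {
		m = c
	}
	return m
}

func main() {
	fmt.Println(levenshtein("kitten", "sitting")) // prints 3
}
```

The OSA variant adds one extra case to the same recurrence: when the last two characters of each word are swapped, the cost of a transposition is also considered.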

Other algorithms to explore:

  • Sift3/4 algorithm
  • Soundex
  • Metaphone
  • Hamming Distance
  • Symspell
  • Linspell

Thoughts

  • Autocorrect can be implemented using any of the distance metrics (such as Levenshtein) together with a BK-tree.
  • The distance metric can be supplied to the BK-tree through an interface (see the sketch below).
  • Dictionary words can first be supplied to the tree, and subsequent words can be added later through other means (syncing, streaming, pub-sub).
  • The tree can be snapshotted periodically to avoid rebuilds (e.g. using gob); tests should be conducted to see whether rebuilding the tree is faster than reloading a snapshot of the whole tree.
  • Building separate trees per prefix (A-Z) might give better performance (?). How to avoid hotspots (far more words start with some letters than others)?
  • Can part of the tree be transmitted over the network?
  • How to blacklist words that are not supposed to be searchable (e.g. profanity)?
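
As a rough sketch of the first two thoughts, a BK-tree can accept its distance metric through a small interface; the names below (Metric, MetricFunc, BKTree, Add, Search) are illustrative and not the package's actual API:

```go
package stringdist

// Metric is the pluggable distance interface the tree searches with.
type Metric interface {
	Distance(a, b string) int
}

// MetricFunc adapts an ordinary function (e.g. a Levenshtein
// implementation) to the Metric interface.
type MetricFunc func(a, b string) int

func (f MetricFunc) Distance(a, b string) int { return f(a, b) }

type node struct {
	word     string
	children map[int]*node // child subtrees keyed by distance to word
}

// BKTree indexes words in a metric space for approximate matching.
type BKTree struct {
	root *node
	m    Metric
}

func NewBKTree(m Metric) *BKTree { return &BKTree{m: m} }

// Add inserts a dictionary word into the tree.
func (t *BKTree) Add(word string) {
	if t.root == nil {
		t.root = &node{word: word, children: map[int]*node{}}
		return
	}
	cur := t.root
	for {
		d := t.m.Distance(word, cur.word)
		if d == 0 {
			return // word already present
		}
		next, ok := cur.children[d]
		if !ok {
			cur.children[d] = &node{word: word, children: map[int]*node{}}
			return
		}
		cur = next
	}
}

// Search returns every word within maxDist of query, pruning subtrees
// whose distance range cannot contain a match (triangle inequality).
func (t *BKTree) Search(query string, maxDist int) []string {
	var matches []string
	var walk func(n *node)
	walk = func(n *node) {
		if n == nil {
			return
		}
		d := t.m.Distance(query, n.word)
		if d <= maxDist {
			matches = append(matches, n.word)
		}
		for childDist, child := range n.children {
			if childDist >= d-maxDist && childDist <= d+maxDist {
				walk(child)
			}
		}
	}
	walk(t.root)
	return matches
}
```

The tree could then be wired to any metric, e.g. `NewBKTree(MetricFunc(levenshtein))` with a Levenshtein function like the one sketched earlier; `Search("bok", 1)` would return all dictionary words within edit distance 1 as autocorrect candidates.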

References