This Julia package computes various distances between AbstractStrings
The package is registered in the General
registry and so can be installed at the REPL with ] add StringDistances
.
The function evaluate
returns the distance between two strings. Its syntax is:
evaluate(dist, s1, s2)
where s1
and s2
can be any iterator with a length
method (e.g. AbstractString
, GraphemeIterator
, AbstractVector
...), and dist
is one of the following distances:
-
Edit Distances
- Jaro Distance
Jaro()
- Levenshtein Distance
Levenshtein()
- Damerau-Levenshtein Distance
DamerauLevenshtein()
- RatcliffObershelp Distance
RatcliffObershelp()
- Jaro Distance
-
Q-gram distances compare the set of all substrings of length
q
in each string.- QGram Distance
Qgram(q::Int)
- Cosine Distance
Cosine(q::Int)
- Jaccard Distance
Jaccard(q::Int)
- Overlap Distance
Overlap(q::Int)
- Sorensen-Dice Distance
SorensenDice(q::Int)
- QGram Distance
-
The package includes distance "modifiers", that can be applied to any distance.
- Winkler boosts the similary score of strings with common prefixes. The Winkler adjustment was originally defined for the Jaro similarity score but this package defines it for any string distance.
- Partial returns the maximal similarity score between the shorter string and substrings of the longer string.
- TokenSort adjusts for differences in word orders by reording words alphabetically.
- TokenSet adjusts for differences in word orders and word numbers by comparing the intersection of two strings with each string.
- TokenMax combines scores using the base distance, the
Partial
,TokenSort
andTokenSet
modifiers, with penalty terms depending on string lengths.
Some examples:
evaluate(Jaro(), "martha", "marhta")
evaluate(Winkler(Jaro()), "martha", "marhta")
evaluate(QGram(2), "martha", "marhta")
evaluate(Winkler(QGram(2)), "martha", "marhta")
evaluate(Levenshtein(), "martha", "marhta")
evaluate(Partial(Levenshtein()), "martha", "marhta")
evaluate(Jaro(), "martha", "marhta")
evaluate(TokenSet(Jaro()), "martha", "marhta")
evaluate(TokenMax(RatcliffObershelp()), "martha", "marhta")
Alternatively, each dist
struct can be used as a callable to call the evaluate function of each metric or modified metric, for example:
Jaro()("martha", "marhta")
Winkler(Jaro())("martha", "marhta")
QGram(2)("martha", "marhta")
A good distance to match strings composed of multiple words (like addresses) is TokenMax(Levenshtein())
(see fuzzywuzzy).
The function compare
is defined as 1 minus the normalized distance between two strings. It always returns a number between 0 and 1: a value of 0 means completely different and a value of 1 means completely similar.
evaluate(Levenshtein(), "New York", "New York")
#> 0
compare("New York", "New York", Levenshtein())
#> 1.0
-
findmax
returns the value and index of the element initr
with the highest similarity score withs
. Its syntax is:findmax(s, itr, dist::StringDistance; min_score = 0.0)
-
findall
returns the indices of all elements initr
with a similarity score withs
higher than a minimum value (default to 0.8). Its syntax is:findall(s, itr, dist::StringDistance; min_score = 0.8)
The functions findmax
and findall
are particularly optimized for Levenshtein
and DamerauLevenshtein
distances (as well as their modifications via Partial
, TokenSort
, TokenSet
, or TokenMax
).
- The stringdist Package for Approximate String Matching Mark P.J. van der Loo
- fuzzywuzzy