Approximate matching and string distance calculations for R.
Current CRAN version: 0.5.0
String distance functions are scattered around R, and R's packages. Moreover, approximate string matching functions are scarce. The package offers four main functions:
stringdist
computes pairwise distances between two input character vectors (shorter one is recycled)stringdistmatrix
computes the distance matrix between two input character vectors, optionally running in parallel.amatch
is a fuzzy matching equivalent of R's nativematch
functionain
is a fuzzy matching equivalent of R's native%in%
operator
These functions are built upon C
-code that re-implements some common (weighted) string
distance functions. As of version >0.5
, distance functions include:
- Hamming distance;
- Levenshtein distance (weighted);
- Restricted Damerau-Levenshtein distance (weighted, a.k.a. Optimal String Alignment);
- Full Damerau-Levenshtein distance (weighted);
- Longest Common Substring distance;
- Q-gram distance (two implementations using different q-gram storage solutions; currently one is exposed).
- cosine distance for q-gram count vectors (= 1-cosine similarity)
- Jaccard distance for q-gram count vectors (= 1-Jaccard similarity)
- Jaro, and Jaro-Winker distance
To my best knowledge, the latter six were not available before in R.
Besides the above the function qgrams
tabulates the qgrams in a charcter
vector.
Up to version <=0.5
, stringdist
returned -1
to indicate that either
maxDist
is exceeded, or- the distance is undefined between two input strings
From version >=0.6
, this will be replaced by Inf
. The main reason:
- This allows easier comparison using
<
and<=
The old convention stemmed from the fact that some distances (e.g. q-gram, hamming) are strictly
integers. I'm abandoning this expressing all distances as double
(R numeric
) which allows me
to use Inf
.
To install the latest release from CRAN, open an R terminal and type
install.packages('stringdist')
To obtain the package from source code open a bash
terminal (or git bash
if you work under Windows
with msysgit
) and type
$> git clone https://github.com/markvanderloo/stringdist.git
$> cd stringdist
$> bash ./build.bash
$> R CMD INSTALL output/stringdist_*.tar.gz
Warning: the github version can change any time and may not even build properly. As most
of the code is written in C
, the development version may crash your R
-session.