This implementation is extremely fast and efficient, using both well-known and novel optimizations.
—R.J.
January 17, 2019
Does the world really need another Levenshtein edit distance function for MySQL? YES. The most popular versions floating around on the internet are very slow and so poorly written as to be dangerous. Do not use them!
FAQ
Functions
Usage
DAMLEV
DAMLEVLIM
DAMLEVP
DAMLEV2D
Limitations
Requirements
Preparation for Use
Acquiring prebuilt binaries
Building from source
Installing
Warning
Authors and License
Q: How is Damerau-Levenshtein edit distance different from Levenshtein edit distance?
A: Levenshtein edit distance allows for insertions, deletions, and substitutions. Damerau-Levenshtein edit distance allows transposition in addition to the other three operations. Many "Levenshtein" functions are actually Damerau-Levenshtein functions.
Q: What are the *LIM
versions of the functions?
A: An optimization in which the algorithm will only perform the computation until the provided limit is reached. Then it returns the limit. This can be much more efficient if you only care about edit distance less than some known upper bound, a typical case with databases.
Q: What are the *P
versions of the functions?
A: These functions return a normalized edit distance. The number returned is (edit distance)/(max string length), which is always a number in the interval [0, 1]
. One interpretation of this is that of a percentage match.
Function | Description |
---|---|
DAMLEV(STRING, STRING) |
Computes the Damerau-Levenshtein edit distance between two strings. |
DAMLEVP(STRING, STRING) |
Computes a normalized Damerau-Levenshtein edit distance between two strings. |
DAMLEVLIM(STRING, STRING, INT) |
Computes the Damerau-Levenshtein edit distance between two strings up to a given max distance. Providing a max can significantly increase efficiency. |
DAMLEV2D(STRING, STRING) |
Computes the Levenshtein edit distance between two strings using a two row approach with optimization of vector length based on string lenght. |
DAMLEVCONST(STRING, CONSTANT STRING, INT) |
Computes the Damerau-Levenshtein edit distance between a string and a constant string up to a given max distance. Significant efficiency can result from the assumption that the second argument is constant. |
SELECT DAMLEV(String1, String2);
Argument | Meaning |
---|---|
String1 |
A string |
String2 |
A string which will be compared to String1 . |
Returns | Either an integer equal to the edit distance between String1 and String2 or PosInt , whichever is smaller. |
Example Usage:
SELECT Name, DAMLEV(Name, "Vladimir Iosifovich Levenshtein") AS EditDist
FROM CUSTOMERS WHERE DAMLEV(Name, "Vladimir Iosifovich Levenshtein") < 15;
The above will return all rows (Name, EditDist)
from the CUSTOMERS
table
where Name
has edit distance within 15 of "Vladimir Iosifovich Levenshtein".
SELECT DAMLEVLIM(String1, String2, PosInt);
Argument | Meaning |
---|---|
String1 |
A string |
String2 |
A string which will be compared to String1 . |
Returns | An integer equal to the edit distance between String1 and String2 or PosInt , whichever is smaller. |
Example Usage:
SELECT Name, DAMLEVLIM(Name, "Vladimir Iosifovich Levenshtein", 6) AS EditDist
FROM CUSTOMERS WHERE DAMLEVLIM(Name, "Vladimir Iosifovich Levenshtein", 6) < 6;
The above will return all rows (Name, EditDist)
from the CUSTOMERS
table
where Name
has edit distance within 6 of "Vladimir Iosifovich Levenshtein".
DAMLEVP(String1, String2);
Argument | Meaning |
---|---|
String1 |
A string |
String2 |
A string which will be compared to String1 . |
Returns | A floating point number in the range [0, 1] equal to the normalized edit distance between String1 and String2 . This function is functionally equivalent to DAMLEV(String1, String2)/MAX(LENGTH(String1), LENGTH(String2)) but is much faster. |
SELECT Name, DAMLEVP(Name, "Vladimir Iosifovich Levenshtein") AS EditDist
FROM CUSTOMERS WHERE DAMLEVP(Name, "Vladimir Iosifovich Levenshtein") < 0.2;
The above will return all rows (Name, EditDist)
from the CUSTOMERS
table
where Name
has edit distance within 20% of "Vladimir Iosifovich Levenshtein".
DAMLEVCONST(String1, ConstString, PosInt);
Argument | Meaning |
---|---|
String1 |
A string |
ConstString |
A constant string (string literal) which will be compared to String1 . |
PosInt |
A positive integer. If the distance between String1 and ConstString is greater than PosInt , DAMLEVCONST() will stop its computation at PosInt and return PosInt . Make PosInt as small as you can to improve speed and efficiency. For example, if you put WHERE DAMLEVCONST(...) < k in a WHERE -clause, make PosInt be k . |
Returns | Either an integer equal to the edit distance between String1 and ConstString or PosInt , whichever is smaller. |
SELECT Name, DAMLEVCONST(Name, "Vladimir Iosifovich Levenshtein", 8) AS EditDist
FROM CUSTOMERS WHERE DAMLEV(Name, "Vladimir Iosifovich Levenshtein", 8) < 8;
The above will return all rows (Name, EditDist)
from the CUSTOMERS
table
where Name
has edit distance within 8 of "Vladimir Iosifovich Levenshtein".
This is a two row approach that is space optimized but does not calculate transpositions. You can read about this approach here.
https://takeuforward.org/data-structure/edit-distance-dp-33/
DAMLEV2D(String1, String2);
Argument | Meaning |
---|---|
String1 |
A string |
String2 |
A string which will be compared to String1 . |
Returns | Either an integer equal to the edit distance between String1 and String2 . |
SELECT Name, DAMLEV2D(Name, "Vladimir Iosifovich Levenshtein") AS EditDist
FROM CUSTOMERS WHERE DAMLEV(Name, "Vladimir Iosifovich Levenshtein") < 8;
The above will return all rows (Name, EditDist)
from the CUSTOMERS
table
where Name
has edit distance within 8 of "Vladimir Iosifovich Levenshtein".
- This implementation assumes characters are represented as 8 bit
char
's on your platform. If you are using UTF-8 codepoints above 255 (i.e. outside of UCS-2), this function will not compute the correct edit distance between your strings. - This function is case sensitive. If you need case insensitivity, you need to either compose this
function with
LOWER
/TOLOWER
, or adapt the code. - There are a few edge cases that that give an erroneous damlev2D, 1 less than the actual distance
- By default,
PosInt
has a default maximum of 512 for performance reasons. Removing the maximum entirely is not supported at this time, but you can increase the default by definingDAMLEV_BUFFER_SIZE
to be a larger number prior to compilation:
$ export DAMLEV_BUFFER_SIZE=10000
Any one of these limitations would be a good for a contributor to solve. Make a pull request!
- MySQL version 8.0 or greater. Or not. I'm not sure. That's just what I used.
- The headers for your version of MySQL.
- CMake. Who knows what minimum version is required, but it should work on even very old versions. I used version 3.13.
- A C++ compiler released in the last decade. This code uses
constexpr
, which is a feature of C++11.
This is probably the easiest and fastest way to get going. Get pre-built binaries on the Releases page. There are pre-built binaries for Linux, macOS, and Windows. Download the file and put it in your MySQL plugins directory. Then procede to the Installing section.
Certainly! Here's your Docker build instruction revised and formatted in Markdown:
The Docker configuration is set up to persist the build
directory. When you run the Docker container, the .so
file will be generated in this directory. It's crucial to ensure that the chip architecture of your Docker environment matches your host machine to ensure compatibility with the .so
file.
-
Build and Start Docker Container: Run the following command to build and start the Docker container. This command also triggers the building process of the
.so
file:docker-compose up --build
-
Monitor the Output: You may see some warning messages during the build process, typically related to type mismatches or other non-critical issues.
-
Build Completion: The build process is complete when you see an output similar to:
damlev_udf | [100%] Linking CXX executable unittest damlev_udf | [100%] Built target unittest
-
Check the
build
Directory: After the build is complete, the.so
file can be found in thebuild
directory on your host machine.
- The
docker-compose up --build
command both builds the Docker image and starts the container as per thedocker-compose.yml
file. - The build process executes inside the Docker container, but thanks to the configured volume mount, the output
.so
file is accessible on your host machine. - Ensure compatibility between the Docker environment and your host machine, especially in terms of architecture (e.g., x86_64, ARM), for the
.so
file to function correctly. - If you're not familiar with docker ask your favorite ChatBot
This Markdown format maintains the structure and style you initially provided, offering clear and easy-to-follow instructions for building from Docker.
The usual CMake build process with make
:
$ mkdir build
$ cd build
$ cmake ..
$ make
$ make install
Alternatively, the usual CMake build process with ninja
:
$ mkdir build
$ cd build
$ cmake .. -G Ninja
$ ninja
$ ninja install
This will build the shared library libdamlev.so
(.dll
on Windows).
You can pass in MYSQL_INCLUDE
and MYSQL_PLUGIN_DIR
to tell CMake where to find mysql.h
and where to install the plugin respectively. This is particularly helpful on Windows machines, which tend not to have mysql_config
in the PATH
:
$ cmake -DMYSQL_INCLUDE="C:\Program Files\MySQL\MySQL Server 8.0\include" -DMYSQL_PLUGIN_DIR="C:\Program Files\MySQL\MySQL Server 8.0\lib\plugin" ..
- The build script tries to find the required header files with
mysql_config --include
. Otherwise, it takes a wild guess. Check to see ifmysql_config --plugindir
works on your command line. - As in #1, the install script tries to find the plugins directory with
mysql_config --plugindir
. See if that works on the command line.
After building, install the shared library libdamlev.so
to the plugins directory of your MySQL
installation:
$ sudo make install # or ninja install
$ mysql -u root
Enter your MySQL root user password to log in as root. Then follow the "usual" instructions for
installing a compiled UDF. Note that the names are case sensitive. Change out .so
for .dll
if you are on Windows.
CREATE FUNCTION damlev RETURNS INTEGER
SONAME 'libdamlev.so';
CREATE FUNCTION damlevlim RETURNS INTEGER
SONAME 'libdamlev.so';
CREATE FUNCTION damlevconst RETURNS INTEGER
SONAME 'libdamlev.so';
CREATE FUNCTION damlevp RETURNS REAL
SONAME 'libdamlev.so';
CREATE FUNCTION damlev2D RETURNS REAL
SONAME 'libdamlev.so';
To uninstall:
DROP FUNCTION damlev;
DROP FUNCTION damlevlim;
DROP FUNCTION damlevp;
DROP FUNCTION damlev2D;
DROP FUNCTION damlevconst;
Then optionally remove the library file from the plugins directory:
$ rm /path/to/plugin/dir/libdamlev.so
You can ask MySQL for the plugin directory:
$ mysql_config --plugindir
/path/to/directory/
Warning: Do NOT use random code you found on the internet in a production
environment without auditing it carefully. Especially don't use anything called
damlevlim256
. Yeah, I googled for Levenshtein UDFs just like you did. I found
some really heinous code. Horrible, very bad code that will give your children
lupus.
Copyright (C) 2019 Robert Jacobson. Released under the MIT license.
Based on "Iosifovich", Copyright (C) 2019 Frederik Hertzum, which is licensed under the MIT license: https://bitbucket.org/clearer/iosifovich.
The MIT License
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.