An Elasticsearch plugin for scoring documents based on string similarity
The plugin relies on the java-string-similarity library (https://github.com/tdebatty/java-string-similarity). The library is fully open source and publicly hosted on GitHub under the MIT licence.
The plugin currently supports these algorithms:
- Cosine similarity (cosine)
- Jaccard index (jaccard)
- Jaro-Winkler (jaro-winkler)
- Longest Common Subsequence (longest-common-subsequence)
- Normalized Levenshtein (levenshtein)
- Sorensen-Dice coefficient (dice)
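To give a feel for what these algorithms compute, here is a minimal Python sketch of two of them: normalized Levenshtein similarity and the Jaccard index over character n-grams. It is an illustration only, not the code used by the plugin or by java-string-similarity. The function names and the n-gram size `k=2` are assumptions chosen for this example, and the library's defaults may differ.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def normalized_levenshtein_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1.0 for identical strings, 0.0 for fully different ones."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein_distance(a, b) / longest


def jaccard_similarity(a: str, b: str, k: int = 2) -> float:
    """Jaccard index over the sets of character k-grams (k=2 is an assumption)."""
    grams_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    grams_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    union = grams_a | grams_b
    if not union:
        return 1.0  # neither string is long enough to produce a k-gram
    return len(grams_a & grams_b) / len(union)


if __name__ == "__main__":
    print(normalized_levenshtein_similarity("Alis", "Alice"))
    print(jaccard_similarity("Alis", "Alice"))
```

Both functions return a value between 0 and 1, where higher means more similar.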
The plugin can be built using Java 13 by running the following command:
./gradlew build
The plugin can be installed using the standard Elasticsearch plugin installation procedure:
elasticsearch-plugin install file://path-to-plugin-zip-file
systemctl restart elasticsearch
Replace path-to-plugin-zip-file with the correct path to the plugin installation zip file.
As with the old plugin, the new plugin can be tested by submitting a JSON-formatted query of the following form to an Elasticsearch server:
curl -X POST "localhost:9200/patients/_search?pretty=true" -H 'Content-Type: application/json' -d'{
  "query": {
    "function_score": {
      "query": {
        "match_all": {}
      },
      "functions": [
        {
          "script_score": {
            "script": {
              "source": "string_similarity",
              "lang": "similarity_scripts",
              "params": {
                "matchers": [{
                  "field": "given",
                  "value": "Alis",
                  "matcher": "jaro-winkler",
                  "high": 0.9,
                  "low": 0.1
                }]
              }
            }
          }
        }
      ]
    }
  }
}'
The matchers key contains an array with one entry per field to be searched. Each entry specifies the field name, the search value, the algorithm, and the high and low scores.
| Parameter | Description |
|---|---|
| field | The field to be searched, e.g. “given”. |
| value | The search term, e.g. “Alis”. |
| matcher | The algorithm to use for matching, e.g. “jaro-winkler”. |
| high | The score to be assigned to a string that matches the search term perfectly. |
| low | The score to be assigned to a string that does not match the search term at all. |
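When several fields need to be matched, it can be convenient to build the request body programmatically. The following Python sketch assembles the same query structure from a list of matcher entries. The helper names `build_similarity_query` and `send_query`, the second matcher on a `family` field, and its high and low values are hypothetical and only serve the illustration. The index name `patients` and the host `localhost:9200` are taken from the curl example.

```python
import json
import urllib.request


def build_similarity_query(matchers):
    """Wrap a list of matcher entries in the function_score query shown above."""
    return {
        "query": {
            "function_score": {
                "query": {"match_all": {}},
                "functions": [
                    {
                        "script_score": {
                            "script": {
                                "source": "string_similarity",
                                "lang": "similarity_scripts",
                                "params": {"matchers": matchers},
                            }
                        }
                    }
                ],
            }
        }
    }


def send_query(query, url="http://localhost:9200/patients/_search?pretty=true"):
    """POST the query to Elasticsearch and return the raw response text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


matchers = [
    {"field": "given", "value": "Alis", "matcher": "jaro-winkler",
     "high": 0.9, "low": 0.1},
    # Hypothetical second field, added only to show a multi-field search.
    {"field": "family", "value": "Smith", "matcher": "levenshtein",
     "high": 0.8, "low": 0.2},
]

body = json.dumps(build_similarity_query(matchers), indent=2)
print(body)
```

Calling `send_query(build_similarity_query(matchers))` would submit the query, assuming an Elasticsearch server with the plugin installed is reachable at the given URL.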