This project offers a fast and efficient and efficient fuzzy search on large dictionary. To (try to) offer a good level of performances, the two following approaches are used:
- The dictionary is stored as a Trie. So, if the search does not match, it is easy to cut directly a full branch.
- The matching is done by using a Levenshtein automaton which allow to easily propagate the state of the automaton as the Trie is walked
The automaton is fully described by Jules Jacobs in this article. So please be sure to give him a thumb.
The first step of the search is the creation of a Dictionary
structure that is holding all the words. The structure is
created simply by calling the function levenshteinsearch.CreateDictionary()
without any specific parameter.
Example
dict := levenshteinsearch.CreateDictionary()
Words are added one by one to the dictionary, using its member function Put().
Example
// Add an array of word to the dictionary
for _, word := range allWords {
dict.Put(word)
}
Once initialized, the dictionary has two properties WordCount
and UniqueWordCount
, giving information about its content
Example
// Get information about the dictionary
log.Printf("Number of words in the dictionary: %v", dict.WordCount)
log.Printf("Number of unique words in the dictionary: %v", dict.UniqueWordCount)
Information about a single word can be retrieved using the function Get
of the dictionary. The function will return a
pointer to a WordInformation
structure. Currently this structure only have a single information, named Count
which
represents the number of time the word was put in the dictionary. Please, note that Get
will return nil if the
requested word is not present in the dictionary.
Example
wordInformation := dict.Get("rabbit")
if wordInformation!=nil {
log.Printf("Number of times the word 'rabbit' is present: %v", wordInformation.Count)
} else {
log.Printf("The word 'rabbit' is not part of the dictionary")
}
The dictionary also allow to query for similar words. The similarity is given by the Levenshtein distance Wikipedia.
For searching the similar words, the dictionary has a function named SearchAll()
, that takes in parameters the searched
word and the maximum distance. The function return a map[string]*WordInformation
. The returned map has as key the found
similar word and as value the WordInformation
structure of this word.
// Search all word having maximum Levenshtein distance of 3
wordInformationByWord := dict.SearchAll("rabbit", 3)
// Display the number of similar words found
log.Printf("Number of words close to \"rabbit\" with a distance of %v: %v", distance, len(wordInformationByWord))
// Print all the similar words
for key,value := range wordInformationByWord {
log.Printf("\tWord: '%v' count: %v", key,value.Count)
}
A full working example is given in the folder /example/alice/alice.go
.
The result was benched against:
- a very naive: Dictionary is stored as a simple list of strings. For each query, all string of the list is tested
- a simple map: Dictionary is stored as a map having as key the word and as value the
*WordInformation
structure. Due to the map redundant word are not tested multiple times
The search is done with text of Alice's Adventures In Wonderland. For the tests the sets of words <rabbit> and <rabbit, eart, the> are used. Results as follow:
go version go1.12 linux/amd64
goos: linux
goarch: amd64
pkg: github.com/twuillemin/levenshteinsearch/pkg/levenshteinsearch
BenchmarkNaive1Word-8 30 49067618 ns/op 18398114 B/op 278685 allocs/op
BenchmarkNaive3Word-8 10 129002968 ns/op 43198313 B/op 842237 allocs/op
BenchmarkMap1Word-8 200 9906184 ns/op 3600539 B/op 36129 allocs/op
BenchmarkMap3Word-8 50 27408442 ns/op 9776948 B/op 108599 allocs/op
BenchmarkOptimized1Word-8 500000 2493 ns/op 2368 B/op 50 allocs/op
BenchmarkOptimized3Word-8 200000 9378 ns/op 7104 B/op 150 allocs/op
PASS
Copyright 2018 Thomas Wuillemin thomas.wuillemin@gmail.com
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this project or its content except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.