The Cleo search is explained here: LinkedIn's original article
The source for Jingwei Wu's version can be found here: Jingwei's version
Basically, this is a Go port of the original program, which is written in Java. I have included a corpus of words to search, downloaded from http://www.wordfrequency.info/
- The algorithm starts by looking up the query in the inverted index, which is keyed by word prefix (up to 4 characters). Each prefix maps to an array of (document ID, Bloom filter) tuples.
- Each candidate's Bloom filter is compared against the query's Bloom filter. Candidates that match advance to the next round.
- The remaining words are scored by their Levenshtein distance to the query, normalized using the Jaccard coefficient.
- The final words are returned as JSON.
- You can also change how scoring works. Provide a function with the signature func(s1, s2 string) (score float64).
This is a sample app:
package main

import "github.com/jamra/gocleo"

func main() {
	// Passing nil as the last argument selects the default scorer:
	// Levenshtein distance normalized by the Jaccard coefficient.
	cleo.InitAndRun("w1_fixed.txt", "8080", nil)
}
Run the program and navigate to localhost:8080/cleo/{query}, where {query} is your search term, e.g. tractor, nightingale, or pizza.
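To swap in your own scoring, any function of the form func(s1, s2 string) (score float64) should fit as the last argument to InitAndRun in place of nil. The sketch below is one possible custom scorer, not the package's default: normalizedDistance and levenshtein are hypothetical names, and dividing the edit distance by the longer word's length is an assumed normalization chosen for illustration. Whether lower or higher scores rank first is not stated here, so check the ordering before relying on it.

```go
package main

import "fmt"

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// levenshtein returns the edit distance between s1 and s2, counting
// insertions, deletions, and substitutions, using two rolling rows.
func levenshtein(s1, s2 string) int {
	a, b := []rune(s1), []rune(s2)
	prev := make([]int, len(b)+1)
	curr := make([]int, len(b)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(a); i++ {
		curr[0] = i
		for j := 1; j <= len(b); j++ {
			cost := 1
			if a[i-1] == b[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(b)]
}

// normalizedDistance has the shape func(s1, s2 string) (score float64).
// It divides the edit distance by the longer word's length, so the
// result lies in [0, 1], where 0 means the strings are identical.
func normalizedDistance(s1, s2 string) (score float64) {
	longest := len([]rune(s1))
	if n := len([]rune(s2)); n > longest {
		longest = n
	}
	if longest == 0 {
		return 0
	}
	return float64(levenshtein(s1, s2)) / float64(longest)
}

func main() {
	fmt.Println(levenshtein("kitten", "sitting"))     // prints: 3
	fmt.Println(normalizedDistance("pizza", "pizza")) // prints: 0
}
```

With the sample app, this would be passed as cleo.InitAndRun("w1_fixed.txt", "8080", normalizedDistance).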
You can run the search against your own corpus, as long as each term is on its own line. w1_fixed.txt is provided as an example.
This should work with go get:
go get github.com/jamra/gocleo
To do:
- Give the user the ability to add and remove words from the index.
- More robust unit testing.