Sketches is a library for sketching, locality sensitive hashing, approximate similarity search and other things.
Basic use case is search for all similar items in a dataset.
import atrox.sketch._
val sets: IndexedSeq[Set[Int]] = loadMyData()
val (bands, hashes) = LSH.pickHashesAndBands(threshold = 0.5, maxHashes = 64)
val lsh = LSH(sets, MinHash[Set[Int]](hashes), LSHBuildCfg(bands = bands))
val cfg = LSHCfg(maxResults = 50)
for (Sim(idx1, idx2, estimate, similarity) <- lsh.withConfig(cfg).allSimilarItems) {
println(s"similarity between item $idx1 and $idx2 is estimated to $estimate")
}
There's more configuration options available.
lsh.withConfig(LSHCfg(
// Return only the 100 most relevant results.
// It's strongly recommended to use this option.
maxResults = 100,
// Perform similarity search in parallel.
parallel = true,
// Use as much memory as needed. This leads to faster bulk queries but
// might need to store the complete result set in memory.
compact = false,
// Skip anomalously large buckets. This speeds things up quite a bit.
maxBucketSize = sets.size / 10
))
And more query methods.
lsh.similarItems(q)
lsh.similarIndexes(q)
lsh.allSimilarItems
lsh.allSimilarIndexes
And more sketching methods.
- MinHash and SingleBitMinHash for estimating Jaccard index
- WeightedMinHash for estimating weighted Jaccard index
- RandomHyperplanes for estimation cosine similarity
- RandomProjections for LSH based on euclidean distance
- HammingDistance
- SimHash