/sketches

Scala library for sketching, locality sensitive hashing, approximate similarity search and other things

Primary LanguageScalaOtherNOASSERTION

Sketches

Sketches is a library for sketching, locality sensitive hashing, approximate similarity search and other things.

Usage

Basic use case is search for all similar items in a dataset.

import atrox.sketch._

val sets: IndexedSeq[Set[Int]] = loadMyData()

val (bands, hashes) = LSH.pickHashesAndBands(threshold = 0.5, maxHashes = 64)
val lsh = LSH(sets, MinHash[Set[Int]](hashes), LSHBuildCfg(bands = bands))

val cfg = LSHCfg(maxResults = 50)

for (Sim(idx1, idx2, estimate, similarity) <- lsh.withConfig(cfg).allSimilarItems) {
  println(s"similarity between item $idx1 and $idx2 is estimated to $estimate")
}

There's more configuration options available.

lsh.withConfig(LSHCfg(
  // Return only the 100 most relevant results.
  // It's strongly recommended to use this option.
  maxResults = 100,

  // Perform similarity search in parallel.
  parallel = true,

  // Use as much memory as needed. This leads to faster bulk queries but
  // might need to store the complete result set in memory.
  compact = false,

  // Skip anomalously large buckets. This speeds things up quite a bit.
  maxBucketSize = sets.size / 10
))

And more query methods.

lsh.similarItems(q)
lsh.similarIndexes(q)
lsh.allSimilarItems
lsh.allSimilarIndexes

And more sketching methods.

  • MinHash and SingleBitMinHash for estimating Jaccard index
  • WeightedMinHash for estimating weighted Jaccard index
  • RandomHyperplanes for estimation cosine similarity
  • RandomProjections for LSH based on euclidean distance
  • HammingDistance
  • SimHash