lsh-scala
A Locality-Sensitive Hashing Library for Scala with optional Redis storage.
Overview
Locality-sensitive hashing (LSH) is a method for bucketing similar items together. It works using a hashing function that, unlike cryptographic hashes, maps similar input vectors to the same output hash. The input vectors are domain specific and can be raw data or processed feature vectors.
Locality-sensitive hashing has been applied to a number of different problems, including audio/visual fingerprinting, approximate nearest neighbours (ANN), image near-duplicate detection, and object similarity.
There are several ways to define the hash function for LSH. This library uses the random projection method as described in Andoni and Indyk (2006).
This library also makes use of the Breeze numerical processing library. Breeze will make use of highly-optimized native BLAS libraries if they are available on your system.
Requirements
- Scala 2.11+
- Breeze
- Redis and scala-redis if you wish to use Redis-backed storage
- Jacks for JSON serialization to/from Redis
For the test suites you will also need:
- scalatest 2.2+
- embedded-redis
Installation
Add to Build.scala
or build.sbt
libraryDependencies ++= Seq(
"io.krom" % "lsh-scala_2.11" % "0.1"
)
Clone the repository and use in your projects.
TODO: Maven and sbt setup
Usage
To create a Lsh
object you need to supply the dimensions of the hash and the number of sets of projection tables you
want to use.
- numBits - the number of bits in the hash space. This determines the number of buckets you will have: 2^numBits. The higher this number, the more buckets. More buckets means a smaller chance of getting a globally optimal solution, but it also means better performance
- numDimensions - the length of the input vectors. This needs to match your domain data you want to index.
- numTables - the number of sets of projections you want to use. Tables use different random projections to reduce the chance of missing nearby points in other buckets. The more tables, the lower chance of getting a very inaccurate result, but the performance becomes worse.
In-Memory Storage
The default storage is in-memory. Simply create the Lsh
object with no storage config.
import io.krom.lsh.Lsh
val numBits = 16
val numDimensions = 200
val numTables = 5
val lsh = Lsh(numBits, numDimensions, numTables)
Redis-Backed Storage
Redis-backed storage requires a Redis server running somewhere that is accessible. This storage choice uses Redis databases for the different tables, therefore make sure you don't set numTables to greater than the number of databases set up on your Redis server. For some Redis servers, this could be as low as 16. This isn't usually a restriction as a numTables value of between 3 and 5 is usually enough.
First, ensure Redis is installed and running on your system:
MacOS X:
> brew install redis
> redis-server
To create the Lsh
object:
import io.krom.lsh.Lsh
val numBits = 16
val numDimensions = 200
val numTables = 5
val redisConfig = HashMap( "host" -> "localhost", "port" -> "6379")
val lsh = Lsh(numBits, numDimensions, numTables, storageConfig = Some(redisConfig))
Indexing, Updating and Querying
Indexing:
Indexing hashes the data and stores the label and data in the backing storage (in-memory or Redis).
import breeze.linalg.DenseVector
val data = DenseVector(0.1, 0.2)
val label = "datapoint"
lsh.store(data, label)
Updating:
Updating will replace the existing entry with the same label.
val updatedData = DenseVector(0.3,0.4)
lsh.update(updatedData, label)
Querying:
Querying returns the top most similar items to the item specified. By default, it uses Euclidean distance and returns a maximum of 25 items. Currently, it will also return the item specified, if it's in the index.
import io.krom.lsh.DistanceFunction._
val data = DenseVector(0.1, 0.3)
// returns up to 25 items; uses Euclidean
val items1 = lsh.query(data)
// returns up to 50 items; uses Euclidean
val items2 = lsh.query(data, maxItems = 50)
// returns up to 25 items; uses Cosine distance
val items3 = lsh.query(data, distanceFunction = cosineDistance)
query()
accepts any function that maps two DenseVector[Double]
inputs to a single Double
output, so you can define
your own distance measures and use them.
Prefixes
Prefixes allows you to store different indexes in the same storage. This is mostly useful for when you are using Redis-backed storage, where you may be sharing a Redis server for a number of indexes.
It also allows you to keep different kinds of objects separate, which allows for greater performance in instances where you are only looking for specific kinds of objects in an otherwise heterogeneous index.
To use a prefix, simply specify the prefix in the constructor:
import io.krom.lsh.Lsh
val numBits = 16
val numDimensions = 200
val numTables = 5
val prefix = "myIndex"
val redisConfig = HashMap( "host" -> "localhost", "port" -> "6379")
val lsh = Lsh(numBits, numDimensions, numTables, prefix = Some(prefix), storageConfig = Some(redisConfig))
Using Pre-Calculated Projections
This library can use pre-calculated random projections. This enables use with previously indexed objects (for example, stored in a separate Redis server), and also allows multiple indexing/querying processes to share a single backing store if they all load the same random projections.
To pre-calculate the random projections:
import io.krom.lsh.Lsh
val numBits = 16
val numDimensions = 200
val numTables = 5
Lsh.generateRandomProjections(numBits, numDimensions, numTables, "random_projections.dat")
To use the pre-calculated random projections, specify the filename when creating the Lsh
object:
import io.krom.lsh.Lsh
val numBits = 16
val numDimensions = 200
val numTables = 5
val prefix = "myIndex"
val projectionsFilename = "random_projections.dat"
val lsh = Lsh(numBits, numDimensions, numTables, projectionsFilename = Some(projectionsFilename))
Contributing
Contributions of all kinds are always welcome.
New features and bug fixes should be submitted via pull-request. The preferred way is to fork the repository and submit the PR from there. All submissions should include tests
Any bug reports or feature requests (or comments or promises of gifts) should be submitted via the GitHub issues page. No request is too big or too small.
Changelog
All bug fixes and new features for each version are described in the lsh-scala Changelog
License
lsh-scala is released under the Apache 2.0 License