A Scala implementation of Annoy, which searches for the nearest neighbors of a given query point.
Ann4s also provides a DataFrame-based API for Apache Spark.
import ann4s._

object AnnoyExample {

  def main(args: Array[String]): Unit = {
    val f = 40 // length of each item vector that will be indexed
    val metric: Metric = Angular // or Euclidean
    val t = new AnnoyIndex(f, metric)
    (0 until 1000) foreach { i =>
      val v = Array.fill(f)(scala.util.Random.nextGaussian().toFloat)
      t.addItem(i, v)
    }
    t.build(10)
    // before `save`, t.getNnsByItem(0, 1000) runs on a HeapByteBuffer (in memory)
    t.save("test.ann") // `test.ann` is compatible with the native Annoy
    // after `save`, t.getNnsByItem(0, 1000) runs on a MappedFile (file-based)
    println(t.getNnsByItem(0, 1000).mkString(",")) // finds the 1000 nearest neighbors of item 0
  }

}
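To sanity-check the approximate results on a small dataset, they can be compared against an exact scan. The sketch below does not use ann4s at all: `BruteForceNeighbors`, `angular`, and `nearest` are hypothetical names introduced for illustration, and the distance assumes Annoy's angular metric, sqrt(2 * (1 - cos(u, v))).

```scala
object BruteForceNeighbors {

  // Angular distance as defined by Annoy: sqrt(2 * (1 - cos(u, v))).
  def angular(u: Array[Float], v: Array[Float]): Double = {
    require(u.length == v.length, "vectors must have the same length")
    var dot = 0.0
    var uu = 0.0
    var vv = 0.0
    var i = 0
    while (i < u.length) {
      dot += u(i) * v(i)
      uu += u(i) * u(i)
      vv += v(i) * v(i)
      i += 1
    }
    val norm = math.sqrt(uu * vv)
    // A zero vector is treated as orthogonal to everything (cos = 0).
    val cos = if (norm > 0) dot / norm else 0.0
    // Clamp at 0 to guard against tiny negative values from rounding.
    math.sqrt(math.max(0.0, 2.0 * (1.0 - cos)))
  }

  // Exact k nearest neighbors of items(query), found by scanning every item.
  // The query item itself is included, at distance 0.
  def nearest(items: IndexedSeq[Array[Float]], query: Int, k: Int): Seq[Int] =
    items.indices
      .map(i => (i, angular(items(query), items(i))))
      .sortBy(_._2)
      .take(k)
      .map(_._1)
}
```

For example, with `Vector(Array(1f, 0f), Array(0f, 1f), Array(1f, 0.1f))`, calling `BruteForceNeighbors.nearest(items, 0, 2)` yields item 0 itself followed by item 2. The scan is linear in the number of items per query, so it is only suitable as a reference for small inputs.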
import org.apache.spark.ml.recommendation.{ALS, ALSModel}
import org.apache.spark.sql.DataFrame
// Annoy and AnnoyModel are provided by ann4s

val dataset: DataFrame = ??? // your dataset

val alsModel: ALSModel = new ALS()
  .fit(dataset)

val annoyModel: AnnoyModel = new Annoy()
  .setDimension(alsModel.rank)
  .fit(alsModel.itemFactors)

val result: DataFrame = annoyModel
  .setK(10) // find 10 neighbors
  .transform(alsModel.itemFactors)

result.show()
The call to result.show() prints:
+---+--------+-----------+
| id|neighbor|   distance|
+---+--------+-----------+
|  0|       0|        0.0|
|  0|      50|0.014339785|
...
|  1|       1|        0.0|
|  1|      36|0.011467933|
...
+---+--------+-----------+
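In the output above, each item's closest neighbor is the item itself at distance 0.0. The following is a minimal sketch of turning such rows into per-item neighbor lists with plain Scala collections, once the rows have been collected locally. `NeighborRow` and `neighborsById` are hypothetical names, not part of ann4s, and the column types (`Int`, `Int`, `Float`) are assumed from the printed table.

```scala
object NeighborRows {

  // One row of the printed result: (id, neighbor, distance).
  // The field types are an assumption based on the table shown above.
  final case class NeighborRow(id: Int, neighbor: Int, distance: Float)

  // Group rows by id, drop each item's match with itself,
  // and order the remaining neighbors by increasing distance.
  def neighborsById(rows: Seq[NeighborRow]): Map[Int, Seq[Int]] =
    rows
      .filter(r => r.id != r.neighbor)
      .groupBy(_.id)
      .map { case (id, rs) => id -> rs.sortBy(_.distance).map(_.neighbor) }
}
```

Applied to the four rows visible in the table, this gives `Map(0 -> Seq(50), 1 -> Seq(36))`. An item whose only returned neighbor is itself will not appear in the map at all.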
- For more information about ALS, see the Apache Spark documentation on collaborative filtering
- A working example is at 'src/test/scala/ann4s/spark/AnnoySparkSpec.scala'
resolvers += Resolver.bintrayRepo("mskimm", "maven")
libraryDependencies += "com.github.mskimm" %% "ann4s" % "0.0.6"
Version 0.0.6 is built with Apache Spark 1.6.2.
- https://github.com/spotify/annoy : native implementation with several bindings, such as Python
- https://github.com/pishen/annoy4s : Scala wrapper using JNA
- https://github.com/spotify/annoy-java : Java implementation