Elixir Bindings for Spotify's Annoy Library

AnnoyEx

A NIF binding to Annoy, Spotify's C++ library for approximate nearest neighbors.

It implements all of the methods in the Spotify library and all index types except Hamming.

Code Examples

If you have your own vectors:

iex(1)> f = 40
40

iex(2)> t = AnnoyEx.new(f)
#Reference<0.3703577626.1733689351.247165>

iex(3)> Enum.each(0..999,
...(3)>   fn i -> AnnoyEx.add_item(t, i, Enum.map(0..f-1, fn _ -> :rand.normal() end))
...(3)> end)
:ok

iex(4)> AnnoyEx.build(t, 10)
:ok

iex(5)> AnnoyEx.save(t, "test.ann")
:ok

iex(6)> u = AnnoyEx.new(f, :angular)
#Reference<0.3703577626.1733689345.248447>

iex(7)> AnnoyEx.load(u, "test.ann") # super fast, will just mmap the file
:ok

iex(8)> AnnoyEx.get_nns_by_item(u, 0, 10) # will find the 10 nearest neighbors
{[0, 677, 837, 478, 793, 183, 265, 623, 751, 268],
[0.0, 1.1232969760894775, 1.1271791458129883, 1.1428979635238647,
1.1504143476486206, 1.1632753610610962, 1.1647002696990967,
1.1801577806472778, 1.2018792629241943, 1.2058889865875244]}
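The second list in the result holds the distance from the query item to each neighbor. As a rough sketch of what those numbers mean, assuming the index uses the angular metric (which upstream Annoy documents as sqrt(2 * (1 - cos(u, v)))), the distance between two vectors can be reproduced in plain Elixir. The module name `AngularSketch` is hypothetical and is not part of AnnoyEx.

```elixir
defmodule AngularSketch do
  # Hypothetical helper for illustration only; not part of AnnoyEx.
  # Computes sqrt(2 * (1 - cos(u, v))) for two equal-length lists of floats.
  def distance(u, v) do
    dot =
      u
      |> Enum.zip(v)
      |> Enum.reduce(0.0, fn {a, b}, acc -> acc + a * b end)

    cos = dot / (norm(u) * norm(v))

    # Clamp at 0.0 so rounding error cannot produce a negative square-root argument.
    :math.sqrt(max(0.0, 2.0 * (1.0 - cos)))
  end

  # Euclidean length of a vector.
  defp norm(v), do: :math.sqrt(Enum.reduce(v, 0.0, fn a, acc -> acc + a * a end))
end
```

Under this definition identical directions give 0.0, orthogonal vectors give about 1.414, and opposite vectors give 2.0, which is consistent with the random vectors above landing a little below 1.414.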

Word Embeddings with Pretrained GloVe vectors

Go to https://nlp.stanford.edu/projects/glove/ and download a word-embedding file, e.g. https://nlp.stanford.edu/data/glove.42B.300d.zip

Adding the vectors and building the index can be quite slow for a file of this size, but querying afterward is fast. Saving the built index to disk means the build cost is paid only once.

   # Build and save the index.
   idx = AnnoyEx.new(300)

   # Add every GloVe vector to the index, keeping a map from item id to word.
   index_to_word =
     "glove.42B.300d.txt"
     |> File.stream!()
     |> Stream.with_index()
     |> Stream.map(fn {line, item} ->
       # Each line is a word followed by its space-separated vector components.
       [word | fields] = line |> String.trim_trailing() |> String.split(" ")
       vec = Enum.map(fields, fn x -> x |> Float.parse() |> elem(0) end)
       AnnoyEx.add_item(idx, item, vec)
       {item, word}
     end)
     |> Enum.into(%{})

   AnnoyEx.build(idx, 10)

   # Persist both the index and the id-to-word map for later queries.
   AnnoyEx.save(idx, "glove.42B.300d.idx")
   File.write!("glove.42B.300d.i2w", :erlang.term_to_binary(index_to_word))
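The per-line parsing above takes whatever prefix `Float.parse/1` manages to read, so a malformed field would be silently truncated or would crash on `elem/2`. A minimal sketch of a stricter variant, pulled into a pure function so it can be checked without building an index, might look like this. The module name `GloveLine` is hypothetical and is not part of AnnoyEx.

```elixir
defmodule GloveLine do
  # Hypothetical helper for illustration only; not part of AnnoyEx.
  # Splits one line of a GloVe text file into {word, vector}.
  def parse(line) do
    [word | fields] = line |> String.trim_trailing() |> String.split(" ")

    vec =
      Enum.map(fields, fn field ->
        # Require the whole field to be numeric; anything else raises MatchError.
        {value, ""} = Float.parse(field)
        value
      end)

    {word, vec}
  end
end
```

The returned vector can then be passed to `AnnoyEx.add_item/3` exactly as in the script above.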

The saved index and word map can now be queried for similar words, e.g. the 10 closest words to "dog":

iex(1)> index_to_word = File.read!("glove.42B.300d.i2w") |> :erlang.binary_to_term()
%{
  1774702 => "bedanya",
  ...
}

iex(2)> word_to_index = Map.new(index_to_word, fn {k, v} -> {v, k} end)
%{
  "timout" => 816588,
  ...
}

iex(3)> idx = AnnoyEx.new(300)
#Reference<0.2696563938.657326081.165701>

iex(4)> AnnoyEx.load(idx, "glove.42B.300d.idx")
:ok

iex(5)> dog_id = word_to_index["dog"]
828

iex(6)> {word_ids, distances} = AnnoyEx.get_nns_by_item(idx, dog_id, 10)
{[828, 1818, 5203, 3394, 1642, 1937, 6798, 16091, 7080, 16440],
 [0.0, 0.5301183462142944, 0.617365300655365, 0.7635669112205505,
  0.8098030686378479, 0.8700820803642273, 0.8896471261978149,
  0.8973260521888733, 0.9196945428848267, 0.9411463737487793]}

iex(7)> Enum.map(word_ids, fn word_id -> index_to_word[word_id] end)
["dog", "dogs", "puppy", "cats", "animal", "horse", "rabbit", "paws", "pig",
 "paw"]
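The last two steps can be combined so that each word stays next to its distance. The following is a small sketch that works on the `{ids, distances}` tuple shown above. The module name `NeighborWords` is hypothetical and is not part of AnnoyEx.

```elixir
defmodule NeighborWords do
  # Hypothetical helper for illustration only; not part of AnnoyEx.
  # Turns {item_ids, distances} into a list of {word, distance} pairs.
  # Ids missing from the map come back as nil.
  def label({ids, distances}, index_to_word) do
    ids
    |> Enum.zip(distances)
    |> Enum.map(fn {id, distance} -> {Map.get(index_to_word, id), distance} end)
  end
end
```

With the session above it could be used as `AnnoyEx.get_nns_by_item(idx, dog_id, 10) |> NeighborWords.label(index_to_word)`.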

Installation

Note: this library currently runs only on Linux.

You will need a C++14 compiler and the Erlang header files required for compiling NIFs.

If available in Hex, the package can be installed by adding annoy_ex to your list of dependencies in mix.exs:

def deps do
  [
    {:annoy_ex, "~> 1.0.0"}
  ]
end

Working with source

Before running the tests, make sure to build the shared library by running: make annoy