`nearest_neighbors` on average of relational vectors?
marckohlbrugge opened this issue · 2 comments
This is in part a more generic question about embeddings and vectors, but I'm curious how it would apply to `neighbor` specifically.
Let's say I have a table of `paragraphs`, each with their appropriate `embeddings`. Each `paragraph` `belongs_to :chapter`.
How can I use `neighbor` to find the nearest chapter for a given embedding?
It's my understanding (though I'm not 100% sure of this) that you could simply take the average of the paragraph embeddings for a given chapter to get the embedding of that chapter (i.e. if you were to calculate the embedding vector for the whole chapter text, you'd end up with the same vector as averaging the embeddings of each individual paragraph).
For example, I tried the following, but `nearest_neighbors` isn't defined on `ActiveRecord::Relation`:
Chapter.joins(:paragraphs).nearest_neighbors("AVG(paragraphs.embedding)", [0.9, 1.3, 1.1], distance: "euclidean").first
Hey @marckohlbrugge, you'll need to store the embeddings on chapters.
I'm not an expert in this area, but you could try averaging to see how it goes. Another idea would be to use the same approach that generated the paragraph vectors. You could also use LDA to generate embeddings from scratch for each chapter (with something like Tomoto).
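For the averaging idea, here's a minimal sketch of an element-wise mean in plain Ruby. `average_embedding` is a hypothetical helper, not part of neighbor, and the commented-out backfill assumes you've added an `embedding` column to `chapters` and that `paragraphs.embedding` reads back as an array of floats.

```ruby
# Hypothetical helper (not part of neighbor): element-wise mean of
# a list of equal-length vectors.
def average_embedding(vectors)
  raise ArgumentError, "no vectors given" if vectors.empty?

  dims = vectors.first.length
  unless vectors.all? { |v| v.length == dims }
    raise ArgumentError, "vectors must share the same dimension"
  end

  # transpose groups the values by dimension, so each row can be summed
  # and divided by the number of vectors.
  vectors.transpose.map { |values| values.sum / vectors.length.to_f }
end

paragraph_embeddings = [
  [1.0, 2.0, 3.0],
  [3.0, 4.0, 5.0]
]
chapter_embedding = average_embedding(paragraph_embeddings)

# Assumed backfill inside a Rails app (model and column names are assumptions):
#
#   Chapter.find_each do |chapter|
#     vectors = chapter.paragraphs.pluck(:embedding).map(&:to_a)
#     chapter.update!(embedding: average_embedding(vectors)) if vectors.any?
#   end
```

Since the chapter embedding is stored, it would need to be recomputed whenever a chapter's paragraphs change.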
Also, you'll likely want to use cosine distance (and set `normalize: true` if using cube):
Chapter.nearest_neighbors(:embedding, [0.9, 1.3, 1.1], distance: "cosine").first
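To illustrate why cosine may be the better fit for averaged vectors: the mean of several vectors usually has a different length than the individual vectors, and cosine distance ignores length while Euclidean distance does not. A rough plain-Ruby sketch (both helper functions are hypothetical, written only for illustration, and are not how neighbor computes distances internally):

```ruby
# Hypothetical helpers for illustration only.
def cosine_distance(a, b)
  raise ArgumentError, "dimension mismatch" unless a.length == b.length

  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  raise ArgumentError, "zero vector" if norm_a.zero? || norm_b.zero?

  # 1 - cosine similarity: depends only on the angle between a and b.
  1.0 - dot / (norm_a * norm_b)
end

def euclidean_distance(a, b)
  raise ArgumentError, "dimension mismatch" unless a.length == b.length

  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

query  = [0.9, 1.3, 1.1]
scaled = query.map { |x| x * 2 } # same direction, twice the length

cosine    = cosine_distance(query, scaled)    # approximately zero
euclidean = euclidean_distance(query, scaled) # equals the length of query
```

With cosine, a vector and a rescaled copy of it count as the same point, which is usually what you want when comparing an averaged chapter vector against a single query embedding.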
Thanks @ankane, this is very helpful.
Averaging the vectors seems to provide relevant results, so I'll go with that.
Will also switch to "cosine"
👍