ankane/neighbor

`nearest_neighbors` on average of relational vectors?

marckohlbrugge opened this issue · 2 comments

This is in part a more generic question about embeddings and vectors, but I'm curious how it would apply to neighbor specifically.

Let's say I have a table of paragraphs, each with their appropriate embeddings.

Each paragraph belongs_to :chapter.

How can I use neighbor to find the nearest chapter for a given embedding?

My understanding (though I'm not 100% sure of this) is that you could simply average the paragraph embeddings for a given chapter to get that chapter's embedding. In other words, if you calculated an embedding for the whole chapter text, you'd end up with the same vector as the average of the individual paragraph embeddings.

For example, I tried the following, but `nearest_neighbors` isn't defined on `ActiveRecord::Relation`:

```ruby
Chapter.joins(:paragraphs).nearest_neighbors("AVG(paragraphs.embedding)", [0.9, 1.3, 1.1], distance: "euclidean").first
```

Hey @marckohlbrugge, you'll need to store the embeddings on chapters.

I'm not an expert in this area, but you could try averaging to see how it goes. Another idea would be to use the same approach that generated the paragraph vectors. You could also use LDA to generate embeddings from scratch for each chapter (with something like Tomoto).
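As a rough illustration of the averaging idea, here is a minimal sketch in plain Ruby. `mean_vector` is a hypothetical helper (not part of neighbor), and the commented-out line at the end assumes a `Chapter` model with an `embedding` column and that each paragraph's `embedding` reads back as an array of floats.

```ruby
# Element-wise mean of equal-length vectors (plain Ruby arrays of numbers).
# `mean_vector` is a hypothetical helper, not part of neighbor.
def mean_vector(vectors)
  raise ArgumentError, "need at least one vector" if vectors.empty?

  dims = vectors.first.length
  unless vectors.all? { |v| v.length == dims }
    raise ArgumentError, "vectors must share the same dimension"
  end

  # transpose groups the i-th component of every vector together
  vectors.transpose.map { |component| component.sum.to_f / vectors.length }
end

paragraph_embeddings = [
  [1.0, 2.0, 3.0],
  [3.0, 4.0, 5.0]
]
chapter_embedding = mean_vector(paragraph_embeddings)
# chapter_embedding is [2.0, 3.0, 4.0]

# Assumption: storing the result on the chapter might then look like
# chapter.update!(embedding: mean_vector(chapter.paragraphs.map(&:embedding)))
```

Whether the averaged vector is a good stand-in for a true chapter embedding depends on the model that produced the paragraph vectors, so it is worth checking the results on real data.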

Also, you'll likely want to use cosine distance (and set normalize: true if using cube):

```ruby
Chapter.nearest_neighbors(:embedding, [0.9, 1.3, 1.1], distance: "cosine").first
```
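To make the difference between the two distances concrete, here is a small plain-Ruby sketch. It assumes cosine distance is defined as 1 minus cosine similarity, which is the common convention. All helper names (`norm`, `normalize`, `cosine_distance`, `euclidean_distance`) are hypothetical and independent of neighbor.

```ruby
# Length (L2 norm) of a vector.
def norm(v)
  Math.sqrt(v.sum { |x| x * x })
end

# Scale a vector to unit length.
def normalize(v)
  n = norm(v)
  raise ArgumentError, "cannot normalize a zero vector" if n.zero?

  v.map { |x| x / n }
end

# Cosine distance, assumed here to be 1 - cosine similarity.
def cosine_distance(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  1.0 - dot / (norm(a) * norm(b))
end

def euclidean_distance(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

short = [3.0, 4.0]
long  = [6.0, 8.0] # same direction as `short`, twice the length

# Cosine distance ignores magnitude, so these two are (numerically) identical.
same_direction = cosine_distance(short, long)

# Euclidean distance still sees them as 5.0 apart.
apart = euclidean_distance(short, long)

# After normalization, both vectors have unit length and the two metrics
# rank neighbors the same way: euclidean**2 == 2 * cosine distance.
a = normalize([3.0, 4.0])
b = normalize([1.0, 0.0])
squared_euclidean = euclidean_distance(a, b)**2
twice_cosine = 2 * cosine_distance(a, b)
```

This matters for averaged vectors in particular: the mean of several unit-length vectors is generally shorter than unit length, so a magnitude-insensitive distance (or re-normalizing after averaging) avoids penalizing chapters for that.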

Thanks @ankane, this is very helpful.

Averaging the vectors seems to provide relevant results, so I'll go with that.

Will also switch to "cosine" 👍