koaning/whatlies

'merge' documentation

argideritzalpea opened this issue · 4 comments

In the documentation, it is unclear that the 'merge' feature concatenates embeddings across the 'names' axis as opposed to the vector axis.

It would be great to either make this explicit in the documentation, or to allow a merge operation that concatenates embedding sets across the numeric / vector axis, e.g.:

EmbeddingSet(Embedding('name', [1, 2, 3])).merge(EmbeddingSet(Embedding('name', [4, 5, 6]))) => EmbeddingSet(Embedding('name', [1, 2, 3, 4, 5, 6]))

Right now I am not aware of an option to do this.
The current behavior returns the following:

EmbeddingSet(Embedding('name', [1, 2, 3])).merge(EmbeddingSet(Embedding('name', [4, 5, 6]))) => EmbeddingSet(Embedding('name', [4, 5, 6]))

as it is designed to merge EmbeddingSets of different names. It would be very useful to concatenate across embedding vectors to combine distinct representations together into single Embeddings.
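To make the difference concrete, here is a minimal sketch in plain Python that does not use whatlies at all. It models an embedding set as a dict from name to vector, which is an assumption for illustration only. The helper names `merge_by_name` and `concat_by_name` are hypothetical and are not part of the whatlies API. The first reproduces the behavior reported above (on a name clash the right-hand vector wins), the second shows the requested concatenation across the vector axis.

```python
from typing import Dict, List

Vector = List[float]


def merge_by_name(left: Dict[str, Vector], right: Dict[str, Vector]) -> Dict[str, Vector]:
    """Union across the names axis; on a name clash the right-hand vector wins."""
    return {**left, **right}


def concat_by_name(left: Dict[str, Vector], right: Dict[str, Vector]) -> Dict[str, Vector]:
    """Concatenate across the vector axis for names present in both sets.

    Names that appear in only one set are dropped here; that is a design
    choice of this sketch, not something the issue specifies.
    """
    shared = [name for name in left if name in right]
    return {name: list(left[name]) + list(right[name]) for name in shared}


a = {"name": [1, 2, 3]}
b = {"name": [4, 5, 6]}

print(merge_by_name(a, b))   # {'name': [4, 5, 6]}
print(concat_by_name(a, b))  # {'name': [1, 2, 3, 4, 5, 6]}
```

Dropping unshared names keeps every resulting vector the same length. Padding the missing half would be another option, but it needs a decision about what the padding value should mean.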

Thanks for raising this issue, @Imod7 will get back to you about it soon✨

Please also check out the docs and the forum in case your issue was raised there too 🤗

I think you raise an interesting point, but I'd like to double check that I am interpreting it correctly. It seems like you're interested in concatenating vectors that have the same token. I'm assuming that the use-case is to combine the output of language models.

Am I correct here? If you're interested in combining language models, it feels like we might want to have a way to concatenate vectors at the Language-level, which is a step before the EmbeddingSet-level.

Hi @koaning, yes, exactly, that is what I meant. At a quick look, introducing such a feature at the EmbeddingSet level seems more flexible, since it would allow combining arbitrary name/vector sets that don't necessarily come from the preset language model APIs. But perhaps this is not an issue? One of my vector sets, for example, requires reading in names and vectors with the EmbeddingSet.from_names_X method, as the embeddings are from a model that isn't supported by the 'Language' class loading.

I'm closing issues because, ever since the project moved to my personal account, it's been more in maintenance mode than an "active work" mode.