hackerllama/blog/posts/sentence_embeddings/
utterances-bot opened this issue · 7 comments
hackerllama - Sentence Embeddings
Everything you wanted to know about sentence embeddings (and maybe a bit more)
https://osanseviero.github.io/hackerllama/blog/posts/sentence_embeddings/
Hello, author, many thanks for your explanation. You said "We’ll start using all-MiniLM-L6-v2. It’s not the best open-source embedding model", so I want to know: which model is the best, and where can I find a list of the best models? I am new to this, sorry for the basic question, thank you very much!
Hi @songxujay, the author covers this in the "Selecting and evaluating models" section. Have a look at it. One of the main sources is still the MTEB Leaderboard - https://huggingface.co/spaces/mteb/leaderboard
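Once you pick a model from the leaderboard, swapping it in is usually a one-line change. A minimal sketch, assuming you have `sentence-transformers` installed; the model ID below is just an illustrative pick from the leaderboard, not a claim about what currently tops it:

```python
from sentence_transformers import SentenceTransformer

# Any model ID from the MTEB Leaderboard can go here;
# "BAAI/bge-small-en-v1.5" is only an example choice.
model = SentenceTransformer("BAAI/bge-small-en-v1.5")

embeddings = model.encode(["How do I find the best embedding model?"])
print(embeddings.shape)  # (1, embedding_dim) - 384 for this model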
Hi! Thank you for the great article. To better understand the differences between word2vec- and Transformer-based embeddings, could you elaborate on how the masked language modelling objective of BERT differs from the CBOW objective in word2vec (which, as I understand it, is also about "filling in a blank")? Is it that the objectives are similar but the neural net architectures differ, allowing BERT to add contextual info?
Hey @arnoldlayne0! Overall you're right, the BERT and CBOW objectives have some similarities. Here are some differences:
- CBOW's context window is fixed, so it doesn't capture broader context outside the window
- CBOW treats all context words the same way, while BERT uses attention mechanisms to weigh each token embedding differently
- Because of this, CBOW also has no sense of word order or directionality within the window
- BERT can actually mask multiple tokens at the same time

You can see the contextual part in action with the quick fill-mask sketch below.
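A minimal sketch using the `transformers` fill-mask pipeline; `bert-base-uncased` is just a convenient checkpoint and the example sentences are made up for illustration:

```python
from transformers import pipeline

# bert-base-uncased is only an illustrative checkpoint
fill = pipeline("fill-mask", model="bert-base-uncased")

# BERT's prediction for the masked position depends on the whole
# sentence, not just a fixed window of neighboring words, which is
# the contextual behavior CBOW's averaged window lacks.
print(fill("The river [MASK] was muddy after the storm.")[0]["token_str"])
print(fill("I deposited the check at the [MASK] this morning.")[0]["token_str"])
```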
I think something has changed about the quora dataset used in the Colab example. I'm getting this error:

```python
from datasets import load_dataset

dataset = load_dataset("quora")["train"]
```

```
TypeError: http_get() got an unexpected keyword argument 'displayed_filename'
```
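This traceback looks like a version mismatch between `datasets` and `huggingface_hub` rather than a problem with the dataset itself (an assumption based on the error, not verified against the notebook); upgrading both in the runtime (`pip install -U datasets huggingface_hub`) and restarting may resolve it.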
Just what I needed when entering the world of LLMs, thanks a lot!
Hi, I would like to know: other than Sentence Transformers (SBERT), what other open-source sentence embedding methods can I choose from?
I found two other options, InferSent and Google's USE. But InferSent seems dead now, and USE is not widely used either. In 2024, I don't think I should use Doc2Vec or Word2Vec, right?
So why did SBERT take over sentence embedding methods?