hackerllama/blog/posts/sentence_embeddings/
utterances-bot opened this issue · 7 comments
hackerllama - Sentence Embeddings
Everything you wanted to know about sentence embeddings (and maybe a bit more)
https://osanseviero.github.io/hackerllama/blog/posts/sentence_embeddings/
Hello, author, many thanks for your explanation. You said "We’ll start using all-MiniLM-L6-v2. It’s not the best open-source embedding model", so I want to know: which model is the best, and where can I find a list of the best models? I am new to this, sorry for the basic question, thank you very much!
Hi @songxujay, the author covers this in the "Selecting and evaluating models" section. Have a look at it. One of the main sources is still the MTEB Leaderboard - https://huggingface.co/spaces/mteb/leaderboard
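Once you pick a model from the leaderboard, swapping it in is usually a one-line change. A minimal sketch, assuming you have `sentence-transformers` installed; the model ID below is just an illustrative pick from the leaderboard, not a claim about what currently tops it:

```python
from sentence_transformers import SentenceTransformer

# Any model ID from the MTEB Leaderboard can go here;
# "BAAI/bge-small-en-v1.5" is only an example choice.
model = SentenceTransformer("BAAI/bge-small-en-v1.5")

embeddings = model.encode(["How do I find the best embedding model?"])
print(embeddings.shape)  # (1, embedding_dim) - 384 for this model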
Hi! Thank you for the great article. To better understand the differences between word2vec- and Transformer-based embeddings, could you elaborate on how the masked language modelling objective of BERT differs from the CBOW objective in word2vec (which, as I understand it, is also about "filling in a blank")? Is it that the objectives are similar but the neural net architectures differ, allowing BERT to add contextual info?
Hey @arnoldlayne0! Overall you're right, the BERT and CBOW objectives have some similarities. Here are some differences:
- CBOW's context window is fixed, so it doesn't capture broader context outside the window
- CBOW treats all context words the same way, while BERT uses attention mechanisms to weigh each token embedding differently
- Because of this, CBOW also has no sense of word order or directionality within the window
- BERT can actually mask multiple tokens at the same time

You can see the contextual part in action with the quick fill-mask sketch below.
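A minimal sketch using the `transformers` fill-mask pipeline; `bert-base-uncased` is just a convenient checkpoint and the example sentences are made up for illustration:

```python
from transformers import pipeline

# bert-base-uncased is only an illustrative checkpoint
fill = pipeline("fill-mask", model="bert-base-uncased")

# BERT's prediction for the masked position depends on the whole
# sentence, not just a fixed window of neighboring words, which is
# the contextual behavior CBOW's averaged window lacks.
print(fill("The river [MASK] was muddy after the storm.")[0]["token_str"])
print(fill("I deposited the check at the [MASK] this morning.")[0]["token_str"])
```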
I think something has changed about the quora dataset used in the Colab example. I'm getting this error:

```python
from datasets import load_dataset

dataset = load_dataset("quora")["train"]
```

```
TypeError: http_get() got an unexpected keyword argument 'displayed_filename'
```
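This traceback looks like a version mismatch between `datasets` and `huggingface_hub` rather than a problem with the dataset itself (an assumption based on the error, not verified against the notebook); upgrading both in the runtime (`pip install -U datasets huggingface_hub`) and restarting may resolve it.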
Just what I needed when entering the world of LLMs, thanks a lot!
Hi, I would like to know: other than Sentence Transformers (SBERT), what other open-source sentence embedding methods can I choose from?
I found two other options, InferSent and Google's USE. But InferSent seems dead now, and USE is not widely used either. In 2024, I don't think I should use Doc2Vec or Word2Vec, right?
So why did SBERT take over sentence embedding methods?