BeLLM: Backward Dependency Enhanced Large Language Model for Sentence Embeddings (NAACL24)
Arxiv: https://arxiv.org/abs/2311.05296
💡 Highlight: To the best of our knowledge, our work is the first to extensively investigate the effects of backward dependencies in autoregressive LLM architectures for sentence embedding learning.
The angle_emb and billm packages are required. You can install them by running the following command:
python -m pip install -r requirements.txt
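After installation, a quick import check can confirm both packages are available. This is a minimal sketch, not part of the original repo; it mirrors the examples below by setting BiLLM_START_INDEX before importing any model code.

import os

# The training and inference examples below set BiLLM_START_INDEX before
# importing model code, so we do the same here.
os.environ.setdefault('BiLLM_START_INDEX', '31')

import angle_emb
import billm

print('angle_emb and billm imported successfully')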
We trained our models on the MultiNLI and SNLI datasets (available combined as AllNLI from sentence-transformers: https://sbert.net/datasets/AllNLI.tsv.gz).
We use the following preprocessing steps to obtain the training set:
- Transform the original format to {"text": "text", "positive": "positive of text", "negative": "negative of text"}.
- Augment the negative samples with retrieval and reranking techniques.
We have pushed the processed training set to HuggingFace as SeanLee97/all_nli_angle_format_b (used as --train_name_or_path in the training command below).
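To take a quick look at the processed training set, it can be loaded with the datasets library. This is a minimal sketch; the 'train' split name is an assumption.

from datasets import load_dataset

# Load the processed training set from HuggingFace.
ds = load_dataset('SeanLee97/all_nli_angle_format_b', split='train')  # split name assumed

# Each record should follow the {"text", "positive", "negative"} format above.
print(ds.column_names)
print(ds[0])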
BiLLM_START_INDEX=31 WANDB_MODE=disabled CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 --master_port=1234 train.py \
--train_name_or_path SeanLee97/all_nli_angle_format_b \
--save_dir ckpts/bellm-llama-7b-nli \
--model_name NousResearch/Llama-2-7b-chat-hf \
--prompt_template 'The representative word for sentence {text} is:"' \
--pooling_strategy avg \
--ibn_w 20.0 --cosine_w 0.0 --angle_w 1.0 --learning_rate 2e-4 --maxlen 60 \
--apply_lora 1 --lora_r 64 --lora_alpha 128 --lora_dropout 0.1 \
--is_llm 1 --apply_billm 1 --billm_model_class LlamaForCausalLM \
--push_to_hub 0 \
--logging_steps 5 --save_steps 50 --warmup_steps 80 --batch_size 256 --seed 42 --load_kbit 4 \
--gradient_accumulation_steps 32 --epochs 3 --fp16 1
If you want to push the model to HuggingFace automatically, you can add the following extra arguments:
--push_to_hub 1 \
--hub_model_id {YOUR_MODEL_ID} \
--hub_private_repo 1
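Pushing to the Hub also requires an authenticated HuggingFace session. One way to log in from Python is shown below; the huggingface_hub login API is standard, but reading the token from an HF_TOKEN environment variable is only an illustration (huggingface-cli login works as well).

import os
from huggingface_hub import login

# Authenticate before launching training with --push_to_hub 1.
login(token=os.environ['HF_TOKEN'])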
- Continue to fine-tune on the augmented data:
BiLLM_START_INDEX=31 WANDB_MODE=disabled CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 --master_port=1234 train.py \
--train_name_or_path SeanLee97/all_nli_aug_angle_format_b \
--pretrained_lora_path ckpts/bellm-llama-7b-nli \
--save_dir ckpts/bellm-llama-7b-nli-2 \
--model_name NousResearch/Llama-2-7b-hf \
--ibn_w 1.0 --cosine_w 0.0 --angle_w 0.0 --learning_rate 2e-4 --maxlen 60 \
--is_llm 1 --apply_lora 1 --lora_r 32 --lora_alpha 32 --lora_dropout 0.1 \
--push_to_hub 0 \
--save_steps 200 --batch_size 256 --seed 42 --load_kbit 4 --gradient_accumulation_steps 32 --epochs 3 --fp16 1
Tips:
- Here we only use the contrastive learning loss (ibn_w = 1.0, cosine_w = 0.0, angle_w = 0.0). It is recommended to also use the AnglE loss (set angle_w > 0) to further improve performance.
- BiLLM_START_INDEX=31 sets all layers from index 31 onward to be bidirectional. Since LLaMA-7B has 32 layers (indices 0-31), BiLLM_START_INDEX=31 converts only the final layer to bidirectional attention (see the sketch below).
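A small sketch of the start-index semantics described in the tip above (illustrative only; the layer count comes from the tip, and the environment variable is set before importing model code, as in the examples elsewhere in this README):

import os

# BiLLM_START_INDEX must be in the environment before importing any billm/model code.
os.environ['BiLLM_START_INDEX'] = '31'

NUM_LAYERS = 32  # LLaMA-7B has 32 decoder layers, indexed 0..31
start = int(os.environ['BiLLM_START_INDEX'])
bidirectional_layers = [i for i in range(NUM_LAYERS) if i >= start]
print(bidirectional_layers)  # [31] -> only the final layer becomes bidirectional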
- Download the SentEval datasets:
cd SentEval/data
sh download_dataset.sh
- Evaluate on the STS benchmarks:
BiLLM_START_INDEX=31 CUDA_VISIBLE_DEVICES=0 python eval_sts.py \
--model_name_or_path NousResearch/Llama-2-7b-hf \
--lora_name_or_path SeanLee97/bellm-llama-7b-nli \
--apply_bfloat16 0
Results:
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness | Avg. |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| 78.36 | 90.88 | 86.28 | 89.89 | 86.59 | 88.89 | 83.17 | 86.29 |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
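The Avg. column is the unweighted mean of the seven task scores, which can be verified quickly:

# Quick check that Avg. is the mean of the seven STS/SICK scores above.
scores = [78.36, 90.88, 86.28, 89.89, 86.59, 88.89, 83.17]
print(round(sum(scores) / len(scores), 2))  # 86.29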
Here, we combine AnglE and BiLLM for inference:
import os
# set environment variable for BiLLM_START_INDEX before importing the model
os.environ['BiLLM_START_INDEX'] = '31'
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
from scipy import spatial
from model import AnglE
# 1. load model
model = AnglE.from_pretrained('NousResearch/Llama-2-7b-hf', pretrained_lora_path='SeanLee97/bellm-llama-7b-nli').cuda()
# 2. set prompt
model.set_prompt(prompt='The representative word for sentence {text} is:"')
# 3. encode
docs = ['I like apples', 'I like fruit', 'i am hiking.']
vecs = model.encode([{'text': doc} for doc in docs])
print('cos sim (0, 1):', 1 - spatial.distance.cosine(vecs[0], vecs[1]))
print('cos sim (0, 2):', 1 - spatial.distance.cosine(vecs[0], vecs[2]))
print('cos sim (1, 2):', 1 - spatial.distance.cosine(vecs[1], vecs[2]))
Output:
cos sim (0, 1): 0.8061720132827759
cos sim (0, 2): 0.2913861870765686
cos sim (1, 2): 0.29943591356277466
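Building on the model loaded above, here is an illustrative follow-up (not from the original repo) that ranks a small corpus against a query using plain cosine similarity over the encoded vectors:

import numpy as np

# Reuses `model` from the snippet above.
corpus = ['I like apples', 'I like fruit', 'i am hiking.']
query = 'Which fruit do you like?'

corpus_vecs = np.asarray(model.encode([{'text': doc} for doc in corpus]))
query_vec = np.asarray(model.encode([{'text': query}]))[0]

# Cosine similarity between the query and every corpus sentence.
sims = corpus_vecs @ query_vec / (
    np.linalg.norm(corpus_vecs, axis=1) * np.linalg.norm(query_vec)
)
for doc, sim in sorted(zip(corpus, sims), key=lambda x: -x[1]):
    print(f'{sim:.4f}  {doc}')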
You can fine-tune the model on your own dataset by setting --pretrained_lora_path to one of our pre-trained LoRA models (e.g., SeanLee97/bellm-llama-7b-nli).
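A minimal sketch for preparing your own data: write (text, positive, negative) triplets as JSONL in the format described in the data section above. Whether train.py accepts a local file via --train_name_or_path (as the flag name suggests) is an assumption; if not, push the file to the Hub as a dataset first.

import json

# Your own (text, positive, negative) triplets; contents here are placeholders.
records = [
    {'text': 'I like apples', 'positive': 'I like fruit', 'negative': 'i am hiking.'},
    # ...
]

# Write them as JSONL in the {"text", "positive", "negative"} training format.
with open('my_train.jsonl', 'w') as f:
    for r in records:
        f.write(json.dumps(r) + '\n')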
@inproceedings{li2024bellm,
title = "BeLLM: Backward Dependency Enhanced Large Language Model for Sentence Embeddings",
author = "Li, Xianming and Li, Jing",
booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics",
year = "2024",
publisher = "Association for Computational Linguistics"
}
You are also welcome to follow related works:
- AnglE (BeLLM's elder sister 👭): https://arxiv.org/abs/2309.12871
- LS-LLaMA (BeLLM's father 👨🏻): https://arxiv.org/abs/2310.01208
- We are happy to have you here! Feel free to open an issue (with a title starting with [Friendship Request]) to report related works.