Nixietune is a GPU fine-tuning harness for semantic search models. Built for the Nixiesearch search engine:
- a set of state-of-the-art recipes to fine-tune existing generic semantic search models like E5/BGE/MiniLM on your data
- based on the battle-tested sentence-transformers library, but uses the modern Hugging Face ecosystem for training: multi-GPU and distributed training, FP16/BF16 mixed precision, gradient checkpointing/accumulation, and dataset caching.
- can be used with and without hard negatives, and supports InfoNCE/Cosine/Contrastive/Triplet losses.
To fine-tune a semantic search embedding model on your data:
- Install nixietune: you need a GPU for that!
- Format your data in the nixietune format: a JSONL file with a specific schema.
- Run the training: for base/small models it takes less than an hour on a single GPU.
- Tinker with params: choose the best loss and make your model training faster.
Nixietune is published to PyPI:
```bash
# set up the environment
python -m venv .venv && source .venv/bin/activate
# install nixietune
pip install nixietune
```
- Nixietune is tested with Python 3.10 and 3.11.
- Python 3.12 is not yet supported by PyTorch.
Nixietune expects a specific JSONL input format for your documents:
```json
{
  "query": "pizza",
  "positive": [
    "Standard Serious Pizza",
    "60 Seconds to Napoli"
  ],
  "negative": [
    "Burgermeister",
    "Risa Chicken"
  ]
}
```
The document schema can be described as:
- `query`: required, string. An anchor search query for the whole group of documents.
- `positive`: required, list[string]. One or more positive documents for the query above.
- `negative`: optional, list[string]. Zero or more negative documents for the query.
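As a sketch of how such a file can be produced, here is one way to write a dataset in this format; the file name and example rows are illustrative:

```python
import json

# Illustrative data: one training group per search query.
groups = [
    {
        "query": "pizza",
        "positive": ["Standard Serious Pizza", "60 Seconds to Napoli"],
        "negative": ["Burgermeister", "Risa Chicken"],
    },
    # "negative" may be omitted when you have no hard negatives.
    {"query": "burger", "positive": ["Burgermeister"]},
]

# JSONL: one JSON object per line.
with open("train.jsonl", "w") as f:
    for group in groups:
        f.write(json.dumps(group) + "\n")
```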
The `InfoNCE` loss also supports training without explicit negatives: in that case, all the other in-batch positives are treated as negatives for each query.
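To illustrate the idea (this is a minimal sketch, not nixietune's actual implementation; the temperature value is a common but arbitrary choice):

```python
import torch
import torch.nn.functional as F

def infonce_in_batch(query_emb: torch.Tensor, doc_emb: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE with in-batch negatives: document i is the positive for
    query i, and every other document in the batch acts as a negative."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature          # [batch, batch] cosine similarities
    labels = torch.arange(q.size(0), device=q.device)  # diagonal is the target
    return F.cross_entropy(logits, labels)

# usage with random vectors standing in for encoder outputs
loss = infonce_in_batch(torch.randn(8, 384), torch.randn(8, 384))
```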
Let's fine-tune a `sentence-transformers/all-MiniLM-L6-v2` embedding model on the `nixiesearch/ms-marco-hard-negatives` dataset, using the InfoNCE loss.
```bash
python -m nixietune examples/msmarco.json
```
The `msmarco.json` configuration file is based on the Hugging Face Transformers `TrainingArguments`, with some extra settings:
```json
{
  "train_dataset": "nixiesearch/ms-marco-hard-negatives",
  "eval_dataset": "nixiesearch/ms_marco",
  "seq_len": 128,
  "target": "infonce",
  "model_name_or_path": "sentence-transformers/all-MiniLM-L6-v2",
  "output_dir": "minilm-msmarco-infonce8",
  "num_train_epochs": 1,
  "seed": 33,
  "per_device_train_batch_size": 256,
  "per_device_eval_batch_size": 256,
  "fp16": true,
  "logging_dir": "logs",
  "gradient_checkpointing": true,
  "gradient_accumulation_steps": 1,
  "dataloader_num_workers": 14,
  "eval_steps": 0.05,
  "logging_steps": 0.05,
  "evaluation_strategy": "steps",
  "torch_compile": true,
  "report_to": [],
  "save_strategy": "epoch",
  "num_negatives": 8,
  "query_prefix": "query: ",
  "document_prefix": "passage: "
}
```
It takes around 60 minutes to fine-tune `all-MiniLM-L6-v2` on the MS MARCO hard negatives dataset on a single RTX 4090 GPU.
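Because training runs through the Hugging Face Trainer, multi-GPU data-parallel runs can typically be launched with the stock `torchrun` launcher. This is a sketch, assuming two local GPUs and that the nixietune module entrypoint behaves under `torchrun` as it does under `python -m`:

```bash
# sketch: data-parallel training on 2 GPUs
torchrun --nproc_per_node=2 -m nixietune examples/msmarco.json
```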
The following training parameters are worth tuning:
- `target`: the training recipe. Currently supported targets are `infonce`/`cosine_similarity`/`contrastive`/`triplet`. If not sure, start with `infonce`.
- `model_name_or_path`: which model to fine-tune. Any SBERT-supported model should work.
- `per_device_train_batch_size`: batch size. Values that are too small lead to sub-par quality and slow training; values that are too large need a lot of VRAM. Start with 128 and go up.
- `seq_len`: context length of the model. It is usually around 128-160 for most models on the MTEB leaderboard.
- `gradient_checkpointing`: reduces VRAM usage significantly (up to 70%) at a small ~10% performance penalty, since activations are recomputed during the backward pass instead of stored. If unsure, choose `true`.
- `num_negatives`: for the `infonce`/`triplet` targets, how many negatives to select from the dataset.
- `query_prefix` and `document_prefix`: prompt labels for asymmetric models, where the model distinguishes between query and document passages (see the inference sketch after this list).
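At inference time, the same prefixes must be applied to queries and documents. A minimal sketch with sentence-transformers, assuming the model saved to `output_dir` loads as a regular SBERT model; the path and texts are illustrative:

```python
from sentence_transformers import SentenceTransformer

# Illustrative path: the output_dir from the training config above.
model = SentenceTransformer("minilm-msmarco-infonce8")

# Use the same prefixes the model was fine-tuned with.
query_emb = model.encode("query: best pizza in berlin")
doc_emb = model.encode(["passage: Standard Serious Pizza serves Neapolitan-style pies."])
```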
Apache 2.0