Nixietune is a GPU fine-tuning harness for semantic search models. Built for the Nixiesearch search engine:
- a set of state-of-the-art recipes to fine-tune existing generic semantic search models like E5/BGE/MiniLM on your data
- based on the battle-tested sentence-transformers library, but uses the modern Hugging Face ecosystem for training: multi-GPU and distributed training, FP16/BF16 mixed precision, gradient checkpointing/accumulation, and dataset caching.
- can be used with and without hard negatives, and supports InfoNCE/Cosine/Contrastive/Triplet losses.
To fine-tune a semantic search embedding model on your data:
- Install nixietune: you need a GPU for that!
- Format your data in the nixietune format: a JSONL file with a specific schema.
- Run the training: for base/small models it takes less than an hour on a single GPU.
- Tinker with params: choose the best loss and make your model training faster.
Nixietune is published to PyPI:
```bash
# set up the environment
python -m venv .venv && source .venv/bin/activate
# install nixietune
pip install nixietune
```
- Nixietune is tested with Python 3.10 and 3.11.
- Python 3.12 is not yet supported by PyTorch.
Nixietune expects a specific JSONL input format for your documents:
```json
{
  "query": "pizza",
  "positive": [
    "Standard Serious Pizza",
    "60 Seconds to Napoli"
  ],
  "negative": [
    "Burgermeister",
    "Risa Chicken"
  ]
}
```
The document schema can be described as:
- `query`: required, string. An anchor search query for the whole group of documents.
- `positive`: required, list[string]. One or more positive documents for the query above.
- `negative`: optional, list[string]. Zero or more negative documents for the query.
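As a sketch of how such a file can be produced, here is one way to write a dataset in this format; the file name and example rows are illustrative:

```python
import json

# Illustrative data: one training group per search query.
groups = [
    {
        "query": "pizza",
        "positive": ["Standard Serious Pizza", "60 Seconds to Napoli"],
        "negative": ["Burgermeister", "Risa Chicken"],
    },
    # "negative" may be omitted when you have no hard negatives.
    {"query": "burger", "positive": ["Burgermeister"]},
]

# JSONL: one JSON object per line.
with open("train.jsonl", "w") as f:
    for group in groups:
        f.write(json.dumps(group) + "\n")
```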
The `InfoNCE` loss also supports training without explicit negatives: in that case, all the other in-batch positives are treated as negatives for each query.
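To illustrate the idea (this is a minimal sketch, not nixietune's actual implementation; the temperature value is a common but arbitrary choice):

```python
import torch
import torch.nn.functional as F

def infonce_in_batch(query_emb: torch.Tensor, doc_emb: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE with in-batch negatives: document i is the positive for
    query i, and every other document in the batch acts as a negative."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature          # [batch, batch] cosine similarities
    labels = torch.arange(q.size(0), device=q.device)  # diagonal is the target
    return F.cross_entropy(logits, labels)

# usage with random vectors standing in for encoder outputs
loss = infonce_in_batch(torch.randn(8, 384), torch.randn(8, 384))
```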
Let's fine-tune a `sentence-transformers/all-MiniLM-L6-v2` embedding model on the `nixiesearch/ms-marco-hard-negatives` dataset, using the InfoNCE loss.
```bash
python -m nixietune examples/msmarco.json
```
The `msmarco.json` configuration file is based on the Hugging Face Transformers `TrainingArguments`, with some extra settings:
```json
{
  "train_dataset": "nixiesearch/ms-marco-hard-negatives",
  "eval_dataset": "nixiesearch/ms_marco",
  "seq_len": 128,
  "target": "infonce",
  "model_name_or_path": "sentence-transformers/all-MiniLM-L6-v2",
  "output_dir": "minilm-msmarco-infonce8",
  "num_train_epochs": 1,
  "seed": 33,
  "per_device_train_batch_size": 256,
  "per_device_eval_batch_size": 256,
  "fp16": true,
  "logging_dir": "logs",
  "gradient_checkpointing": true,
  "gradient_accumulation_steps": 1,
  "dataloader_num_workers": 14,
  "eval_steps": 0.05,
  "logging_steps": 0.05,
  "evaluation_strategy": "steps",
  "torch_compile": true,
  "report_to": [],
  "save_strategy": "epoch",
  "num_negatives": 8,
  "query_prefix": "query: ",
  "document_prefix": "passage: "
}
```
It takes around 60 minutes to fine-tune `all-MiniLM-L6-v2` on the MS MARCO hard negatives dataset on a single RTX 4090 GPU.
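Because training runs through the Hugging Face Trainer, multi-GPU data-parallel runs can typically be launched with the stock `torchrun` launcher. This is a sketch, assuming two local GPUs and that the nixietune module entrypoint behaves under `torchrun` as it does under `python -m`:

```bash
# sketch: data-parallel training on 2 GPUs
torchrun --nproc_per_node=2 -m nixietune examples/msmarco.json
```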
The following training parameters are worth tuning:
- `target`: the training recipe. Currently supported targets are `infonce`/`cosine_similarity`/`contrastive`/`triplet`. If not sure, start with `infonce`.
- `model_name_or_path`: which model to fine-tune. Any SBERT-supported model should work.
- `per_device_train_batch_size`: batch size. Values that are too small lead to sub-par quality and slow training; values that are too large need a lot of VRAM. Start with 128 and go up.
- `seq_len`: context length of the model. It is usually around 128-160 for most models on the MTEB leaderboard.
- `gradient_checkpointing`: reduces VRAM usage significantly (up to 70%) at a small ~10% performance penalty, since activations are recomputed during the backward pass instead of stored. If unsure, choose `true`.
- `num_negatives`: for the `infonce`/`triplet` targets, how many negatives to select from the dataset.
- `query_prefix` and `document_prefix`: prompt labels for asymmetric models, where the model distinguishes between query and document passages (see the inference sketch after this list).
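At inference time, the same prefixes must be applied to queries and documents. A minimal sketch with sentence-transformers, assuming the model saved to `output_dir` loads as a regular SBERT model; the path and texts are illustrative:

```python
from sentence_transformers import SentenceTransformer

# Illustrative path: the output_dir from the training config above.
model = SentenceTransformer("minilm-msmarco-infonce8")

# Use the same prefixes the model was fine-tuned with.
query_emb = model.encode("query: best pizza in berlin")
doc_emb = model.encode(["passage: Standard Serious Pizza serves Neapolitan-style pies."])
```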
Apache 2.0