Finetuner

🎯 Task-oriented finetuning for better embeddings on neural search




Finetuner helps you create experiments to improve embeddings on search tasks. It accompanies you through the last mile of performance tuning for neural search applications.

Fine-tuning is an effective way to improve performance on neural search tasks. However, setting up and performing fine-tuning can be very time-consuming and resource-intensive.

Jina AI's Finetuner makes fine-tuning easier and faster by streamlining the workflow and handling all complexity and infrastructure in the cloud. With Finetuner, one can easily enhance the performance of pre-trained models, making them production-ready without buying expensive hardware.

📈 Performance promise: enhance the performance of pre-trained models and deliver state-of-the-art performance on domain-specific neural search applications.

🔱 Simple yet powerful: easy access to 40+ mainstream loss functions, 10+ optimizers, layer pruning, weight freezing, dimensionality reduction, hard-negative mining, cross-modal models, and distributed training.

☁️ All-in-cloud: train using our free GPU infrastructure, manage runs, experiments and artifacts on Jina AI Cloud without worrying about resource availability, complex integration, or infrastructure costs.

Benchmarks

| Model  | Task                                          | Metric | Pretrained | Finetuned | Delta |
|--------|-----------------------------------------------|--------|------------|-----------|-------|
| BERT   | Quora Question Answering                      | mRR    | 0.835      | 0.967     | 15.8% |
| BERT   | Quora Question Answering                      | Recall | 0.915      | 0.963     | 5.3%  |
| ResNet | Visual similarity search on TLL               | mAP    | 0.110      | 0.196     | 78.2% |
| ResNet | Visual similarity search on TLL               | Recall | 0.249      | 0.460     | 84.7% |
| CLIP   | Deep Fashion text-to-image search             | mRR    | 0.575      | 0.676     | 17.4% |
| CLIP   | Deep Fashion text-to-image search             | Recall | 0.473      | 0.564     | 19.2% |
| M-CLIP | Cross market product recommendation (German)  | mRR    | 0.430      | 0.648     | 50.7% |
| M-CLIP | Cross market product recommendation (German)  | Recall | 0.247      | 0.340     | 37.7% |

All metrics were evaluated at k=20 after training for 5 epochs using the Adam optimizer, with learning rates of 1e-4 for ResNet, 1e-7 for CLIP, and 1e-5 for the BERT models.
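
These settings map onto arguments of finetuner.fit. The sketch below shows how a comparable configuration could be expressed for the ResNet/TLL run used in the Get Started section; the epochs, optimizer and learning_rate parameter names reflect our reading of the finetuner.fit signature and may differ between versions, so check the documentation for your release.

import finetuner

finetuner.login()

run = finetuner.fit(
    model='resnet50',
    train_data='tll-train-data',
    epochs=5,             # the benchmarks above were trained for 5 epochs
    optimizer='Adam',
    learning_rate=1e-4,   # learning rate used for the ResNet benchmark
)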

Install

Make sure you have Python 3.7+ installed. Finetuner can be installed via pip by executing:

pip install -U finetuner

If you want to encode docarray.DocumentArray objects with the finetuner.encode function, you need to install "finetuner[full]". This installs the additional dependencies required for encoding: Torch, Torchvision, and OpenCLIP:

pip install "finetuner[full]"

⚠️ Starting with version 0.5.0, Finetuner computing is performed on Jina AI Cloud. The last local version is 0.4.1. This version is still available for installation via pip. See Finetuner git tags and releases.
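
If you need that last local release, you can pin the version explicitly when installing:

pip install "finetuner==0.4.1"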

Get Started

The following code snippet describes how to fine-tune ResNet50 on the Totally Looks Like dataset. You can run it as-is. The model and training data are already hosted in Jina AI Cloud and Finetuner will download them automatically. (NB: If there is already a run called resnet50-tll-run, choose a different run-name in the code below.)

import finetuner
from finetuner.callback import EvaluationCallback

finetuner.login()

run = finetuner.fit(
    model='resnet50',
    run_name='resnet50-tll-run',
    train_data='tll-train-data',
    callbacks=[
        EvaluationCallback(
            query_data='tll-test-query-data',
            index_data='tll-test-index-data',
        )
    ],
)

This code snippet describes the following steps:

  1. Log in to Jina AI Cloud.
  2. Select the backbone model, the training data, and the evaluation data for your evaluation callback.
  3. Start the cloud run.

You can also pass data to Finetuner as a CSV file or a DocumentArray object, as described in the Finetuner documentation.
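
For example, a labeled DocumentArray can be passed directly as train_data. The following is a minimal sketch; it assumes labels are stored under a finetuner_label tag, so check the Finetuner documentation for the exact format your version expects.

import finetuner
from docarray import Document, DocumentArray

finetuner.login()

# Each Document carries its content plus a class label in its tags
# (the 'finetuner_label' tag key is an assumption here).
train_da = DocumentArray([
    Document(text='This is an apple', tags={'finetuner_label': 'apple_label'}),
    Document(text='This is a pear', tags={'finetuner_label': 'pear_label'}),
])

run = finetuner.fit(
    model='bert-base-cased',
    run_name='bert-docarray-run',  # any unused run name
    train_data=train_da,
)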

Depending on the data, task, model, and hyperparameters, fine-tuning can take some time to finish. You can leave your jobs running on Jina AI Cloud and reconnect to them later using code like this:

import finetuner

finetuner.login()

run = finetuner.get_run('resnet50-tll-run')

for log_entry in run.stream_logs():
    print(log_entry)

run.save_artifact('resnet-tll')

This code logs into Jina AI Cloud, then connects to your run by name. After that, it does the following:

  • Monitors the status of the run and prints out the logs.
  • Saves the model once fine-tuning is done.

Using Finetuner to encode

Finetuner has interfaces for using models to do encoding:

import finetuner
from docarray import Document, DocumentArray

da = DocumentArray([Document(uri='~/Pictures/your_img.png')])

model = finetuner.get_model('resnet-tll')
finetuner.encode(model=model, data=da)

da.summary()

When encoding, you can provide data either as a DocumentArray or a list. Since the modality of your input data can be inferred from the model being used, you do not need to provide any additional information besides the content you want to encode. When data is provided as a list, the finetuner.encode method returns an np.ndarray of embeddings instead of a docarray.DocumentArray:

import finetuner
from docarray import Document, DocumentArray

images = ['~/Pictures/your_img.png']

model = finetuner.get_model('resnet-tll')
embeddings = finetuner.encode(model=model, data=images)
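
Because the result is a plain np.ndarray, you can work with the embeddings directly in NumPy. A small sketch, using two placeholder image paths, that compares the resulting vectors by cosine similarity:

import numpy as np
import finetuner

images = ['~/Pictures/query.png', '~/Pictures/candidate.png']

model = finetuner.get_model('resnet-tll')
embeddings = finetuner.encode(model=model, data=images)  # shape: (2, dim)

# Cosine similarity between the two image embeddings.
similarity = np.dot(embeddings[0], embeddings[1]) / (
    np.linalg.norm(embeddings[0]) * np.linalg.norm(embeddings[1])
)
print(similarity)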

Training on your own data

If you want to train a model using your own dataset instead of one on the Jina AI Cloud, you can provide labeled data in a CSV file.

A CSV file is a tab or comma-delimited plain text file. For example:

This is an apple    apple_label
This is a pear      pear_label
...

The file should have two columns: the first for the data and the second for the category label.
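
For instance, such a file can be produced with Python's standard csv module; the texts and labels below are placeholders:

import csv

rows = [
    ('This is an apple', 'apple_label'),
    ('This is a pear', 'pear_label'),
]

with open('path/to/some/data.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter='\t')  # tab-delimited, matching the example above
    writer.writerows(rows)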

You can then provide a path to a CSV file as training data for Finetuner:

run = finetuner.fit(
    model='bert-base-cased',
    run_name='bert-my-own-run',
    train_data='path/to/some/data.csv',
)

More information on providing your own training data is found in the Prepare Training Data section of the Finetuner documentation.

Next steps

Read our documentation to learn more about what Finetuner can do.

Support

Join Us

Finetuner is backed by Jina AI and licensed under Apache-2.0.

We are actively hiring AI engineers and solution engineers to build the next generation of open-source AI ecosystems.