Finetuner

🎯 Task-oriented finetuning for better embeddings on neural search




Finetuner helps you create experiments to improve embeddings on search tasks. It accompanies you through the last mile of performance tuning for neural search applications.

Fine-tuning is an effective way to improve performance on neural search tasks. However, setting up and performing fine-tuning can be very time-consuming and resource-intensive.

Jina AI's Finetuner makes fine-tuning easier and faster by streamlining the workflow and handling all complexity and infrastructure in the cloud. With Finetuner, one can easily enhance the performance of pre-trained models, making them production-ready without buying expensive hardware.

📈 Performance promise: enhance the performance of pre-trained models and deliver state-of-the-art performance on domain-specific neural search applications.

🔱 Simple yet powerful: easy access to 40+ mainstream loss functions, 10+ optimizers, layer pruning, weight freezing, dimensionality reduction, hard-negative mining, cross-modal models, and distributed training.

☁️ All-in-cloud: train using our free GPU infrastructure, manage runs, experiments and artifacts on Jina AI Cloud without worrying about resource availability, complex integration, or infrastructure costs.

Benchmarks

| Model  | Task                                          | Metric | Pretrained | Finetuned | Delta |
|--------|-----------------------------------------------|--------|------------|-----------|-------|
| BERT   | Quora Question Answering                      | mRR    | 0.835      | 0.967     | 15.8% |
| BERT   | Quora Question Answering                      | Recall | 0.915      | 0.963     | 5.3%  |
| ResNet | Visual similarity search on TLL               | mAP    | 0.110      | 0.196     | 78.2% |
| ResNet | Visual similarity search on TLL               | Recall | 0.249      | 0.460     | 84.7% |
| CLIP   | Deep Fashion text-to-image search             | mRR    | 0.575      | 0.676     | 17.4% |
| CLIP   | Deep Fashion text-to-image search             | Recall | 0.473      | 0.564     | 19.2% |
| M-CLIP | Cross market product recommendation (German)  | mRR    | 0.430      | 0.648     | 50.7% |
| M-CLIP | Cross market product recommendation (German)  | Recall | 0.247      | 0.340     | 37.7% |

All metrics were evaluated at k=20 after training for 5 epochs using the Adam optimizer, with learning rates of 1e-4 for ResNet, 1e-7 for CLIP, and 1e-5 for the BERT models.
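
These settings map onto arguments of finetuner.fit. The sketch below shows how a comparable configuration could be expressed for the ResNet/TLL run used in the Get Started section; the epochs, optimizer and learning_rate parameter names reflect our reading of the finetuner.fit signature and may differ between versions, so check the documentation for your release.

import finetuner

finetuner.login()

run = finetuner.fit(
    model='resnet50',
    train_data='tll-train-data',
    epochs=5,             # the benchmarks above were trained for 5 epochs
    optimizer='Adam',
    learning_rate=1e-4,   # learning rate used for the ResNet benchmark
)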

Install

Make sure you have Python 3.7+ installed. Finetuner can be installed via pip by executing:

pip install -U finetuner

If you want to encode docarray.DocumentArray objects with the finetuner.encode function, you need to install "finetuner[full]". This installs the additional dependencies required for encoding: Torch, Torchvision, and OpenCLIP:

pip install "finetuner[full]"

⚠️ Starting with version 0.5.0, Finetuner computing is performed on Jina AI Cloud. The last local version is 0.4.1. This version is still available for installation via pip. See Finetuner git tags and releases.
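
If you need that last local release, you can pin the version explicitly when installing:

pip install "finetuner==0.4.1"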

Get Started

The following code snippet describes how to fine-tune ResNet50 on the Totally Looks Like dataset. You can run it as-is. The model and training data are already hosted in Jina AI Cloud and Finetuner will download them automatically. (NB: If there is already a run called resnet50-tll-run, choose a different run-name in the code below.)

import finetuner
from finetuner.callback import EvaluationCallback

finetuner.login()

run = finetuner.fit(
    model='resnet50',
    run_name='resnet50-tll-run',
    train_data='tll-train-data',
    callbacks=[
        EvaluationCallback(
            query_data='tll-test-query-data',
            index_data='tll-test-index-data',
        )
    ],
)

This code snippet describes the following steps:

  1. Log in to Jina AI Cloud.
  2. Select the backbone model, the training data, and the evaluation data for your evaluation callback.
  3. Start the cloud run.

You can also pass data to Finetuner as a CSV file or a DocumentArray object, as described in the Finetuner documentation.
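
For example, a labeled DocumentArray can be passed directly as train_data. The following is a minimal sketch; it assumes labels are stored under a finetuner_label tag, so check the Finetuner documentation for the exact format your version expects.

import finetuner
from docarray import Document, DocumentArray

finetuner.login()

# Each Document carries its content plus a class label in its tags
# (the 'finetuner_label' tag key is an assumption here).
train_da = DocumentArray([
    Document(text='This is an apple', tags={'finetuner_label': 'apple_label'}),
    Document(text='This is a pear', tags={'finetuner_label': 'pear_label'}),
])

run = finetuner.fit(
    model='bert-base-cased',
    run_name='bert-docarray-run',  # any unused run name
    train_data=train_da,
)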

Depending on the data, task, model, and hyperparameters, fine-tuning can take some time to finish. You can leave your jobs running on Jina AI Cloud and reconnect to them later using code like this:

import finetuner

finetuner.login()

run = finetuner.get_run('resnet50-tll-run')

for log_entry in run.stream_logs():
    print(log_entry)

run.save_artifact('resnet-tll')

This code logs into Jina AI Cloud, then connects to your run by name. After that, it does the following:

  • Monitors the status of the run and prints out the logs.
  • Saves the model once fine-tuning is done.

Using Finetuner to encode

Finetuner has interfaces for using models to do encoding:

import finetuner
from docarray import Document, DocumentArray

da = DocumentArray([Document(uri='~/Pictures/your_img.png')])

model = finetuner.get_model('resnet-tll')
finetuner.encode(model=model, data=da)

da.summary()

When encoding, you can provide data either as a DocumentArray or a list. Since the modality of your input data can be inferred from the model being used, you do not need to provide any additional information besides the content you want to encode. When data is provided as a list, the finetuner.encode method returns an np.ndarray of embeddings instead of a docarray.DocumentArray:

import finetuner
from docarray import Document, DocumentArray

images = ['~/Pictures/your_img.png']

model = finetuner.get_model('resnet-tll')
embeddings = finetuner.encode(model=model, data=images)
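
Because the result is a plain np.ndarray, you can work with the embeddings directly in NumPy. A small sketch, using two placeholder image paths, that compares the resulting vectors by cosine similarity:

import numpy as np
import finetuner

images = ['~/Pictures/query.png', '~/Pictures/candidate.png']

model = finetuner.get_model('resnet-tll')
embeddings = finetuner.encode(model=model, data=images)  # shape: (2, dim)

# Cosine similarity between the two image embeddings.
similarity = np.dot(embeddings[0], embeddings[1]) / (
    np.linalg.norm(embeddings[0]) * np.linalg.norm(embeddings[1])
)
print(similarity)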

Training on your own data

If you want to train a model using your own dataset instead of one on the Jina AI Cloud, you can provide labeled data in a CSV file.

A CSV file is a tab or comma-delimited plain text file. For example:

This is an apple    apple_label
This is a pear      pear_label
...

The file should have two columns: the first for the data and the second for the category label.
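
For instance, such a file can be produced with Python's standard csv module; the texts and labels below are placeholders:

import csv

rows = [
    ('This is an apple', 'apple_label'),
    ('This is a pear', 'pear_label'),
]

with open('path/to/some/data.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter='\t')  # tab-delimited, matching the example above
    writer.writerows(rows)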

You can then provide a path to a CSV file as training data for Finetuner:

run = finetuner.fit(
    model='bert-base-cased',
    run_name='bert-my-own-run',
    train_data='path/to/some/data.csv',
)

More information on providing your own training data is found in the Prepare Training Data section of the Finetuner documentation.

Next steps

Read our documentation to learn more about what Finetuner can do.

Support

Join Us

Finetuner is backed by Jina AI and licensed under Apache-2.0.

We are actively hiring AI engineers and solution engineers to build the next generation of open-source AI ecosystems.