postgresml/postgresml

[feature] Support SetFit: few-shot fine-tuning of Sentence Transformers, works with about 8 samples per class.

stargazer33 opened this issue · 0 comments

It would be great to add SetFit to PostgresML.

See
https://pypi.org/project/setfit/
https://huggingface.co/docs/setfit

SetFit for text classification differs from other libraries: usually, to train or fine-tune a model you need thousands of samples per class. In this example
https://postgresml.org/docs/open-source/pgml/guides/llms/fine-tuning
the "train" split of the IMDB dataset contains 25K rows. There are 2 classes, so 12500 samples per class.

Quoting the SetFit documentation:

"It achieves high accuracy with little labeled data - for instance, with only 8 labeled examples per class on the Customer Reviews sentiment dataset ..."

The code where they train a classifier (again classifying film reviews, so nothing really new here) is here:
https://huggingface.co/docs/setfit/main/quickstart#training

The sample_dataset function there keeps only 8 samples for each class.
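
To make the per-class sampling idea concrete, here is a minimal pure-Python sketch. It is not SetFit's implementation of sample_dataset; the function name sample_per_class, its parameters, and the toy rows are all made up for illustration, and the toy data only mimics the shape of a 25K-row, 2-class training split.

```python
import random
from collections import defaultdict


def sample_per_class(rows, label_key="label", num_samples=8, seed=42):
    """Return up to `num_samples` randomly chosen rows for each label.

    Hypothetical helper for illustration only, not part of SetFit.
    """
    rng = random.Random(seed)  # local RNG so results are reproducible

    # Group rows by their class label.
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    # Draw a small random subset from each class.
    sampled = []
    for label in sorted(by_label):
        group = by_label[label]
        k = min(num_samples, len(group))  # a class may have fewer rows
        sampled.extend(rng.sample(group, k))
    return sampled


# Toy stand-in for a 25K-row, 2-class training split (assumed data).
rows = [{"text": f"review {i}", "label": i % 2} for i in range(25_000)]
few_shot = sample_per_class(rows)
print(len(few_shot))  # 16 rows: 8 per class
```

The resulting 16 rows are the kind of tiny training set a few-shot method is meant to work with, as opposed to the full 25K rows.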

Compare this:
12500 samples per class
vs
8 samples per class with SetFit

In real life, you can often collect around 50 samples per class and use SetFit to train a model.
Situations where you have tens of thousands of samples are quite rare.
Let's support SetFit.