Can't start training with a SentenceLabelDataset due to error `'SentenceLabelDataset' object has no attribute 'column_names'`
HenningDinero opened this issue · 2 comments
I have 4 targets/clusters, say ["car", "airplane", "boat", "train"], and 1000 sentences for each class. I want to fine-tune a model to create similar embeddings within each class.
As far as I can understand, that is where the SentenceLabelDataset could be used, or, looking at #2920, the GroupByLabelBatchSampler, or maybe just the "usual way" using MNRL and creating anchor/positive pairs within each class (although that would create negatives from the same class as well, which is why I'll try the other approaches).
Currently I'm trying with SentenceLabelDataset, but I'm struggling to start the training. Please find below some (pseudo) code:
# For creating the data
def _create_input_example(df: pd.DataFrame):
    label = df.name
    return InputExample(guid=label, texts=df["Documents"], label=label)

def get_main_data() -> tuple[Dataset, Dataset]:
    data = get_data()
    le = LabelEncoder()
    le.fit(data["TransportType"])
    data["label"] = le.transform(data["TransportType"])
    training_data = data.query("TrainTest=='TRAIN'")
    val_data = data.query("TrainTest=='VALIDATE'")
    training_examples = training_data.groupby(["label"])[["Documents"]].apply(_create_input_example)
    val_examples = val_data.groupby(["label"])[["Documents"]].apply(_create_input_example)
    train_dataset = SentenceLabelDataset(training_examples, samples_per_label=32, with_replacement=True)
    val_dataset = SentenceLabelDataset(val_examples, samples_per_label=32, with_replacement=True)
    #train_dataloader = NoDuplicatesDataLoader(train_dataset, batch_size=32)
    #val_dataloader = NoDuplicatesDataLoader(val_dataset, batch_size=32)
    return train_dataset, val_dataset
And the training:
steps = 10
train_data, val_data = get_main_data()
model = SentenceTransformer(
    "intfloat/multilingual-e5-small"
)
loss = losses.MultipleNegativesRankingLoss(model, scale=20.0, similarity_fct=util.cos_sim)
training_args = SentenceTransformerTrainingArguments(
    # Required parameter:
    output_dir="./sbert_fitted/",
    # Optional training parameters:
    num_train_epochs=1,
    eval_steps=steps,
    eval_strategy="steps",
    save_strategy="steps",
    save_steps=steps,
    logging_steps=steps,
)
trainer = SentenceTransformerTrainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=val_data,
    loss=loss,
)
trainer.train()
This throws the error AttributeError: 'SentenceLabelDataset' object has no attribute 'column_names'.
I have also tried using the NoDuplicatesDataLoader, but that gives the error AttributeError: 'NoDuplicatesDataLoader' object has no attribute 'column_names'.
So 2 questions:
- Is the creation of the labeled dataset correct, i.e. simply creating one InputExample for each target, where the texts are just all the documents for the given target?
- Can you see where I'm going wrong with the errors?
Hello!
Apologies for the confusion here! Sentence Transformers v3 refactored the training approach, and the old approach still exists (for now) so people can still use that if they prefer. What's happening here is that you're using components of the new training approach (SentenceTransformerTrainer, SentenceTransformerTrainingArguments) together with components of the old approach (SentenceLabelDataset, InputExample).
Instead, my recommendation is to move fully to the new approach. Let's start with a loss function. If we have sentences with classes, then we can use one of these loss functions:
(Loss Overview docs)
For example, the BatchAllTripletLoss uses single sentences and class labels as inputs. The SentenceTransformerTrainer then expects the training/evaluation dataset to be a Dataset from the datasets package with 2 columns. As explained in the Dataset Format docs, the class labels must be in a column called label or score, while the texts can be in a column with any name.
So, we'll get something like:
# E.g. 0: sports, 1: economy, 2: politics
train_dataset = Dataset.from_dict({
    "sentence": [
        "He played a great game.",
        "The stock is up 20%",
        "They won 2-1.",
        "The last goal was amazing.",
        "They all voted against the bill.",
    ],
    "label": [0, 1, 0, 0, 2],
})
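If your data starts out as (text, class name) rows, as in the question, the two columns can be built along these lines. This is only a sketch: the helper name build_label_columns and the example rows are made up for illustration, and the integer ids are simply assigned in sorted order of the class names.

```python
def build_label_columns(rows):
    """Turn (text, class_name) pairs into a text column and an integer label column."""
    # Assign each class name a stable integer id, in sorted order of the names.
    class_names = sorted({class_name for _, class_name in rows})
    class_to_id = {name: idx for idx, name in enumerate(class_names)}
    columns = {
        # The text column may have any name; "sentence" is just an example.
        "sentence": [text for text, _ in rows],
        # The class column must be called "label" (or "score").
        "label": [class_to_id[class_name] for _, class_name in rows],
    }
    return columns, class_to_id


# Hypothetical example rows, not taken from the original data.
rows = [
    ("The ferry leaves at noon.", "boat"),
    ("The sedan needs new tires.", "car"),
    ("The yacht is docked.", "boat"),
    ("Boarding starts at gate 12.", "airplane"),
]
columns, class_to_id = build_label_columns(rows)
print(class_to_id)
print(columns["label"])
```

The resulting dictionary has the same shape as the one passed to Dataset.from_dict in the example here, so it should be usable in the same way.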
Then, we can follow the BatchAllTripletLoss recommendation: using batch_sampler=BatchSamplers.GROUP_BY_LABEL. This ensures that each batch contains at least 2 examples per class, which is what makes the loss most useful.
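To see why at least 2 examples per class matter, here is a rough illustration that counts how many (anchor, positive, negative) triplets can be formed from the labels in one batch. It is not the library's implementation, just a plain-Python sketch of the batch-all idea: the positive shares the anchor's label, the negative does not. The function name count_valid_triplets is hypothetical.

```python
def count_valid_triplets(labels):
    """Count (anchor, positive, negative) index triplets within one batch of labels."""
    count = 0
    n = len(labels)
    for a in range(n):
        for p in range(n):
            # A positive is a different example with the same label as the anchor.
            if p == a or labels[p] != labels[a]:
                continue
            for neg in range(n):
                # A negative is any example with a different label than the anchor.
                if labels[neg] != labels[a]:
                    count += 1
    return count


# Every class appears once: no positives exist, so no triplets can be formed.
print(count_valid_triplets([0, 1, 2, 3]))
# Every class appears twice: each anchor has one positive and two negatives.
print(count_valid_triplets([0, 0, 1, 1]))
```

A batch in which every label is unique gives the loss nothing to learn from, which is the situation that grouping by label is meant to avoid.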
A minimal script should become something like:
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    losses,
)
from sentence_transformers.training_args import BatchSamplers

steps = 10

model = SentenceTransformer("microsoft/mpnet-base")

# E.g. 0: sports, 1: economy, 2: politics
train_dataset = Dataset.from_dict({
    "sentence": [
        "He played a great game.",
        "The stock is up 20%",
        "They won 2-1.",
        "The last goal was amazing.",
        "They all voted against the bill.",
    ],
    "label": [0, 1, 0, 0, 2],
})

loss = losses.BatchAllTripletLoss(model)

args = SentenceTransformerTrainingArguments(
    # Required parameter:
    output_dir="./sbert_fitted/",
    # Optional training parameters:
    num_train_epochs=1,
    batch_sampler=BatchSamplers.GROUP_BY_LABEL,
    # Add eval_strategy="steps" and eval_steps=steps once you also pass an eval_dataset
    save_strategy="steps",
    save_steps=steps,
    logging_steps=steps,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
Afterwards, you can also experiment with the common MultipleNegativesRankingLoss, but as shown in the Loss Overview, you'll need for example (anchor, positive) pairs or (anchor, positive, negative) triplets. This kind of data can be created from your data with something like:
for each class:
    for each sentence in class:
        anchor = sentence
        positive = random sentence from the same class
        negative = random sentence from any other class
and then you'll have a bunch of triplets. Then you can use BatchSamplers.NO_DUPLICATES, because it can be bad if a batch contains the same text multiple times. There's a decent chance that this form of training results in better performance - I can't say for sure.
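The triplet construction described here could look roughly like the sketch below. The function name build_triplets, the seed argument, and the example sentences are all made up for illustration. One small deviation from the pseudocode, as an assumption on my side: the positive is drawn from the other sentences of the class, so an anchor is never paired with itself.

```python
import random


def build_triplets(sentences_by_class, seed=0):
    """Build one (anchor, positive, negative) triplet per sentence."""
    rng = random.Random(seed)  # fixed seed so that runs are reproducible
    triplets = []
    for label, sentences in sentences_by_class.items():
        # Candidate negatives: every sentence from any other class.
        negatives_pool = [
            sentence
            for other_label, group in sentences_by_class.items()
            if other_label != label
            for sentence in group
        ]
        for anchor in sentences:
            # Candidate positives: same class, but not the anchor itself.
            positives_pool = [s for s in sentences if s != anchor]
            if not positives_pool or not negatives_pool:
                continue  # skip sentences that cannot form a full triplet
            triplets.append({
                "anchor": anchor,
                "positive": rng.choice(positives_pool),
                "negative": rng.choice(negatives_pool),
            })
    return triplets


# Hypothetical example data, not taken from the original question.
sentences_by_class = {
    "car": ["The sedan needs new tires.", "She parked the hatchback."],
    "boat": ["The ferry leaves at noon.", "The yacht is docked."],
}
triplets = build_triplets(sentences_by_class)
print(len(triplets))
print(triplets[0])
```

The resulting records can then be loaded into a Dataset whose columns are, in order, the anchor, the positive, and the negative.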
- Tom Aarsen
Thank you very much!
Yes, I might've mixed v2 and v3 (I had some v2 training scripts that I tried to adapt to v3, and I might've forgotten something here and there).
I'll give both of your suggestions a go and see how it goes :)