UKPLab/sentence-transformers

Best way to log specific metrics e.g cross-entropy/accuracy during training

HenningDinero opened this issue · 5 comments

Say I have the following setup:

import os

from sentence_transformers import SentenceTransformerTrainer, SentenceTransformerTrainingArguments
from transformers import EarlyStoppingCallback

training_args = SentenceTransformerTrainingArguments(
    # Required parameter:
    output_dir="./sbert_fitted/",
    # Optional training parameters:
    num_train_epochs=1,
    per_device_train_batch_size=per_device_train_batch_size,
    per_device_eval_batch_size=per_device_eval_batch_size,
    warmup_ratio=0.1,
    learning_rate=learning_rate,
    fp16=False,  # Set to False if you get an error that your GPU can't run on FP16
    bf16=False,  # Set to True if you have a GPU that supports BF16
    # Optional tracking/debugging parameters:
    eval_steps=steps,
    eval_strategy="steps",
    save_strategy="steps",
    save_steps=steps,
    save_total_limit=2,
    logging_steps=steps,
    run_name="sts",  # Will be used in W&B if `wandb` is installed
    report_to="none",
    dataloader_drop_last=True,
    load_best_model_at_end=True,
    push_to_hub=True,
    hub_model_id=hub_model_id,
    hub_token=os.environ["HUGGING_FACE_API_TOKEN"],
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    loss=loss,
    callbacks=[
        EarlyStoppingCallback(early_stopping_patience=5, early_stopping_threshold=1e-4),
        tb_callback,
    ],
)
trainer.train()

I have a TensorBoard callback where I can log the train/eval loss like this:

from torch.utils.tensorboard import SummaryWriter
from transformers.integrations import TensorBoardCallback


class MyTensorBoardCallback(TensorBoardCallback):
    def __init__(self, tb_writer):
        super().__init__(tb_writer=tb_writer)

    def on_log(self, args, state, control, model=None, logs=None, **kwargs):
        if logs is not None:

            # Log the training loss
            if "loss" in logs:
                self.tb_writer.add_scalars("Loss", {"train": logs["loss"]}, state.global_step)

            # Log the evaluation loss
            if "eval_loss" in logs:
                self.tb_writer.add_scalars("Loss", {"eval": logs["eval_loss"]}, state.global_step)

        self.tb_writer.flush()


tb_writer = SummaryWriter(log_dir=folder_name)

tb_callback = MyTensorBoardCallback(tb_writer=tb_writer)

Different loss functions (or parameter settings) produce losses on different scales. For example, I assume that

loss_1 = losses.MultipleNegativesRankingLoss(model, scale=10, similarity_fct=util.cos_sim)
loss_2 = losses.MultipleNegativesRankingLoss(model, scale=20, similarity_fct=util.cos_sim)

would produce different loss values, so if loss_1 < loss_2 on my validation set I can't conclude that loss_1 gives a better model; its loss function may simply produce lower values.
I would like to add an additional metric that is the same for all loss functions, e.g. accuracy or cross-entropy, that can be logged in the TensorBoard callback. That way I can use that metric for hyper-parameter optimization.

Is there a way to do that? Right now I think I could use a "training_loss" and an "optimize_loss", e.g.

training_loss = losses.MultipleNegativesRankingLoss(model, scale=10, similarity_fct=util.cos_sim)
optimize_loss = losses.MultipleNegativesRankingLoss(model, scale=20, similarity_fct=util.cos_sim)

where I then only change training_loss, e.g. change the scale from 10 to 1 (and always keep optimize_loss the same), and use optimize_loss to compare the validation loss across various training_loss settings (I'm struggling to pass both losses, though).

What is the best way of doing this? And is there a way to write a "custom function", e.g. for accuracy/cross-entropy?

Hello!

I think what you're looking for is a form of evaluation that isn't just an evaluation loss, because that can't be compared across different loss functions (or parameters). Luckily, Sentence Transformers supports quite a few out of the box: https://sbert.net/docs/sentence_transformer/training_overview.html#evaluator
Depending on your setup, different evaluators can be interesting (e.g. TripletEvaluator, EmbeddingSimilarityEvaluator, and InformationRetrievalEvaluator are common). I see now that this docs page is also missing the very convenient NanoBEIREvaluator, which is useful if you're training to optimize general-purpose English retrieval performance.

You can pass these to the SentenceTransformerTrainer via the evaluator argument. They will be computed whenever the evaluation loss is computed, and they'll be logged automatically.
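
For example, something like this (a minimal sketch; the "anchor"/"positive"/"negative" column names and the choice of TripletEvaluator are just assumptions based on your setup):

from sentence_transformers.evaluation import TripletEvaluator

# Hypothetical example: pick the evaluator and columns that match your data
dev_evaluator = TripletEvaluator(
    anchors=val_dataset["anchor"],
    positives=val_dataset["positive"],
    negatives=val_dataset["negative"],
    name="sts-dev",
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    loss=loss,
    evaluator=dev_evaluator,  # runs at every evaluation; results are logged alongside eval_loss
)
trainer.train()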

  • Tom Aarsen

I stumbled across the evaluators after I'd posted the issue and forgot to close it. Sorry about that!

Regarding the output of the TripletEvaluator (it is called eval_cosine_accuracy in the logs): how is that calculated? Looking into the source code, it seems to be just the ratio of triplets where dist_func(a, p) < dist_func(a, n) (with dist_func being a distance function, e.g. cosine, a the anchor, p the positive, and n the negative).
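
In sketch form, I read it as something like this (illustrative numbers only; with cosine similarity the comparison flips to "greater than"):

import numpy as np

# cosine similarity between anchor/positive and anchor/negative for four triplets
sim_ap = np.array([0.8, 0.4, 0.4, 0.8])
sim_an = np.array([0.3, 0.6, 0.2, 0.7])

# accuracy = fraction of triplets where the positive is closer to the anchor than the negative
accuracy = np.mean(sim_ap > sim_an)  # 3/4 = 0.75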

I'm looking for something a bit more "cross-entropy"-ish, i.e. where a higher similarity gives a higher score, such that if we have two similarity measures between anchor and positive

d1 = 0.5 
d2 = 0.99

where both of them are correct, i.e. the positive is closer to the anchor than the negative, then d2 should yield a greater score.
A simple way would just be the dot product between the similarity and the boolean "is closer" vector, i.e. similarity * is_closer, where is_closer is 1 if the positive is closer than the negative, else 0.

import numpy as np

is_true = np.array([1, 0, 1, 0])      # 1 if the positive is closer than the negative, else 0
sim = np.array([0.8, 0.4, 0.4, 0.8])  # similarity between anchor and positive
score = is_true @ sim                 # 0.8 + 0.4 = 1.2

This could furthermore be divided by the number of elements.
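
With the toy numbers above, that would be something like:

mean_score = score / len(is_true)  # 1.2 / 4 = 0.3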

Is this doable? And is that a scoring function you would like to have added? (I'll be more than happy to try to open a PR for it.)

Indeed, you have a good understanding of how the TripletEvaluator calculates its values.
We don't have a CE-based evaluator currently because we often don't strictly care about what the similarity values are (or what their range is), but more about whether the most relevant pairs have the highest similarity. In other words, that is what the TripletEvaluator, or the Spearman correlation from the IR evaluator, is good for.

Having said that, people have different requirements and needs, which is why it's possible to subclass the base SentenceEvaluator and make your own. I'd definitely recommend doing that; I think you'll be able to set up a nice evaluator fairly quickly. I don't think I'd include it as an evaluator in Sentence Transformers itself, however, for the aforementioned reasons.

If you set greater_is_better in __init__ and primary_metric during __call__, then the model card generation will be able to nicely include this evaluation in the generated model card as well. It'll be included both in Metrics and in Training Logs.
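
As a rough sketch of what such a subclass could look like (the class name, the metric name, and the exact conventions here are just assumptions on my side; have a look at the existing evaluators for the precise contract in your installed version):

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import SentenceEvaluator
from sentence_transformers.util import cos_sim


class WeightedTripletAccuracyEvaluator(SentenceEvaluator):
    """Hypothetical evaluator: mean of sim(anchor, positive) over the triplets
    where the positive is closer to the anchor than the negative (0 otherwise)."""

    def __init__(self, anchors, positives, negatives, name: str = "weighted_triplet"):
        super().__init__()
        self.anchors = anchors
        self.positives = positives
        self.negatives = negatives
        self.name = name
        self.greater_is_better = True  # higher score = better model

    def __call__(self, model: SentenceTransformer, output_path: str = None, epoch: int = -1, steps: int = -1) -> dict:
        emb_a = model.encode(self.anchors, convert_to_tensor=True)
        emb_p = model.encode(self.positives, convert_to_tensor=True)
        emb_n = model.encode(self.negatives, convert_to_tensor=True)

        # pairwise cosine similarities between anchors and positives/negatives
        sim_ap = cos_sim(emb_a, emb_p).diagonal()
        sim_an = cos_sim(emb_a, emb_n).diagonal()

        # similarity weighted by the "positive is closer" indicator, averaged over triplets
        is_closer = (sim_ap > sim_an).float()
        score = float((sim_ap * is_closer).mean())

        self.primary_metric = f"{self.name}_weighted_accuracy"
        return {self.primary_metric: score}

You'd then pass an instance of it to the trainer via evaluator=, just like the built-in evaluators.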

  • Tom Aarsen

I'll have a look into the SentenceEvaluator and see if that works (the TripletEvaluator might be sufficient). Thanks again for the great help you provide! :)