UKPLab/sentence-transformers

CrossEncoder .rank condition error in CrossEncoder.py

saeeddhqan opened this issue · 6 comments

I get the following error when I use the .rank method:

File /usr/local/lib/python3.12/dist-packages/sentence_transformers/cross_encoder/CrossEncoder.py:551, in CrossEncoder.rank(self, query, documents, top_k, return_documents, batch_size, show_progress_bar, num_workers, activation_fct, apply_softmax, convert_to_numpy, convert_to_tensor)
    548     if return_documents:
    549         results[-1].update({"text": documents[i]})
--> 551 results = sorted(results, key=lambda x: x["score"], reverse=True)
    552 return results[:top_k]

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

I use sentence_transformers v3.3.0.

A snippet:

import sentence_transformers

cross_encoder = sentence_transformers.CrossEncoder("amberoad/bert-multilingual-passage-reranking-msmarco", device='cpu', max_length=256)
cross_encoder.rank(query, docs)

@saeeddhqan can you also share the structure of your query and docs?

cross_encoder.rank('docx', ['doc1', 'doc2', 'doc3'])

@saeeddhqan it works for me

from sentence_transformers.cross_encoder import CrossEncoder
cross_encoder = CrossEncoder("cross-encoder/stsb-distilroberta-base")

cross_encoder.rank('docx', ['doc1', 'doc2', 'doc3'])

Response

[{'corpus_id': 0, 'score': np.float32(0.5175216)},
 {'corpus_id': 2, 'score': np.float32(0.4488596)},
 {'corpus_id': 1, 'score': np.float32(0.43759635)}]

@JINO-ROHIT The issue seems to be model specific.

@saeeddhqan thanks for opening! The CrossEncoder class wraps the AutoModelForSequenceClassification class from transformers, and those models can predict logits for $n$ classes per sequence (query-document pairs in this case). The CrossEncoder.predict method calls this underlying model and returns all of its predictions. For amberoad/bert-multilingual-passage-reranking-msmarco, that's 2 values per pair:

from sentence_transformers.cross_encoder import CrossEncoder

cross_encoder = CrossEncoder("amberoad/bert-multilingual-passage-reranking-msmarco", device='cpu', max_length=256)
print(cross_encoder.predict([('docx', 'doc1')]))
# [[-1.2904704  1.1504961]]
print(cross_encoder.config.num_labels)
# 2

whereas for a lot of CrossEncoder models (e.g. cross-encoder/stsb-distilroberta-base) it's just 1:

from sentence_transformers.cross_encoder import CrossEncoder

cross_encoder = CrossEncoder("cross-encoder/stsb-distilroberta-base", device='cpu', max_length=256)
print(cross_encoder.predict([('docx', 'doc1')]))
# [0.51752156]
print(cross_encoder.config.num_labels)
# 1

Beyond that, the CrossEncoder.rank method internally calls CrossEncoder.predict and then expects each query-document pair to produce exactly 1 value (i.e. that the model only has 1 label). With 2 labels, each "score" is an array, so the sorted call in the traceback ends up comparing arrays, which is what triggers the ValueError. What's missing is a raise ValueError in CrossEncoder.rank if self.config.num_labels != 1, because if there are multiple values per prediction, then it's unclear which one denotes the similarity. In short: CrossEncoder models with more than 1 label can't be used with CrossEncoder.rank at the moment, only with CrossEncoder.predict, and then you can do the ranking yourself if you know which value corresponds to similarity.
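For anyone landing here with a multi-label model, doing the ranking yourself could look roughly like the sketch below. The helper name rank_by_label and the example logits are made up for illustration, and treating column 1 as the "relevant" label is an assumption that needs to be checked against the model card of the model you use.

```python
def rank_by_label(scores, label_index, top_k=None):
    """Rank documents by one column of multi-label predict output.

    scores: one row of per-label values per query-document pair,
        e.g. what CrossEncoder.predict returns for a 2-label model.
    label_index: the column to treat as the relevance score. Which column
        that is depends on the model, so this is the caller's assumption.
    """
    results = [
        {"corpus_id": i, "score": float(row[label_index])}
        for i, row in enumerate(scores)
    ]
    # Each score is now a plain float, so sorting is unambiguous.
    results.sort(key=lambda x: x["score"], reverse=True)
    return results[:top_k]


# Made-up logits for three documents, shaped like the 2-label output above.
example_scores = [[-1.29, 1.15], [0.80, -0.95], [-2.10, 2.30]]
ranked = rank_by_label(example_scores, label_index=1)
print([r["corpus_id"] for r in ranked])
# [2, 0, 1]
```

With a real model, the scores would come from something like cross_encoder.predict([(query, doc) for doc in docs]), and the result has the same corpus_id/score layout that CrossEncoder.rank returns for single-label models.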

  • Tom Aarsen

ahh okay, makes sense. I can help with a PR for this if you're not already working on it 😊

That would be much appreciated!

  • Tom Aarsen