CrossEncoder .rank condition error in CrossEncoder.py

Question

saeeddhqan opened this issue 17 days ago · 6 comments

I get the following error when I use the .rank method:

File /usr/local/lib/python3.12/dist-packages/sentence_transformers/cross_encoder/CrossEncoder.py:551, in CrossEncoder.rank(self, query, documents, top_k, return_documents, batch_size, show_progress_bar, num_workers, activation_fct, apply_softmax, convert_to_numpy, convert_to_tensor)
    548     if return_documents:
    549         results[-1].update({"text": documents[i]})
--> 551 results = sorted(results, key=lambda x: x["score"], reverse=True)
    552 return results[:top_k]

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

I use sentence_transformers v3.3.0.

A snippet:

cross_encoder = sentence_transformers.CrossEncoder("amberoad/bert-multilingual-passage-reranking-msmarco", device='cpu', max_length=256)
cross_encoder.rank(query, docs)

Answer 1 · 2024-12-09T15:45:40.000Z

@saeeddhqan can you also share the structure of your query and docs?

Answer 2 · 2024-12-10T06:08:22.000Z

cross_encoder.rank('docx', ['doc1', 'doc2', 'doc3'])

Answer 3 · 2024-12-10T08:21:58.000Z

@saeeddhqan it works for me

from sentence_transformers.cross_encoder import CrossEncoder
cross_encoder = CrossEncoder("cross-encoder/stsb-distilroberta-base")

cross_encoder.rank('docx', ['doc1', 'doc2', 'doc3'])

Response

[{'corpus_id': 0, 'score': np.float32(0.5175216)},
 {'corpus_id': 2, 'score': np.float32(0.4488596)},
 {'corpus_id': 1, 'score': np.float32(0.43759635)}]

Answer 4 · 2024-12-10T08:51:22.000Z

@JINO-ROHIT The issue seems to be model-specific.

@saeeddhqan thanks for opening! The CrossEncoder class wraps around the AutoModelForSequenceClassification class from transformers, and those models can predict logits for $n$ classes per sequence (query-document pairs in this case). The CrossEncoder.predict method will call this underlying model and return all predictions. For amberoad/bert-multilingual-passage-reranking-msmarco, that's 2:

from sentence_transformers.cross_encoder import CrossEncoder

cross_encoder = CrossEncoder("amberoad/bert-multilingual-passage-reranking-msmarco", device='cpu', max_length=256)
print(cross_encoder.predict([('docx', 'doc1')]))
# [[-1.2904704  1.1504961]]
print(cross_encoder.config.num_labels)
# 2

whereas for a lot of CrossEncoder models (e.g. cross-encoder/stsb-distilroberta-base) it's just 1:

from sentence_transformers.cross_encoder import CrossEncoder

cross_encoder = CrossEncoder("cross-encoder/stsb-distilroberta-base", device='cpu', max_length=256)
print(cross_encoder.predict([('docx', 'doc1')]))
# [0.51752156]
print(cross_encoder.config.num_labels)
# 1
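
To make the difference concrete, here is a minimal sketch that uses plain NumPy arrays in place of real model output. Only the first 2-label row is taken from the output above; the other values are made up, and `try_sort` is a hypothetical helper name. It applies the same sorted call shown in the traceback to both shapes:

```python
import numpy as np

# Shaped like predict() output for a 2-label model: one row of 2 values per pair.
# Only the first row comes from the thread; the rest are illustrative.
two_label_scores = np.array(
    [[-1.2904704, 1.1504961], [0.3, -0.2], [-0.5, 0.9]], dtype=np.float32
)
# Shaped like predict() output for a 1-label model: one value per pair (illustrative).
one_label_scores = np.array([0.52, 0.44, 0.45], dtype=np.float32)


def try_sort(scores):
    """Sort results by score as in the traceback; return the error message, if any."""
    results = [{"corpus_id": i, "score": s} for i, s in enumerate(scores)]
    try:
        sorted(results, key=lambda x: x["score"], reverse=True)
    except ValueError as exc:
        return str(exc)
    return None


one_label_error = try_sort(one_label_scores)
two_label_error = try_sort(two_label_scores)
print(one_label_error)  # no error: scalar scores can be compared
print(two_label_error)  # comparing 2-element arrays has no single truth value
```

With one value per pair the sort succeeds; with two values per pair, comparing two score arrays yields an array of booleans, which sorted cannot reduce to a single True or False.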

Beyond that, the CrossEncoder.rank method internally calls CrossEncoder.predict and expects each query-document pair to produce exactly 1 value (i.e. that the model has only 1 label). With 2 labels, each "score" is an array of 2 values, so the sorted call in your traceback fails when it tries to compare them. What's missing is a raise ValueError in CrossEncoder.rank if self.config.num_labels != 1, because if there are multiple values per prediction, it's unclear which one denotes the similarity. In short: CrossEncoder models with more than 1 label can't be used with CrossEncoder.rank at the moment, only with CrossEncoder.predict, and then you can do the ranking yourself if you know which value corresponds to similarity.
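
As a rough sketch of doing the ranking yourself, the helper below sorts documents by one column of a multi-label score array. `rank_from_logits` and `label_index` are hypothetical names, not part of sentence_transformers, and which column means "relevant" is an assumption you would need to confirm from the model card. The example array stands in for real predict output; only its first row comes from the thread.

```python
import numpy as np


def rank_from_logits(logits, label_index, top_k=None):
    """Rank documents by one column of a (num_docs, num_labels) score array.

    label_index is the column ASSUMED to denote relevance; verify it per model.
    """
    logits = np.asarray(logits)
    if logits.ndim != 2:
        raise ValueError(f"expected a 2-D array of per-label scores, got shape {logits.shape}")
    scores = logits[:, label_index]
    order = np.argsort(-scores)  # indices sorted by descending score
    if top_k is not None:
        order = order[:top_k]
    return [{"corpus_id": int(i), "score": float(scores[i])} for i in order]


# Stand-in for cross_encoder.predict(...) on three query-document pairs.
example_logits = np.array(
    [[-1.2904704, 1.1504961], [0.8, -0.7], [-0.1, 0.2]], dtype=np.float32
)
ranking = rank_from_logits(example_logits, label_index=1)  # column 1 is an assumption
print(ranking)
```

With a real model, the input would presumably be the result of cross_encoder.predict([(query, doc) for doc in docs]), and corpus_id then indexes into docs in the same way as the output of CrossEncoder.rank.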

Tom Aarsen

Answer 5 · 2024-12-10T09:02:13.000Z

Ahh okay, makes sense. I can help with a PR for this if you're not already working on it 😊

Answer 6 · 2024-12-10T09:39:45.000Z

That would be much appreciated!

Tom Aarsen