CrossEncoder .rank condition error in CrossEncoder.py
saeeddhqan opened this issue · 6 comments
I get the following error when I use the .rank method:
File /usr/local/lib/python3.12/dist-packages/sentence_transformers/cross_encoder/CrossEncoder.py:551, in CrossEncoder.rank(self, query, documents, top_k, return_documents, batch_size, show_progress_bar, num_workers, activation_fct, apply_softmax, convert_to_numpy, convert_to_tensor)
548 if return_documents:
549 results[-1].update({"text": documents[i]})
--> 551 results = sorted(results, key=lambda x: x["score"], reverse=True)
552 return results[:top_k]
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
I use sentence_transformers v3.3.0.
A snippet:
import sentence_transformers
cross_encoder = sentence_transformers.CrossEncoder("amberoad/bert-multilingual-passage-reranking-msmarco", device='cpu', max_length=256)
cross_encoder.rank(query, docs)
@saeeddhqan can you also share the structure of your query and docs?
cross_encoder.rank('docx', ['doc1', 'doc2', 'doc3'])
@saeeddhqan it works for me
from sentence_transformers.cross_encoder import CrossEncoder
cross_encoder = CrossEncoder("cross-encoder/stsb-distilroberta-base")
cross_encoder.rank('docx', ['doc1', 'doc2', 'doc3'])
Response
[{'corpus_id': 0, 'score': np.float32(0.5175216)},
{'corpus_id': 2, 'score': np.float32(0.4488596)},
{'corpus_id': 1, 'score': np.float32(0.43759635)}]
@JINO-ROHIT The issue seems to be model specific.
@saeeddhqan thanks for opening! The CrossEncoder class wraps around the AutoModelForSequenceClassification class from transformers, and those models can predict logits for one or more labels per input (here, per query-document pair). The CrossEncoder.predict method calls this underlying model and returns all predictions. For amberoad/bert-multilingual-passage-reranking-msmarco, that's 2 values per pair:
from sentence_transformers.cross_encoder import CrossEncoder
cross_encoder = CrossEncoder("amberoad/bert-multilingual-passage-reranking-msmarco", device='cpu', max_length=256)
print(cross_encoder.predict([('docx', 'doc1')]))
# [[-1.2904704 1.1504961]]
print(cross_encoder.config.num_labels)
# 2
whereas for a lot of CrossEncoder models (e.g. cross-encoder/stsb-distilroberta-base) it's just 1:
from sentence_transformers.cross_encoder import CrossEncoder
cross_encoder = CrossEncoder("cross-encoder/stsb-distilroberta-base", device='cpu', max_length=256)
print(cross_encoder.predict([('docx', 'doc1')]))
# [0.51752156]
print(cross_encoder.config.num_labels)
# 1
Beyond that, the CrossEncoder.rank method internally calls CrossEncoder.predict and then expects each query-document pair to produce exactly 1 value (i.e. that the model has only 1 label). With 2 labels, every "score" is an array rather than a single number, so the sorted call in the traceback fails when it compares them. What's missing is a raise ValueError in CrossEncoder.rank if self.config.num_labels != 1, because if there are multiple values per prediction, it's unclear which one denotes the similarity. In short: CrossEncoder models with more than 1 label can't be used with CrossEncoder.rank at the moment, only with CrossEncoder.predict, and then you can do the ranking yourself if you know which value corresponds to similarity.
- Tom Aarsen
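As a rough illustration of "doing the ranking yourself", here is a minimal sketch. The helper name rank_from_predictions and its label_index parameter are hypothetical, not part of sentence_transformers, and which column denotes relevance is an assumption you need to confirm from the model card. The output shape mirrors the corpus_id/score dictionaries shown earlier in the thread.

```python
from typing import Optional, Sequence


def rank_from_predictions(
    scores: Sequence[Sequence[float]],
    documents: Sequence[str],
    label_index: int,
    top_k: Optional[int] = None,
) -> list:
    """Rank documents by one column of multi-label predictions.

    `scores` holds one row of per-label values per document, e.g. what
    CrossEncoder.predict returns for a model with num_labels > 1.
    `label_index` is the column ASSUMED to denote relevance; this is
    model specific and not something the library decides for you.
    """
    if len(scores) != len(documents):
        raise ValueError("scores and documents must have the same length")

    # Reduce each row to a single float so the sort key is unambiguous.
    results = [
        {"corpus_id": i, "score": float(row[label_index]), "text": documents[i]}
        for i, row in enumerate(scores)
    ]
    results.sort(key=lambda item: item["score"], reverse=True)
    return results[:top_k]  # top_k=None keeps every result


# Hypothetical usage with the model from this issue (label_index=1 is an
# assumption; check the model card for the relevant label):
#   pairs = [(query, doc) for doc in docs]
#   scores = cross_encoder.predict(pairs)
#   ranked = rank_from_predictions(scores, docs, label_index=1)
```

Converting each selected value with float() is what sidesteps the ambiguous truth value error, since sorted then compares plain numbers instead of arrays.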
Ahh okay, makes sense. I can help with a PR for this if you're not working on it 😊
That would be much appreciated!
- Tom Aarsen
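For anyone hitting this before a fix lands, the check described above can be approximated from user code. This is only a sketch: rank_with_guard is a hypothetical wrapper and not the library's implementation; it assumes only that the model exposes config.num_labels and a rank method, as shown in the snippets in this thread.

```python
def rank_with_guard(model, query, documents, **kwargs):
    """Call model.rank only if the model predicts a single value per pair.

    Hypothetical wrapper: it mirrors the proposed check
    (`self.config.num_labels != 1`) outside of the library.
    """
    num_labels = model.config.num_labels
    if num_labels != 1:
        raise ValueError(
            f"rank expects a model with 1 label, but this model has {num_labels}. "
            "Use predict and rank on the relevant label yourself."
        )
    # Extra keyword arguments (top_k, return_documents, ...) pass through.
    return model.rank(query, documents, **kwargs)
```

With the amberoad model from this issue (num_labels of 2), this would raise a readable ValueError up front instead of failing inside sorted.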