TimSchopf/KeyphraseVectorizers

Newer spaCy transformer model backends fail


I use spaCy's transformer model for other purposes (such as NER), so reusing the same model for KeyBERT made sense.
It looks like spaCy changed the transformer output object, which breaks KeyBERT's spaCy backend: `doc._.trf_data` is now a `DocTransformerOutput`, which has no `tensors` attribute.

Sample code:

from keybert import KeyBERT
from spacy import load

nlp = load("en_core_web_trf", exclude=['tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer'])
kw_model = KeyBERT(model=nlp)

text = "This is a test sentence."

keywords = kw_model.extract_keywords(text, keyphrase_ngram_range=(1, 1), stop_words='english', top_n=1, use_mmr=True)
print(keywords)

Expected behavior:
prints [("test", ...)]

Observed behavior:

Traceback (most recent call last):
  File "...\anaconda3\envs\env\lib\site-packages\keybert\backend\_spacy.py", line 84, in embed
    self.embedding_model(doc)._.trf_data.tensors[-1][0].tolist()
AttributeError: 'DocTransformerOutput' object has no attribute 'tensors'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "...\test.py", line 9, in <module>
    keywords = kw_model.extract_keywords(text, keyphrase_ngram_range=(1, 1), stop_words='english', top_n=1, use_mmr=True)  
  File "...\envs\env\lib\site-packages\keybert\_model.py", line 195, in extract_keywords
    doc_embeddings = self.model.embed(docs)
  File "...\envs\env\lib\site-packages\keybert\backend\_spacy.py", line 88, in embed
    self.embedding_model("An empty document")
AttributeError: 'DocTransformerOutput' object has no attribute 'tensors'
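
For anyone hitting the same error, here is a minimal sketch of a possible workaround. It is not KeyBERT's actual fix. `pooled_vector` is a hypothetical helper, and it assumes that the newer `DocTransformerOutput` exposes a `last_hidden_layer_state` whose `.data` is a (pieces x width) array. Mean pooling over the pieces is my own choice and will not give the same vector as the old `tensors[-1][0]` lookup.

```python
def pooled_vector(trf_data):
    """Return one embedding (list of floats) from a spaCy doc's `_.trf_data`.

    Hypothetical helper for illustration; not part of KeyBERT or spaCy.
    """
    if hasattr(trf_data, "tensors"):
        # Older output object: same indexing as the line in the traceback.
        return trf_data.tensors[-1][0].tolist()
    if hasattr(trf_data, "last_hidden_layer_state"):
        # Assumption: newer output object with a ragged array whose `.data`
        # has shape (n_pieces, width). Mean pooling is an arbitrary choice.
        pieces = trf_data.last_hidden_layer_state.data
        return pieces.mean(axis=0).tolist()
    raise AttributeError(
        f"unsupported trf_data type: {type(trf_data).__name__}"
    )
```

It would be called as `pooled_vector(nlp(text)._.trf_data)`, with the `transformer` component left enabled in the pipeline.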

Package versions:
cupy-cuda11x 12.3.0
curated-tokenizers 0.0.9
curated-transformers 0.1.1
en-core-web-trf 3.7.3
keybert 0.8.5
keyphrase-vectorizers 0.0.13
safetensors 0.4.4
scikit-learn 1.5.1
scipy 1.13.1
sentence-transformers 3.0.1
spacy 3.7.5
spacy-alignments 0.9.1
spacy-curated-transformers 0.2.2
spacy-legacy 3.0.12
spacy-loggers 1.0.5
spacy-transformers 1.3.5
thinc 8.2.5
tokenizers 0.15.2
transformers 4.36.2

I'm so sorry, I didn't have enough coffee and meant to post this on the KeyBERT repo.