Not able to load custom language

Question

seansaito opened this issue 3 years ago · 7 comments

Hi, we're using pke for Japanese keyword extraction with a custom library (Ginza)
https://megagonlabs.github.io/ginza/

Until version 1.8.1, pke worked fine. However, with the recent major release (literally hours ago), we can no longer load our model or extract keywords:

[2022-03-08 04:02:13,409] {readers.py:65} ERROR - No spacy model for 'ja_ginza' language.
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/devuser/venvs/pipeline_extract_keywords/lib/python3.8/site-packages/pke/base.py", line 117, in load_document
    for i, sentence in enumerate(self.sentences):
TypeError: 'NoneType' object is not iterable
"""
[2022-03-08 04:02:13,408] {readers.py:65} ERROR - No spacy model for 'ja_ginza' language.
[2022-03-08 04:02:13,408] {readers.py:65} ERROR - No spacy model for 'ja_ginza' language.

Could you provide a link to the pke 1.8.1 release? It seems to have been deleted from this repo. Thanks!

Answer 1 · 2022-03-07T19:26:36.000Z

Also, even if I choose the default Japanese spaCy model, it fails to load:

[2022-03-08 04:24:44,928] {readers.py:65} ERROR - No spacy model for 'ja_core_news_sm' language.
[2022-03-08 04:24:44,928] {readers.py:66} ERROR - A list of available spacy models is available at https://spacy.io/models.
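
To narrow down whether a "No spacy model" error comes from a missing model or from pke itself, it can help to check that the model package is importable in the same environment. The snippet below is a minimal sketch, assuming the model was installed as a regular Python package (which is how `python -m spacy download` and pip-installed models such as ja_ginza are normally set up). The helper name `model_installed` is hypothetical and is not part of pke or spaCy.

```python
import importlib.util


def model_installed(package_name):
    """Return True if a package with this name can be found on the import path.

    Assumption: the spaCy model is pip-installed under its own name,
    e.g. 'ja_core_news_sm' or 'ja_ginza'.
    """
    return importlib.util.find_spec(package_name) is not None


# Report the status of the two models mentioned in this thread.
for name in ("ja_core_news_sm", "ja_ginza"):
    status = "installed" if model_installed(name) else "not found"
    print(f"{name}: {status}")
```

If the model shows up as installed here but pke still logs the error, the problem is more likely in how pke resolves the language code than in the environment.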

Answer 2 · 2022-03-07T19:31:00.000Z

Sorry, I broke many things while doing a lot of refactoring to simplify further development and ease maintenance.

I think the issue comes from the fact that Japanese is missing from lang.py. I'll do some tests and get back to you.

Answer 3 · 2022-03-07T20:03:44.000Z

So it seems that the issue was simply the Japanese langcode missing from lang.py. It is now fixed in fede063.

To test, I installed the Japanese spaCy model using:

python -m spacy download ja_core_news_sm

and then ran the following Python code successfully:

import pke

sample = """富士山（、英語: Mount Fuji）は、山梨県（富士吉田市、南都留郡鳴沢村）と、
静岡県（富士宮市、富士市、裾野市、御殿場市、駿東郡小山町）に跨る活火山である[注釈 3]。
標高3776.12 m、日本最高峰（剣ヶ峰）[注釈 4]の独立峰で、
その優美な風貌は日本国外でも日本の象徴として広く知られている。"""

extractor = pke.unsupervised.FirstPhrases()
extractor.load_document(input=sample, language='ja')
extractor.candidate_selection()
extractor.candidate_weighting()
print(extractor.get_n_best(n=10))

which produces:

[('富士 山 （', 0), ('英語', -4), ('mount fuji ）', -6), ('山梨 県 （ 富士吉田 市', -11), ('南都留 郡 鳴沢村 ）', -17), ('静岡 県 （ 富士宮 市', -24), ('富士 市', -30), ('裾野 市', -33), ('御殿場 市', -36), ('駿東 郡 小山 町 ）', -39)]

You should also be able to use a custom spaCy model by passing it through the spacy_model parameter:

import pke
import spacy

nlp = spacy.load("your model")
extractor = pke.unsupervised.FirstPhrases()
extractor.load_document(input="some japanese text", language='ja', spacy_model=nlp)

Please let me know if this feature works (AFAIK it is untested).

Best,

f.

Answer 4 · 2022-03-07T20:07:55.000Z

Thanks!

Unfortunately, custom spaCy models can fail when pke tries to add the "sentencizer" to a pipeline that already has one (which is the case for our custom Japanese model):

"""
Traceback (most recent call last):
  File "/home/devuser/src/ml/keywords/keyword_extractor.py", line 94, in do_yake
    extractor.load_document(input=text, language="ja_ginza", normalization=None, spacy_model=nlp)
  File "/home/devuser/venvs/pipeline_extract_keywords/lib/python3.8/site-packages/pke/base.py", line 93, in load_document
    sents = parser.read(text=input, spacy_model=spacy_model)
  File "/home/devuser/venvs/pipeline_extract_keywords/lib/python3.8/site-packages/pke/readers.py", line 70, in read
    nlp.add_pipe('sentencizer')
  File "/home/devuser/venvs/pipeline_extract_keywords/lib/python3.8/site-packages/spacy/language.py", line 771, in add_pipe
    raise ValueError(Errors.E007.format(name=name, opts=self.component_names))
ValueError: [E007] 'sentencizer' already exists in pipeline. Existing names: ['tok2vec', 'parser', 'attribute_ruler', 'ner', 'morphologizer', 'compound_splitter', 'bunsetu_recognizer', 'sentencizer']
"""
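
The E007 error above is raised because add_pipe is called with a component name that is already in the pipeline. One common way to avoid it is to check the pipeline's component names first. The snippet below is a minimal sketch of that guard, not the change that was made in pke. `ensure_sentencizer` and `FakePipeline` are hypothetical names: the stand-in class only mimics the `pipe_names` attribute and `add_pipe` method of a spaCy `Language` object so that the example runs without any model installed.

```python
def ensure_sentencizer(nlp):
    """Add a 'sentencizer' component only if the pipeline lacks one."""
    if "sentencizer" not in nlp.pipe_names:
        nlp.add_pipe("sentencizer")
    return nlp


class FakePipeline:
    """Hypothetical stand-in exposing pipe_names/add_pipe like a spaCy pipeline."""

    def __init__(self, names):
        self.pipe_names = list(names)

    def add_pipe(self, name):
        # Mirror the duplicate-name failure seen in the traceback.
        if name in self.pipe_names:
            raise ValueError(f"[E007] '{name}' already exists in pipeline.")
        self.pipe_names.append(name)


# A pipeline that already has a sentencizer is left unchanged.
with_one = ensure_sentencizer(FakePipeline(["parser", "sentencizer"]))
# A pipeline without one gets it appended.
without = ensure_sentencizer(FakePipeline(["tok2vec", "parser"]))

print(with_one.pipe_names)
print(without.pipe_names)
```

With a real model, the same function could be called on the object returned by spacy.load before passing it to load_document.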

Answer 5 · 2022-03-07T20:42:28.000Z

Hmm, I just removed the sentencizer for custom models in 3cfe17b.

Answer 6 · 2022-03-08T02:30:22.000Z

@boudinfl Got it, thanks!

By the way, could you let me know exactly which commit corresponds to version 1.8.1? I'd like to keep this commit for the sake of backwards compatibility. Thanks a lot for looking into this!

Answer 7 · 2022-03-08T08:42:43.000Z

pke 1.8.1 would be commit f651015.