Spacy integration example is broken
iibrahimli opened this issue · 0 comments
Describe the bug
The example script is broken with the latest versions of spaCy and pysbd: adding the custom pipe to a spaCy model raises an exception. In addition, sentences are not split when extra spaces appear before or after them.
To Reproduce
Steps to reproduce the behavior:
Run the examples/pysbd_as_spacy_component.py script.
This raises the exception. If the pipeline addition is fixed and the code proceeds, the sentence segmentation still does not match the output of the pysbd module used without spaCy.
Expected behavior
The code should not raise an exception, and the output should match that of pysbd used on its own.
Example:
Input text - "Hello world. My name is Mr. Smith. I work for the U.S. Government and I live in the U.S. I live in New York."
Expected output - ["Hello world.", "My name is Mr. Smith.", "I work for the U.S. Government and I live in the U.S.", "I live in New York."]
Actual output - ["Hello world. My name is Mr. Smith. I work for the U.S. Government and I live in the U.S.", "I live in New York."]
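One possible explanation, offered as an assumption and not confirmed by anything above: if the segmenter's character spans include the whitespace around a sentence, they no longer line up with token boundaries, so a strict span lookup fails and that sentence start is never marked. A minimal, dependency-free sketch of trimming such spans before mapping them to tokens follows. The helper name `trim_span` and the sample offsets are hypothetical and only illustrate the idea.

```python
def trim_span(text, start, end):
    """Shrink the half-open range [start, end) so it neither begins nor ends on whitespace."""
    while start < end and text[start].isspace():
        start += 1
    while end > start and text[end - 1].isspace():
        end -= 1
    return start, end


text = "Hello world. My name is Mr. Smith. "

# Hypothetical spans that each include the space following the sentence.
raw_spans = [(0, 13), (13, 35)]

# After trimming, each span starts and ends on a non-whitespace character.
trimmed = [trim_span(text, s, e) for s, e in raw_spans]
sentences = [text[s:e] for s, e in trimmed]
```

With the spans trimmed this way, each start offset falls on the first character of a word, which is where a tokenizer would normally place a token boundary.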
Additional context
Versions tested:
spacy==3.0.6
pysbd==0.3.4
The error thrown when trying to add the pipe to the spaCy model:
ValueError: [E966] `nlp.add_pipe` now takes the string name of the registered component factory, not a callable component. Expected string, but got <function pysbd_sentence_boundaries at 0x7f0498c62158> (name: 'None').
- If you created your component with `nlp.create_pipe('name')`: remove nlp.create_pipe and call `nlp.add_pipe('name')` instead.
- If you passed in a component like `TextCategorizer()`: call `nlp.add_pipe` with the string name instead, e.g. `nlp.add_pipe('textcat')`.
- If you're using a custom component: Add the decorator `@Language.component` (for function components) or `@Language.factory` (for class components / factories) to your custom component and assign it a name, e.g. `@Language.component('your_name')`. You can then run `nlp.add_pipe('your_name')` to add it to the pipeline.