curiosity-ai/catalyst

Error creating Japanese NLP Pipeline

Opened this issue · 4 comments

Describe the bug
Trying to load the Pipeline for the Japanese model/language results in a MessagePackSerializationException. This is on .NET 6 on Windows 10.

To Reproduce

  1. Add the Japanese model NuGet package.
  2. Run the following code:
Catalyst.Models.Japanese.Register();
var nlp = await Pipeline.ForAsync(Language.Japanese);

The second line fails with the exception shown under Additional context.

Expected behavior
Create the Pipeline without error and be able to perform NLP on Japanese text.

Additional context

MessagePack.MessagePackSerializationException : Error occurred while reading from the stream.
---- System.NullReferenceException : Object reference not set to an instance of an object.

  Stack Trace: 
MessagePackSerializer.DeserializeAsync[T](Stream stream, MessagePackSerializerOptions options, CancellationToken cancellationToken)
StorableObjectV2`2.LoadAsync(Stream stream)
AveragePerceptronTagger.LoadAsync(Stream stream)
<<Register>b__0_7>d.MoveNext()
--- End of stack trace from previous location ---
ResourceLoader.LoadAsync[T](Assembly assembly, String resourceFile, Func`2 loader)
<<Register>b__0_0>d.MoveNext()
--- End of stack trace from previous location ---
StorableObject`2.LoadDataAsync()
AveragePerceptronTagger.FromStoreAsync(Language language, Int32 version, String tag)
Pipeline.ForAsync(Language language, Boolean sentenceDetector, Boolean tagger)

Hi @gilliganc, thanks for reporting it. This is probably because we don't have an AveragePerceptronTagger model for Japanese. I'll investigate how to improve this.

Meanwhile, you can create a tokenizer-only pipeline.
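A minimal sketch of what that workaround might look like. The stack trace above shows the signature Pipeline.ForAsync(Language language, Boolean sentenceDetector, Boolean tagger); this sketch assumes that passing false for both flags skips loading the sentence detector and the missing tagger model. That behavior is an assumption and has not been verified against the Japanese package. The sample sentence is only illustrative.

```csharp
using System;
using Catalyst;
using Mosaik.Core;

// Register the Japanese model package, as in the reproduction steps.
Catalyst.Models.Japanese.Register();

// Assumption: with both flags set to false, only the tokenizer is added,
// so AveragePerceptronTagger.FromStoreAsync is never called.
var nlp = await Pipeline.ForAsync(Language.Japanese, sentenceDetector: false, tagger: false);

// Illustrative input; replace with your own text.
var doc = new Document("これはテストです。", Language.Japanese);
nlp.ProcessSingle(doc);

// Inspect the tokens produced by the tokenizer-only pipeline.
Console.WriteLine(doc.ToJson());
```

A pipeline built this way produces tokens only, with no part-of-speech tags, so it will not identify nouns on its own.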

Thanks. I think I need more than the tokenizer. I was trying to port some existing Python code built on spaCy to .NET, to see if I could improve the performance and make it easier to integrate. Based on what the person who wrote the original code said, I need more than the tokenizer. We are trying to detect the keywords and the nouns in Japanese text, and I don't think the tokenizer alone would help, right?

Is this being worked on? I still have this error. It's definitely the AveragePerceptronTagger (I'm getting NullReferenceException).

Does the tokenizer even work properly?

Is there a reason this spaCy model has been ported without it? The Japanese model is pretty much useless right now if I can't get anything to work. How soon can this be fixed?

It looks like spaCy hasn't used averaged perceptron taggers since before version 2.0. They now use neural networks (matrix multiplication). Are all the Catalyst models based on APTs?

@CodeRabbit957 we haven't updated the tagger because we no longer use it in our own app. In any case, Catalyst would need to incorporate a proper CJK tokenizer such as https://github.com/leungwensen/cjk-tokenizer to be able to handle Japanese correctly. If you're up for the challenge, PRs are welcome!