curiosity-ai/catalyst

Trained data model as the Stanford NLP .Net


Hi,

Currently, the trained model does not recognise organisations properly, e.g. in the following text:
"Centre for Dermatology Research, Manchester Academic Health Science Centre and NIHR Manchester Biomedical Research Centre, University of Manchester, Manchester, UK"

When I run the entity recognition model using Catalyst, it does not correctly recognise "Centre for Dermatology Research" as an organisation, and the same happens with many others.

If we pass the same full text to the Stanford NLP .NET demo site:
http://corenlp.run/
the organisations are recognised correctly, along with the city, state and country.

If Catalyst could ship the same default trained data model as Stanford NLP .NET, it would be a great feature for the Catalyst NER project.

Currently, Stanford NLP .NET does not support .NET Core, and there is no plan for it, at least not as of now.

I would definitely add that I am literally in love with the Catalyst project; it's so simple and easy to run.
I wish I had some way to train on my own data, just like in Stanford NLP .NET.

Keep up the good work; it really helps people like me who are noobs in machine learning programming.

Cheers,
Syd

Hi @code-noober!

Thanks for the feedback! I agree that the current entity recognition model is not ideal, partly because of the model itself (it's based on the older NER model from spaCy v1, and spaCy's NER has improved significantly since then), and partly because of a lack of training data (the WikiNER data we use is quite noisy for these tasks).

If you manage to find some good datasets for training NER for these tasks, I'm happy to take a look and extend the training data currently used by Catalyst.