Franck-Dernoncourt/NeuroNER

Bypassing spacy for deployment of a pretrained model

asmundur opened this issue · 0 comments

I'm using NeuroNER in connection with other NER models, and I have a bit of a problem I can't seem to figure out.

This setup depends heavily on identical tokenization, and using spaCy for the other models is out of the question. Implementing the tokenizer we're using ourselves seems too time-consuming and complicated to be feasible.

For training, this was easy to accomplish: providing the data in CoNLL format bypasses the tokenizer, since CoNLL-formatted data is already tokenized, and the same holds for the deploy set provided at training time. So the functionality I require is already in the project.
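To make the "already tokenized" point concrete, here is a minimal sketch of reading CoNLL-style data (one token per line, label in the last column, blank line between sentences — the layout NeuroNER's training data uses). The key property is that tokens are taken verbatim from the file, so no tokenizer, spaCy or otherwise, is ever invoked:

```python
def read_conll(text):
    """Parse CoNLL-formatted text into sentences of (token, label) pairs.

    Tokens are taken as-is from the first column, so the original
    tokenization is preserved exactly.
    """
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:  # blank line marks a sentence boundary
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split()
        # First column is the token, last column is the (gold or placeholder) label;
        # any columns in between (POS, chunk, offsets) are ignored here.
        current.append((fields[0], fields[-1]))
    if current:
        sentences.append(current)
    return sentences

sample = """John B-PER
lives O
in O
Reykjavik B-LOC

He O
works O
"""
print(read_conll(sample))
```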

Consider these lines from the documentation:

from neuroner import neuromodel
nn = neuromodel.NeuroNER(train_model=False, use_pretrained_model=True)
nn.predict('SOME STRING TO BE TAGGED')

I need something that works very similarly to this, that is, a predict function that can be called repeatedly without re-initializing the model, but which takes a file in CoNLL format instead of a plain string.

Is it possible to do this directly, without modifying the source project? And if not, how difficult would it be to modify the project so that this functionality, which already exists in NeuroNER, is exposed as a member function of the NeuroNER class?

The most obvious and simplest solution I have thought of is to reverse engineer the predict functionality used during training and implement a new predict_conll function that works much like predict does now. But I highly doubt that this is the best approach, or even feasible.

I understand that this is related to the following two issues, but it is not exactly the same problem.
#30
#126