Franck-Dernoncourt/NeuroNER

Usage of CoNLL-03 values

svanhvitlilja opened this issue · 3 comments

Hi! We're working on a named entity recognizer for Icelandic, using NeuroNER and an annotated training corpus.

As there is no support for Icelandic in spaCy or the Stanford NLP tools, we ran into a problem when running NeuroNER on our data in brat format (the error appears when tokenizing with spaCy in brat_to_conll.py).

Our question is: Can we bypass spaCy altogether by formatting our data in CoNLL-03 ourselves, using available Icelandic NLP resources? And to what extent are the CoNLL values used in NeuroNER?

Hi,
You need to check this code: https://github.com/Franck-Dernoncourt/NeuroNER/blob/master/src/brat_to_conll.py#L20 and understand what it does. Then replace it with an Icelandic tokenizer/sentence segmenter.

It's not easy to do, and it could take a lot of time.
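
For anyone taking the same route, here is a rough sketch of what such a replacement could look like. It is not the NeuroNER code: `tokenize_with_offsets` and `to_conll_lines` are hypothetical names, the regex tokenizer is only a placeholder for a real Icelandic tokenizer, and the column layout (token, document name, start offset, end offset, label) is an assumption that should be checked against what brat_to_conll.py actually writes.

```python
import re


def tokenize_with_offsets(text):
    """Split text into sentences, each a list of token dicts with offsets.

    Placeholder rule-based tokenizer (an assumption for illustration):
    swap in a proper Icelandic tokenizer/sentence segmenter here.
    """
    sentences, current = [], []
    for match in re.finditer(r"\w+|[^\w\s]", text):
        current.append(
            {"text": match.group(), "start": match.start(), "end": match.end()}
        )
        # Naive sentence boundary: end the sentence at ., ! or ?
        if match.group() in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def to_conll_lines(text, entities, doc_name):
    """Build one line per token with a BIO label, blank line per sentence.

    `entities` is a list of (start, end, type) character spans, such as
    those read from a brat .ann file. The column order is an assumption.
    """
    lines = []
    for sentence in tokenize_with_offsets(text):
        previous = None
        for token in sentence:
            # Find the entity span that fully contains this token, if any.
            entity = next(
                (e for e in entities
                 if e[0] <= token["start"] and token["end"] <= e[1]),
                None,
            )
            if entity is None:
                label = "O"
            elif entity is previous:
                label = "I-" + entity[2]
            else:
                label = "B-" + entity[2]
            previous = entity
            lines.append(
                f'{token["text"]} {doc_name} {token["start"]} {token["end"]} {label}'
            )
        lines.append("")  # sentence separator
    return lines


if __name__ == "__main__":
    example = to_conll_lines("Anna lives in Oslo.", [(0, 4, "PER"), (14, 18, "LOC")], "doc1")
    print("\n".join(example))
```

The character offsets are kept so that each token can be matched back to the brat annotation spans. A token that only partly overlaps an annotation is labelled `O` in this sketch, which is the kind of mismatch worth logging when the tokenizer and the annotations disagree.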

Thanks, we changed the source code to use our own tokenizing method :)