Usage of CoNLL-03 values
svanhvitlilja opened this issue · 3 comments
Hi! We're working on a named entity recognizer for Icelandic, using NeuroNER and an annotated training corpus.
As there is no support for Icelandic in spaCy or the Stanford NLP tools, we ran into a problem when running NeuroNER on our data in brat format (the error appears when tokenizing with spaCy in brat_to_conll.py).
Our question is: can we bypass spaCy altogether by converting our data to CoNLL-03 format ourselves, using available Icelandic NLP resources? And to what extent are the CoNLL values used in NeuroNER?
Hi,
You need to look at this code: https://github.com/Franck-Dernoncourt/NeuroNER/blob/master/src/brat_to_conll.py#L20 and understand what it does. Then replace it with an Icelandic tokenizer/sentence segmenter.
It's not easy to do, and it could take a lot of time.
Thanks, we changed the source code to use our own tokenizing method :)
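For anyone landing here with the same question, the kind of replacement discussed in this thread might look roughly like the sketch below. It is not the code we or the NeuroNER authors used. `simple_tokenize` is a hypothetical regex stand-in for a real Icelandic tokenizer, and the two-column `token label` output with BIO tags is an assumption made for illustration: the original CoNLL-2003 files also carry part-of-speech and chunk columns, and the exact columns NeuroNER reads should be checked against brat_to_conll.py.

```python
import re
from typing import Dict, List, Tuple

Token = Dict[str, object]
Entity = Tuple[int, int, str]  # (start offset, end offset, label), brat-style


def simple_tokenize(text: str) -> List[Token]:
    """Hypothetical stand-in for a real Icelandic tokenizer.

    Splits text into runs of word characters or single punctuation marks
    and records the character offsets of each token.
    """
    return [
        {"text": m.group(), "start": m.start(), "end": m.end()}
        for m in re.finditer(r"\w+|[^\w\s]", text)
    ]


def to_conll_lines(text: str, entities: List[Entity]) -> List[str]:
    """Return 'token label' lines with BIO labels for one sentence."""
    lines = []
    for token in simple_tokenize(text):
        label = "O"
        for start, end, entity_type in entities:
            # A token belongs to an entity if it lies fully inside its span.
            if token["start"] >= start and token["end"] <= end:
                prefix = "B" if token["start"] == start else "I"
                label = f"{prefix}-{entity_type}"
                break
        lines.append(f"{token['text']} {label}")
    return lines


# Assumed example input: one sentence plus brat-style character spans.
sentence = "Jón Sigurðsson fæddist á Hrafnseyri."
entities = [(0, 14, "PER"), (25, 35, "LOC")]
print("\n".join(to_conll_lines(sentence, entities)))
```

Sentences would then be written one after another with a blank line between them. Keeping the character offsets on each token is what makes it possible to line tokens up with the brat annotations, whichever tokenizer ends up producing them.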