BIO Coding Schemes for NER are hardcoded
Hi, thanks for sharing scibert, it's helping me a lot (together with scispacy)!
I noticed that not all datasets included in this package for NER are formatted according to the same coding scheme. It seems that bc5cdr and sciie are IOB1 (single token spans are I-Entity), while JNLPBA and NCBI-disease are BIOUL (single token spans are B-Entity).
Maybe this should also be a variable in the training script instead of hardcoded, and the README could direct users to change it according to the dataset?
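For anyone who wants to check which scheme a given dataset file uses before picking a setting, here is a rough sketch. The function name `guess_coding_scheme` is made up for illustration and is not part of scibert or AllenNLP, and the check is only a heuristic: it assumes tags look like `O` or `PREFIX-Label`, and it cannot tell IOB1 from BIO if no span in the data happens to open with an I- tag.

```python
from typing import List


def guess_coding_scheme(tag_sequences: List[List[str]]) -> str:
    """Heuristically guess the tagging scheme of some NER tag sequences.

    Hypothetical helper, not part of scibert or AllenNLP.
    """
    # Collect the prefix of every tag ("O", "B", "I", "U", "L", ...).
    prefixes = {tag.split("-", 1)[0] for seq in tag_sequences for tag in seq}

    # U- and L- tags only exist in BIOUL.
    if prefixes & {"U", "L"}:
        return "BIOUL"

    for seq in tag_sequences:
        prev = "O"
        for tag in seq:
            if tag.startswith("I-"):
                # An I- tag that does not continue a span of the same type
                # opens a new span, which is allowed in IOB1 but not in BIO.
                if prev == "O" or prev[2:] != tag[2:]:
                    return "IOB1"
            prev = tag

    # Every span opened with B-, so this looks like BIO (also called IOB2).
    return "BIO"


print(guess_coding_scheme([["O", "I-Chemical", "O"]]))                # IOB1
print(guess_coding_scheme([["O", "B-Chemical", "I-Chemical", "O"]]))  # BIO
print(guess_coding_scheme([["U-Disease", "O"]]))                      # BIOUL
```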
Thanks,
Dan
PS: Curiously, I used the training script for bc5cdr with the wrong coding scheme (the default BIOUL in the ner.json config) and the performance was very close to what you report in the paper ("test_f1-measure-overall": 0.8881). Any idea why? Is it irrelevant as long as dev and test are in the same format? Could the model learn a coding 'fix' during training?
Have you successfully run the project without any error? I think there is something wrong with the code, could you give me a hand?
I don't think these two coding schemes should produce meaningfully different results. As you can see here in the AllenNLP reader, the default behavior is to map IOB1 to BIOUL anyway:
https://github.com/allenai/allennlp/blob/30c4271f7f04babb1cb546ab017a104bda011e7c/allennlp/data/dataset_readers/conll2003.py#L138
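To illustrate the mapping mentioned above, here is a minimal standalone sketch of an IOB1 to BIOUL conversion. It is not AllenNLP's implementation (the function name `iob1_to_bioul` is made up for this example), and it assumes well-formed tags of the form `O`, `B-Label` or `I-Label`. Note that a sequence whose spans all open with B- is also valid IOB1 input, so both styles of file decode to the same spans, which would be consistent with the similar scores reported above.

```python
from typing import List


def iob1_to_bioul(tags: List[str]) -> List[str]:
    """Convert one IOB1 tag sequence to BIOUL (illustrative sketch only)."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue

        prefix, label = tag.split("-", 1)
        prev = tags[i - 1] if i > 0 else "O"
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"

        # A span starts here on an explicit B-, after O, or after another type.
        starts = prefix == "B" or prev == "O" or prev.split("-", 1)[1] != label
        # A span ends here before O, before a B-, or before another type.
        ends = nxt == "O" or nxt.startswith("B-") or nxt.split("-", 1)[1] != label

        if starts and ends:
            out.append("U-" + label)  # single-token span
        elif starts:
            out.append("B-" + label)  # first token of a longer span
        elif ends:
            out.append("L-" + label)  # last token of a longer span
        else:
            out.append("I-" + label)  # token inside a span
    return out


print(iob1_to_bioul(["I-Chemical", "O", "I-Disease", "I-Disease"]))
# ['U-Chemical', 'O', 'B-Disease', 'L-Disease']
print(iob1_to_bioul(["B-Chemical", "O", "B-Disease", "I-Disease"]))
# ['U-Chemical', 'O', 'B-Disease', 'L-Disease']
```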