allenai/scibert

BIO Coding Schemes for NER are hardcoded

Closed this issue · 2 comments

Hi, thanks for sharing scibert, it's helping me a lot (together with scispacy)!

I noticed that not all of the NER datasets included in this package are formatted with the same coding scheme. It seems that bc5cdr and sciie are IOB1 (single-token spans are I-Entity), while JNLPBA and NCBI-disease are IOB2 (single-token spans are B-Entity).
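To make the difference concrete, here is a toy illustration of how a single-token mention comes out under each scheme (made-up tokens and labels, not taken from the actual data files):

```python
# Toy illustration (made-up example, not from the data files): how a
# single-token entity mention is tagged under each coding scheme.
tokens     = ["with", "aspirin", "treatment"]
iob1_tags  = ["O", "I-Chemical", "O"]   # IOB1: spans open with I- unless they directly
                                        #       follow another span of the same type
iob2_tags  = ["O", "B-Chemical", "O"]   # IOB2/BIO2: every span opens with B-
bioul_tags = ["O", "U-Chemical", "O"]   # BIOUL: a single-token span is tagged U-
```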

Maybe the coding scheme should be a variable in the training script instead of being hardcoded, and the README could direct users to set it according to the dataset?
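For example (just a sketch, assuming the NER config uses AllenNLP's standard conll2003-style reader; I haven't checked the exact structure of ner.json), the reader already exposes the scheme as a constructor argument, so the config could simply pass it through per dataset:

```python
# Sketch (assuming AllenNLP 0.x): the conll2003 reader already takes the coding
# scheme as a parameter, so the training config could expose it per dataset.
from allennlp.data.dataset_readers import Conll2003DatasetReader

# coding_scheme="IOB1" keeps the tags as they appear in the file;
# coding_scheme="BIOUL" converts IOB1-style input tags to BIOUL on the fly.
reader = Conll2003DatasetReader(tag_label="ner", coding_scheme="BIOUL")
```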

Thanks,
Dan

PS: Curiously, I used the training script for bc5cdr with the wrong coding scheme (the default BIOUL in the ner.json config) and the performance was very close to what you report in the paper ("test_f1-measure-overall": 0.8881). Any idea why? Does it not matter as long as the dev and test sets are in the same format? Could the model learn to 'fix' the coding during training?

Have you successfully run the project without any errors? I think there is something wrong with the code; could you give me a hand?

I don't think these two coding schemes should result in meaningfully different results. You can see in the AllenNLP reader that the default behavior is to map IOB1 to BIOUL anyway:
https://github.com/allenai/allennlp/blob/30c4271f7f04babb1cb546ab017a104bda011e7c/allennlp/data/dataset_readers/conll2003.py#L138
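For concreteness, here is a minimal sketch (assuming an AllenNLP 0.x install) of the conversion that reader applies when its `coding_scheme` is set to `"BIOUL"`:

```python
# Minimal sketch (assuming AllenNLP 0.x): the same conversion the conll2003
# reader applies internally when coding_scheme is "BIOUL".
from allennlp.data.dataset_readers.dataset_utils import to_bioul

iob1_tags = ["I-Chemical", "O", "I-Disease", "I-Disease"]
print(to_bioul(iob1_tags, encoding="IOB1"))
# -> ['U-Chemical', 'O', 'B-Disease', 'L-Disease']
```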