yuzhimanhua/Multi-BioNER

How to output only one result file during prediction?

zhongxiangboy opened this issue · 1 comment

When I use the provided pre-trained model for prediction, five result files are output (they seem to correspond to the five datasets used for training).

So how can I output only one result file?

Do I need to merge all five datasets into one and then predict with a model trained on the merged data?

Yes, that behavior is expected: when you have N training datasets, there will be N output files, one per dataset. This is because we are doing multi-task learning with each dataset as a task. Note that these N output files may conflict (e.g., the same token may be predicted as S-GENE in output 1 but S-CHEMICAL in output 2). Outputting only one file (with conflicts resolved) is beyond the scope of this project.
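To make the kind of conflict described above concrete, here is a minimal sketch that flags tokens tagged with different non-"O" labels by different tasks. It is not part of this repository: `find_conflicts`, the task names, and the toy tags are all hypothetical, and the tag sequences are assumed to be already loaded into aligned Python lists (the output file format is not modeled here).

```python
from typing import Dict, List, Tuple


def find_conflicts(predictions: Dict[str, List[str]]) -> List[Tuple[int, Dict[str, str]]]:
    """Return (token index, {task: tag}) for tokens that received
    different non-"O" tags from different tasks.

    `predictions` maps a (hypothetical) task name to its tag sequence;
    all sequences are assumed to cover the same tokens in the same order.
    """
    lengths = {len(tags) for tags in predictions.values()}
    if len(lengths) > 1:
        raise ValueError("all tag sequences must cover the same tokens")
    n_tokens = lengths.pop() if lengths else 0

    conflicts = []
    for i in range(n_tokens):
        # Collect only the tasks that predicted an entity at this position.
        non_o = {task: tags[i] for task, tags in predictions.items() if tags[i] != "O"}
        # More than one distinct entity tag means the tasks disagree.
        if len(set(non_o.values())) > 1:
            conflicts.append((i, non_o))
    return conflicts


# Toy example (made-up predictions, not real model output).
tokens = ["p53", "binds", "aspirin"]
predictions = {
    "output_1": ["S-GENE", "O", "O"],
    "output_2": ["S-CHEMICAL", "O", "S-CHEMICAL"],
}
conflicts = find_conflicts(predictions)
for index, tags in conflicts:
    print(tokens[index], tags)
```

This only detects disagreements; how to resolve them (for example by preferring one task or comparing scores) is a separate design decision that the sketch deliberately leaves open.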

Merging all training sets into one will not work, because it introduces many false-negative training samples. For example, if the first training set annotates only GENE entities, then every CHEMICAL entity that appears in it is labeled "O".
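The false-negative problem can be illustrated with a small toy example. The sentence, the tags, and the helper `annotate_for` below are hypothetical and only mimic how a dataset that annotates a single entity type would label a sentence containing both types.

```python
# Toy sentence with complete gold labels for both entity types (made-up data).
tokens = ["BRCA1", "responds", "to", "tamoxifen"]
full_gold = ["S-GENE", "O", "O", "S-CHEMICAL"]


def annotate_for(entity_type, gold):
    """Keep only the tags of one entity type and label everything else "O",
    as a dataset that annotates a single entity type would."""
    return [tag if tag.endswith("-" + entity_type) else "O" for tag in gold]


# Labels as they would appear in a GENE-only training set.
gene_only = annotate_for("GENE", full_gold)

# Tokens that are true entities but carry an "O" label in the GENE-only set.
# After a naive merge, these become false-negative training samples.
false_negatives = [
    token
    for token, gold_tag, seen_tag in zip(tokens, full_gold, gene_only)
    if gold_tag != "O" and seen_tag == "O"
]

print(gene_only)
print(false_negatives)
```

In a merged training set, the "O" on the chemical mention would be treated as a confident negative label, which is exactly the noise that keeping each dataset as its own task avoids.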

To get a single output covering all entity types, as far as I know, you may refer to the following paper:

Marginal Likelihood Training of BiLSTM-CRF for Biomedical Named Entity Recognition from Disjoint Label Sets.
paper: https://aclanthology.org/D18-1306.pdf
code: https://github.com/ngreenberg/em-crf