shangjingbo1226/AutoNER

Question about train/dev/test data

caolingyu opened this issue · 1 comment

Hi Jingbo,

Thanks for providing the tool, which is very useful.

I have a question about the data. How did you split the data into train/dev/test sets? I find some sentences in raw_text that are also in truth_dev.ck and truth_test.ck. Does this mean that some of the dev/test data are in the training set as well?

In addition, did you evaluate performance on the auto-annotated dataset or on the human-annotated dataset? You mention that dev/test files are optional, but I think that in that case there is no human-annotated data for evaluation.

Thanks a lot.

You can add as much raw text as you have. It will help performance.
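To see how many dev/test sentences also appear in the raw text, a quick check along these lines may help. This is only a minimal sketch and not part of the repo: `normalize` and `find_overlap` are hypothetical helpers, and it assumes the sentences have already been loaded as plain strings (parsing of raw_text and the .ck files is left out, since their format is not described in this thread).

```python
def normalize(sentence: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(sentence.lower().split())


def find_overlap(raw_sentences, eval_sentences):
    """Return the evaluation sentences that also occur in the raw text.

    Both arguments are iterables of strings (an assumption of this sketch).
    """
    raw_set = {normalize(s) for s in raw_sentences}
    return [s for s in eval_sentences if normalize(s) in raw_set]


if __name__ == "__main__":
    raw = ["The patient was given aspirin .", "No side effects were seen ."]
    dev = ["the patient was given  aspirin .", "A new drug was tested ."]
    shared = find_overlap(raw, dev)
    print(f"{len(shared)} of {len(dev)} dev sentences also appear in the raw text")
```

Exact matching after normalization is a deliberately simple choice; near-duplicates would need a fuzzier comparison.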

In the paper, we evaluated against the human-annotated dataset. In this repo, when the dev/test files are missing, the code uses the generated train file as the dev file to choose the checkpoint. We haven't systematically tested this feature yet. For better and more stable model selection, I suggest you have a separate, human-annotated dev file.
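The fallback described above can be illustrated roughly as follows. This is a hedged sketch of the idea only, not the actual code in this repo: `select_dev_file` and its parameters are hypothetical names, and the file names in the example are placeholders.

```python
import os


def select_dev_file(train_file, dev_file=None, exists=os.path.isfile):
    """Pick the file used for checkpoint selection.

    Returns (path, is_fallback). If no dev file is given, or it does not
    exist, the generated train file is used instead and is_fallback is True.
    `exists` is injectable so the logic can be exercised without real files.
    """
    if dev_file is not None and exists(dev_file):
        return dev_file, False
    return train_file, True


if __name__ == "__main__":
    path, is_fallback = select_dev_file("generated_train.ck", None)
    if is_fallback:
        # Selecting checkpoints on auto-annotated data may be less reliable.
        print(f"No dev file found; using {path} for checkpoint selection")
```

Returning the `is_fallback` flag makes it easy to warn that checkpoint selection is running on auto-annotated data rather than on a human-annotated dev set.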