Request for preprocessing script
ArnaudFerre opened this issue · 5 comments
Hi,
I would like to reproduce the BioSyn results on the NCBI Disease Corpus (and to do an ablation study).
I was able to run your core method (with lowercasing) on this corpus (and some others), but without the resolution of composite mentions and acronyms I only reach a top-1 accuracy of about 0.801, compared with your published 0.911.
Could you please send me a pre-processing script?
Kind regards,
Arnaud
I also have a minor question: how do you calculate the score for mentions that should be normalized to more than one concept? (I find that this represents only 0.4% of cases in the NCBI Disease Corpus.)
It seems that you give 1 point if the predicted concept is one of the correct ones, right?
Hi, we're planning to upload preprocessing code soon.
As for mentions that have more than one concept, there are two cases.
Case 1) If it's a composite mention, we first split it into single mentions and consider it correct only if the predictions for all single mentions are correct.
Case 2) If it's not a composite mention, yes, we consider it correct if one of the concepts is correctly predicted.
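The two cases above can be sketched as follows. This is only a minimal illustration of the scoring rule as described in this comment, not the BioSyn evaluation code: the function names, the data layout (a mention as a list of `(predicted_id, gold_ids)` pairs, one pair per single mention), and the concept IDs are all hypothetical.

```python
from typing import List, Set, Tuple

# Hypothetical layout: one (top-1 predicted ID, set of gold IDs) pair per single mention.
# A non-composite mention has one pair; a composite mention has one pair per split part.
Mention = List[Tuple[str, Set[str]]]


def mention_correct(parts: Mention) -> bool:
    """Case 1: every split part must be correct.
    Case 2: a part is correct if its prediction matches any one of its gold concepts."""
    return all(predicted in gold for predicted, gold in parts)


def top1_accuracy(mentions: List[Mention]) -> float:
    """Fraction of mentions scored as correct under the rule above."""
    return sum(mention_correct(m) for m in mentions) / len(mentions)


# Made-up concept IDs, for illustration only.
mentions = [
    [("C1", {"C1", "C2"})],                # non-composite, two gold concepts, one hit
    [("C3", {"C3"}), ("C9", {"C4"})],      # composite, second part wrong
    [("C5", {"C5"}), ("C6", {"C6"})],      # composite, all parts right
    [("C7", {"C8"})],                      # non-composite, wrong
]
accuracy = top1_accuracy(mentions)
```

Under this rule a composite mention counts once in the denominator, so a single wrong part makes the whole mention incorrect.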
Hi,
Thank you for your answer.
Yes, I did indeed have doubts about case 2.
Sorry to ask, but do you have a more precise estimate of when the preprocessing script will be uploaded?
Given your results, I would have liked to include BioSyn in my study, but unfortunately I have deadlines to meet...
If you can't quickly provide the script in a state you are happy with, is there another option?
For example, what does query_preprocess.py do for the TAC-ADR-2017 data? Only acronym resolution (plus lowercasing and punctuation removal)?
In the worst case, I can try to use Ab3P directly; in that case, I would just need to know what you used to resolve composite mentions.
Finally, I would like your feedback on the plausibility of the accuracies I gave you. Did you also observe such a gain with the acronym and composite mention resolution you applied?
Kind regards,
Arnaud
Hello, thank you for your patience.
I just uploaded the preprocessing scripts for the NCBI-disease dataset.
I've observed that abbreviation and composite mention resolution are very important steps for normalizing mentions accurately.
Hi,
Thank you very much for the scripts and for your answer.
Kind regards,
Arnaud