Request for preprocessing script

Question

ArnaudFerre opened this issue 3 years ago · 5 comments

Hi,

I would like to reproduce the BioSyn results on the NCBI Disease Corpus (and to do an ablation study).
I was able to run your core method (plus lowercasing) on this corpus and on some others, but without the resolution of composite mentions and acronyms I only get a top-1 accuracy of around 0.801, compared with your published 0.911.

Could you please send me a pre-processing script?

Kind regards,
Arnaud

Answer 1 · 2021-04-28T19:34:17.000Z

I also have a minor question: how do you calculate the score for mentions that should be normalized to more than one concept? (I find that this represents only 0.4% of cases in the NCBI Disease Corpus.)
It seems that you give 1 point if the predicted concept is one of the correct ones, right?

Answer 2 · 2021-04-30T01:46:06.000Z

Hi, we're planning to upload preprocessing code soon.

As for mentions that have more than one concept, there are two cases.
Case1) If it's a composite mention, we first split it into single mentions and consider it correct if the predictions of all single mentions are correct.

Case2) If it's not a composite mention, yes, we consider it correct if one of the concepts is correctly predicted.
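
The two cases can be sketched in Python as follows. This is only an illustration of the rule as described above, not the BioSyn evaluation code: the data layout (each mention as a list of `(prediction, gold_set)` pairs), the function names, and the concept IDs `C1`, `C2`, ... are all hypothetical.

```python
from typing import List, Set, Tuple

# Hypothetical layout: a mention is a list of (top-1 prediction, gold concept set) pairs.
# A composite mention has one pair per single mention it was split into (Case 1).
# A non-composite mention has exactly one pair, whose gold set may hold several concepts (Case 2).
Mention = List[Tuple[str, Set[str]]]


def mention_is_correct(mention: Mention) -> bool:
    """Correct only if every single mention's prediction is among its gold concepts."""
    return all(pred in gold for pred, gold in mention)


def top1_accuracy(mentions: List[Mention]) -> float:
    """Fraction of mentions counted as correct under the rule above."""
    if not mentions:
        raise ValueError("no mentions to score")
    return sum(mention_is_correct(m) for m in mentions) / len(mentions)


# Illustrative data with made-up concept IDs.
example = [
    [("C1", {"C1"})],                    # single gold concept, correct
    [("C2", {"C2", "C3"})],              # Case 2: prediction matches one of two gold concepts
    [("C4", {"C4"}), ("C9", {"C5"})],    # Case 1: composite, second part wrong, so incorrect
    [("C7", {"C8"})],                    # incorrect
]
accuracy = top1_accuracy(example)
```

With this layout, both cases reduce to the same check: a composite mention simply contributes several pairs that must all match.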

Answer 3 · 2021-04-30T17:43:22.000Z

Hi,

Thank you for your answer.

Yes, I did indeed have doubts about case 2.

Sorry to ask, but do you have a more precise estimate of when the preprocessing script will be uploaded?
Given your results, I would have liked to include BioSyn in my study, but unfortunately I have deadlines to meet...

If you can't quickly provide this script in a state you are happy with, is there another option?
For example, what does query_preprocess.py do for the TAC-ADR-2017 data? Only acronym resolution (plus lowercasing and punctuation removal)?
In the worst case, I can try to use Ab3P directly, and in that case I would just need to know what you used to resolve composite mentions.

Finally, I would like your feedback on the plausibility of the accuracies I gave you. Did you also observe such a gain from the acronym and composite-mention resolution you applied?

Kind regards,
Arnaud

Answer 4 · 2021-04-30T22:51:29.000Z

Hello, thank you for your patience.

I just uploaded the preprocessing scripts for the NCBI-disease dataset.

I've observed that abbreviation and composite-mention resolution are very important steps for normalizing mentions accurately.

Answer 5 · 2021-05-03T17:37:12.000Z

Hi,
Thank you very much for the scripts and for your answer.
Kind regards,
Arnaud