Full BiomedParseData Access

Question

agemagician opened this issue 3 months ago · 4 comments

Hello,

Thanks a lot for the great work.

Three questions:

1. The released dataset is only a subset of the original datasets. Could you share the percentage of this subset per dataset?
2. Could you share the complete processed datasets?
3. Could you let us know which script we should use, and how to use it, to generate the processed datasets from the original datasets?

Thanks in advance for your help and support.

Answer 1 · 2024-12-07T05:35:59.000Z

1. Each dataset we released is complete: it contains 100% of that dataset in our collection.
2. Because some datasets are too large, we started by uploading the core subset. If you are interested in a specific dataset, please let me know and I will try to upload it or find a better way to share it.
3. Preprocessing is very dataset-specific, since each dataset has a different format. If you are interested in a specific dataset, I am happy to share the script closest to your needs.

Hope these are helpful for you!

Answer 2 · 2024-12-07T08:35:49.000Z

Thanks a lot for your quick and detailed reply.

To confirm, the model was trained only on the subset that was released on the Hugging Face Hub here:
https://huggingface.co/datasets/microsoft/BiomedParseData
Correct?

Answer 3 · 2024-12-08T01:17:58.000Z

The datasets uploaded as of now were all used for training. The complete list of datasets is in Supplementary Table 1 (https://static-content.springer.com/esm/art%3A10.1038%2Fs41592-024-02499-w/MediaObjects/41592_2024_2499_MOESM1_ESM.pdf), where we specified the training subset.

Answer 4 · 2024-12-08T09:04:18.000Z

Perfect, thanks a lot for your explanation and your reply.