microsoft/BiomedParse

Full BiomedParseData Access

Closed this issue · 4 comments

Hello,

Thanks a lot for the great work.

Three questions:

  1. The released dataset is only a subset of the original datasets. Could you share the percentage of this subset per dataset?
  2. Could you share the complete processed datasets?
  3. Could you let us know which script we should use and how to use it to generate the processed datasets from the original datasets?

Thanks in advance for your help and support.

  1. Each dataset we released is complete: it contains 100% of that dataset as it exists in our collection.
  2. Because some datasets are too large, we started with the core subset. If there is a specific dataset you are interested in, please let me know and I will try to upload it or find a better way to share it.
  3. The preprocessing is very dataset-specific, since each dataset has a different format. If there is a specific dataset you are interested in, I am happy to share the script that is closest to your needs.

Hope these are helpful for you!

Thanks a lot for your quick and detailed reply.

To confirm, the model was trained only on the subset that was released on the Hugging Face Hub here:
https://huggingface.co/datasets/microsoft/BiomedParseData
Correct?

The datasets uploaded as of now were all used for training. The complete list of datasets is in Supplementary Table 1 (https://static-content.springer.com/esm/art%3A10.1038%2Fs41592-024-02499-w/MediaObjects/41592_2024_2499_MOESM1_ESM.pdf), where we specified the training subset.

Perfect, thanks a lot for your explanation and your reply.