



This repository contains the PMC-Patients dataset (patient notes, patient-patient similarity annotations, patient-article relevance annotations, and four downstream task datasets: patient note recognition (PNR), patient-patient similarity (PPS), patient-patient retrieval (PPR), and patient-article retrieval (PAR)), the code used to collect the dataset, and several baseline models.

See our paper.

PMC OA and PubMed Downloads

If you only wish to reproduce the baseline models, you need only the PubMed abstracts, which are used by the PAR task.

If you have already downloaded PMC OA and the PubMed abstracts, skip this step and adjust the relative directories in later steps. Otherwise, download PMC OA and PubMed. Note that the file PMC-ids.csv under this directory is also required.


The PMC-Patients dataset can be downloaded via this link without any data usage agreement. After downloading, unzip it and put the datasets and meta_data directories under this directory.
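To confirm that the unzipped files landed in the right place, a quick check like the one below can help. This is a minimal sketch, not part of the repository: the helper name `missing_entries` is hypothetical, and it only assumes the two directory names mentioned above (datasets and meta_data).

```python
from pathlib import Path

# Directory names taken from this README; adjust if your layout differs.
EXPECTED = ("datasets", "meta_data")


def missing_entries(root=".", expected=EXPECTED):
    """Return the expected entries that do not exist under `root`.

    `missing_entries` is a hypothetical helper for illustration only.
    """
    root = Path(root)
    return [name for name in expected if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_entries()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All expected entries found.")
```

Run it from the repository root. Other required files can be checked the same way by passing them in `expected`.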

For dataset details, see README.md in the datasets and meta_data directories.

All articles used in PMC-Patients are credited in meta_data/PMC-Patients_citations.json.
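The schema of the JSON files is documented in the README.md files mentioned above, not here. If you want a quick look before reading them, the sketch below loads a JSON file and reports only its top-level shape, without assuming any particular structure. The function name `summarize_json` is hypothetical and not part of this repository.

```python
import json
from pathlib import Path


def summarize_json(path):
    """Load a JSON file and describe its top-level container.

    No schema is assumed: the summary reports the container type,
    its length, and (for objects) up to five keys.
    """
    with Path(path).open(encoding="utf-8") as f:
        data = json.load(f)

    summary = {"type": type(data).__name__}
    if isinstance(data, (dict, list)):
        summary["size"] = len(data)
    if isinstance(data, dict):
        summary["sample_keys"] = list(data)[:5]
    return summary
```

For example, `summarize_json("meta_data/PMC-Patients_citations.json")` would show whether the citations file is a list or a mapping and how many entries it holds.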

Dataset Version Logs

PMC-Patients is generally updated only incrementally, when new data are ready for release, so there is no need to keep an old version, and the download link stays the same.

  • v1.2: Add PMC-Patients_human.json, which contains ground-truth patient notes and their demographics annotated by experts.
  • v1.1: Add citations of articles used in PMC-Patients.


To reproduce the construction of PMC-Patients, see code/PMC-Patients_collection/. To try our baselines, see code/downstream_task/.


The PMC-Patients dataset is released under the CC BY-NC-SA 4.0 License.


      title={PMC-Patients: A Large-scale Dataset of Patient Notes and Relations Extracted from Case Reports in PubMed Central}, 
      author={Zhengyun Zhao and Qiao Jin and Sheng Yu},