This is a curated list of medical datasets best suited to learning and teaching. It is not intended to be a comprehensive list of all medical datasets; rather, the intention is to include data that has been vetted for both quality and ease of access (i.e. open-source) and is therefore well suited to educational purposes. The rationale behind this repository is explained in more detail here.
- The hope is for this to become a crowd-sourced resource, so contributions are warmly invited. Please add an Issue, submit a pull request, or send an email to hi at chrislovejoy.me.
- Links to high-quality tutorials utilising the dataset will be included along with the dataset description, where such tutorials and walk-throughs are available.
- This list exists in collaboration with other lists of medical datasets, which have different approaches and focuses.
CheXpert: 224,316 chest radiographs from 65,240 patients. Each report was labeled for the presence of 14 observations as positive, negative, or uncertain.
The National CT Colonography Trial: 825 cases of CT colonography imaging with accompanying spreadsheets that provide polyp descriptions and their location within the colon segments.
fastMRI: Several thousand knee MRIs. Requires application for access (online form).
Automatic Non-rigid Histological Image Registration (ANHIR) challenge dataset: 50+ histological sets of whole slide images.
EchoNet-Dynamic: 10,030 echocardiogram videos.
MIMIC-III: Anonymized critical care EHR database on 38,597 patients and 53,423 ICU admissions. Requires registration.
S2ORC: The Semantic Scholar Open Research Corpus: 81.1M English-language academic papers spanning many academic disciplines.
The Cancer Genome Atlas Program: over 2.5 petabytes of genomic, epigenomic, transcriptomic, and proteomic data from over 20,000 primary cancer and matched normal samples spanning 33 cancer types.
Parkinson Speech Dataset: 26 types of sound recordings taken from 20 Parkinson's patients and 20 healthy subjects.
- EHR-derived data
- Public health / population health data