Emory Knee Radiograph (MRKR) Dataset


Data description of the Emory Knee Radiograph (MRKR) Dataset hosted by Open Data on AWS.

An example notebook is included.

Summary

The Emory Knee Radiograph (MRKR) dataset is a large, demographically diverse collection of 503,261 knee radiographs from 83,011 patients, 40% of whom are African American. The dataset provides imaging data in DICOM format along with detailed clinical information, including patient-reported pain scores, diagnostic codes, and procedural codes, which are not commonly available in similar datasets. It also includes imaging metadata such as image laterality, view type, and presence of hardware, which increases its value for research and model development. MRKR addresses significant gaps in existing datasets by offering a more representative sample for studying osteoarthritis and related outcomes, particularly among minority populations, making it a valuable resource for clinicians and researchers.

Dataset Size

The total dataset is 2.3 TB and includes DICOMs (2.3 TB) and seven CSV files (2.7 GB in total) containing clinical data and metadata.

The dataset will be hosted with Open Data on AWS.

Publication

Will be available soon

License

CC-BY-SA

MRKR Dataset Patient Summary Statistics

Total Patients 83,011 (100%)
Gender
      Female 51,175 (61.6%)
      Male 31,836 (38.4%)
Age, years
      Mean 56.6 (SD 16.6)
      Median 58
Race demographics
      White 36,927 (44.5%)
      Black 33,503 (40.4%)
      Asian 2,893 (3.5%)
      Unknown/Unreported 8,751 (10.5%)
      Other 937 (1.1%)
Ethnicity
      Hispanic 2,501 (3.0%)
      Non-Hispanic 66,378 (80.0%)
      Unknown/Unreported 14,132 (17.0%)
Clinical outcomes
      Arthroplasty 14,843 (17.9%)



Dataset Structure

Filename: MRKR_CPT.csv
File size: 178 MB
Total rows: 6,216,190
Description: This table contains information regarding all CPT codes for a patient and corresponding dates.
Field Name Data Type Description
empi_anon Integer (8 digits) Unique patient identification number (83,011 patients)
cpt_code String (5 characters) Current Procedural Terminology code used in coding of medical services and procedures for billing (7,166 CPT codes)
cpt_group_modifier String Used to provide further information regarding service or procedure. Most CPT codes do not include modifier data. If there is modifier data, it is often used to indicate the laterality of a procedure (left or right). There can be multiple modifiers for a single CPT code entry.
date_anon Date Date when the associated procedure or service occurred.
age_at_procedure Integer Age when the procedure was performed.

Filename: MRKR_CPT_dictionary.csv
File size: 754 KB
Total rows: 7,166
Description: A lookup table between CPT codes and corresponding descriptions.
Field Name Data Type Description
cpt_code String (5 characters) Current Procedural Terminology code used in the coding of medical services and procedures for billing.
cpt_description String Description of the procedure. There are some unique CPT codes that share the same description.
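
The two CPT tables share the cpt_code field, so descriptions can be attached to procedure rows with a simple lookup. The sketch below is one possible way to do this using only the Python standard library. The helper names (load_cpt_descriptions, annotate_cpt_rows) are hypothetical, and the rows are made-up toy values (codes, dates, and modifier included), not data from MRKR.

```python
import csv
import io


def load_cpt_descriptions(dictionary_file):
    """Map each cpt_code to its cpt_description."""
    return {row["cpt_code"]: row["cpt_description"]
            for row in csv.DictReader(dictionary_file)}


def annotate_cpt_rows(cpt_file, descriptions):
    """Yield each CPT row with a cpt_description field added.

    Codes missing from the lookup get an empty description.
    """
    for row in csv.DictReader(cpt_file):
        row["cpt_description"] = descriptions.get(row["cpt_code"], "")
        yield row


# Made-up rows for illustration only; not taken from the dataset.
toy_dictionary = io.StringIO(
    "cpt_code,cpt_description\n"
    "00001,Example procedure A\n"
    "00002,Example procedure B\n"
)
toy_cpt = io.StringIO(
    "empi_anon,cpt_code,cpt_group_modifier,date_anon,age_at_procedure\n"
    "10000001,00001,,2015-01-01,60\n"
    "10000002,00002,RT,2016-02-03,45\n"
    "10000002,00009,,2016-02-04,45\n"
)

descriptions = load_cpt_descriptions(toy_dictionary)
annotated = list(annotate_cpt_rows(toy_cpt, descriptions))
for row in annotated:
    print(row["empi_anon"], row["cpt_code"],
          row["cpt_description"] or "(no description)")
```

With the real files, the in-memory strings would be replaced by file handles, for example open("MRKR_CPT_dictionary.csv", newline=""). Because MRKR_CPT.csv has over six million rows, annotate_cpt_rows is written as a generator so rows can be processed one at a time.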

Filename: MRKR_ICD.csv
File size: 1.7 GB
Total rows: 21,956,056
Description: ICD9 and ICD10 diagnoses for patients with corresponding dates. Certain diseases of interest are indicated by binary flags to ease data cleaning.
Field Name Data Type Description
empi_anon Integer (8 digits) Unique patient identification number (83,011 unique patients)
ICD9 String International Classification of Diseases - 9 (12,418 unique codes)
ICD10 String International Classification of Diseases - 10 (26,963 unique codes)
date_anon Date Date when the diagnosis code was entered.
age_at_dx Integer Age when the diagnosis was recorded.
DX_LINE String Primary, Secondary, Active, Not Recorded, Resolved, Canceled, Inactive.
DX_ICD_SCOPE String Billing Diagnosis, Discharge Diagnosis, Admitting Diagnosis, Referring Diagnosis, Not Recorded, Reason For Visit, Problem List, Working Diagnosis, Other Diagnosis, Final, Pre-Op Diagnosis, Post-Op Diagnosis, Principal Diagnosis, Suggested Billing.
autoimmune Binary If ICD code corresponds to an autoimmune disease such as rheumatoid arthritis, juvenile arthritis, gout, etc.
diabetes Binary If ICD code corresponds to type I or type II diabetes.
hypertension Binary If ICD code corresponds to hypertension.
joint_infection Binary If ICD code corresponds to a knee joint infection.
knee_osteoarthritis Binary If ICD code corresponds to knee osteoarthritis.
knee_osteomyelitis Binary If ICD code corresponds to knee osteomyelitis.
obesity Binary If ICD code corresponds to obesity.
nicotine_use Binary If ICD code corresponds to nicotine dependence.
trauma_lower_extremity Binary If ICD code corresponds to lower extremity trauma.
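
The binary flag columns make it possible to build patient cohorts without matching ICD codes by hand. The following is a minimal sketch of that idea. It assumes the flags are stored as 0/1 values (the exact encoding in the CSV is not stated here, so the is_set helper also accepts "1.0" and "true"); the function names and toy rows are hypothetical.

```python
import csv
import io

# Assumed encodings of a set flag; adjust after inspecting the real file.
TRUTHY = {"1", "1.0", "true"}


def is_set(value):
    """Return True when a flag cell holds one of the assumed truthy values."""
    return (value or "").strip().lower() in TRUTHY


def patients_with_flag(icd_file, flag):
    """Return the set of empi_anon values with at least one row where `flag` is set."""
    return {row["empi_anon"]
            for row in csv.DictReader(icd_file)
            if is_set(row.get(flag))}


# Made-up rows for illustration only; codes and IDs are placeholders.
toy_icd = io.StringIO(
    "empi_anon,ICD9,ICD10,knee_osteoarthritis,diabetes\n"
    "10000001,,X00.0,1,0\n"
    "10000001,,X00.1,0,0\n"
    "10000002,000.0,,0,1\n"
    "10000003,,X00.2,0,0\n"
)

knee_oa_patients = patients_with_flag(toy_icd, "knee_osteoarthritis")
print(sorted(knee_oa_patients))
```

Since a patient can have many diagnosis rows, the result is a set of patient IDs rather than a list of rows.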

Filename: MRKR_ICD_dictionary.csv
File size: 1.9 MB
Total rows: 25,209
Description: Lookup table for ICD9 (International Classification of Diseases) and ICD10 codes and corresponding descriptions.
Field Name Data Type Description
ICD9 String ICD9 code.
ICD10 String ICD10 code.
DX_NAME String Diagnosis name or description.

Filename: MRKR_pain.csv
File size: 137 MB
Total rows: 4,975,933
Description: Pain scores self-reported by patients during any encounter, including outpatient, emergency, and perioperative encounters. Pain scores related to knees are curated.
Field Name Data Type Description
empi_anon Integer (8 digits) Unique patient identification number (83,011 unique patients)
pain_location String Raw, uncurated strings of pain locations entered by staff. Approximately 75% of entries are blank.
knee_pain Binary Curated using regular expressions to identify if the pain_location is definitely knee related.
pain_score Integer 0 - 10 pain score.
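
Because most pain_location entries are blank, the curated knee_pain flag is the practical way to isolate knee-related scores. The sketch below averages knee-related pain scores per patient. It assumes knee_pain is stored as 0/1 and that pain_score may be blank; the function name and toy rows are hypothetical.

```python
import csv
import io
from collections import defaultdict


def mean_knee_pain_by_patient(pain_file):
    """Return {empi_anon: mean pain_score} over rows flagged as knee pain.

    Rows with a blank pain_score are skipped.
    """
    scores = defaultdict(list)
    for row in csv.DictReader(pain_file):
        # Assumed encoding: "1" (or "1.0") marks a knee-related entry.
        if row["knee_pain"].strip() not in {"1", "1.0"}:
            continue
        raw = row["pain_score"].strip()
        if raw == "":
            continue
        scores[row["empi_anon"]].append(float(raw))
    return {patient: sum(vals) / len(vals) for patient, vals in scores.items()}


# Made-up rows for illustration only; not taken from the dataset.
toy_pain = io.StringIO(
    "empi_anon,pain_location,knee_pain,pain_score\n"
    "10000001,left knee,1,6\n"
    "10000001,L knee,1,8\n"
    "10000001,,0,3\n"
    "10000002,lower back,0,5\n"
    "10000003,knee,1,\n"
)

knee_pain_means = mean_knee_pain_by_patient(toy_pain)
print(knee_pain_means)
```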

Filename: MRKR_demographics.csv
File size: 4.5 MB
Total rows: 83,011
Description: Patient demographics, indexed at the patient level.
Field Name Data Type Description
empi_anon Integer (8 digits) Unique patient identification number.
sex Nominal string [male, female] - Patient sex.
race Nominal string [African American or Black, American Indian or Alaskan Native, Asian, Caucasian or White, Multiple, Native Hawaiian or Other Pacific Islander, Unknown] - Patient self-reported race.
ethnicity Nominal string [Hispanic patients, Non-Hispanic patients, Unknown, Unreported] - Patient-reported ethnicity.

Filename: MRKR_image_metadata.csv
File size: 210 MB
Total rows: 503,261
Description: Contains relevant public DICOM metadata tags that may be helpful for identifying images. Patient and exam identifiers are replaced with de-identified versions in this table and within the DICOM files. Other metadata tags that do not contain PHI and are not in this table are left intact within the DICOM files. Fields containing PHI, such as patient name, addresses, or referring physician, are removed from this table and from the DICOM files. For data curation, the fields below were modified or added.
Field Name Data Type Description
empi_anon Integer (8 digits) De-identified patient identification number.
StudyInstanceUID_anon String De-identified Study UID, shared between all images in the same study.
SeriesInstanceUID_anon String De-identified Series UID, shared between all images in the same series.
SOPInstanceUID_anon String De-identified SOP Instance UID which corresponds to a single DICOM image.
img_height Integer Image pixel height.
img_width Integer Image pixel width.
laterality Nominal string [R: Right, L: Left, B: Bilateral, -1: Unknown or not present] - Laterality of the image, as inferred by DL model.
view_position Nominal string [F: Frontal, L: Lateral, S: Sunrise, I: Internal Oblique, E: External Oblique] - Anatomical projection of radiograph, as inferred by DL model.
horizontal_flip Binary Indicates if the patient’s left side was oriented to the left side of the image, which is opposite of typical radiographic orientation, as inferred by DL model.
weight_bearing Binary Indicates if the radiograph was weight-bearing, as indicated by a marker and derived by DL model. A single exam may contain both weight-bearing and non-weight-bearing images.
inverted Binary Indicates whether pixel intensity values are inverted from typical radiographic convention, as inferred by DL model.
arthroplasty Nominal string [R: right, L: left, B: bilateral, NL: unknown (no laterality marker), NaN: no arthroplasty] - Indicates if image contains a knee arthroplasty and its laterality, as derived by DL model.
L_KLG_inference Integer [0, 1, 2, 3, 4, NaN] - KLG score of left knee in a bilateral knee radiograph, inferred by DL model.
R_KLG_inference Integer [0, 1, 2, 3, 4, NaN] - KLG score of right knee in a bilateral knee radiograph, inferred by DL model.
SeriesDescription String DICOM metadata describing the series.
StudyDescription String DICOM metadata describing the study.
StudyDate_anon Date De-identified date of radiograph.
age_at_exam Integer Age of the patient when the radiograph was performed.
dicom_path String Path to DICOM file.
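
The curated fields above can be combined to select a subset of images before loading any DICOM files. As a rough sketch, the code below picks bilateral frontal radiographs that have an inferred KLG score for both knees. It assumes missing KLG values appear as blank or "NaN" cells and that scores may be written as "2" or "2.0"; the function names, file paths, and rows are hypothetical.

```python
import csv
import io


def parse_klg(raw):
    """Return the KLG grade as an int, or None when the cell is blank or NaN."""
    raw = (raw or "").strip()
    if raw == "" or raw.lower() == "nan":
        return None
    return int(float(raw))


def bilateral_frontal_with_klg(metadata_file):
    """Return (dicom_path, left_klg, right_klg) for bilateral frontal images
    where both knees have an inferred KLG score."""
    selected = []
    for row in csv.DictReader(metadata_file):
        if row["view_position"] != "F" or row["laterality"] != "B":
            continue
        left = parse_klg(row["L_KLG_inference"])
        right = parse_klg(row["R_KLG_inference"])
        if left is None or right is None:
            continue
        selected.append((row["dicom_path"], left, right))
    return selected


# Made-up rows for illustration only; paths and values are placeholders.
toy_metadata = io.StringIO(
    "dicom_path,laterality,view_position,L_KLG_inference,R_KLG_inference\n"
    "toy/a.dcm,B,F,2,3\n"
    "toy/b.dcm,L,L,,\n"
    "toy/c.dcm,B,F,NaN,1\n"
    "toy/d.dcm,B,F,0.0,4.0\n"
)

selected_images = bilateral_frontal_with_klg(toy_metadata)
for path, left, right in selected_images:
    print(path, "left KLG:", left, "right KLG:", right)
```

Each selected dicom_path could then be passed to a DICOM reader of your choice. The horizontal_flip and inverted fields may also be worth checking at that stage, since they flag images that deviate from typical radiographic orientation or pixel intensity convention.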