Race labels for MIMIC-CXR?
Hi,
I was wondering how to obtain the race labels for MIMIC-CXR.
I do have access to https://physionet.org/content/mimic-cxr/2.0.0/ and https://physionet.org/content/mimic-cxr-jpg/2.0.0/ but could not locate where you get the white/asian/black labels.
For example, how do you create the modified_viewposition_race_4-race-ethnicity_60-10-30_split_with_gender_age_ver_b.csv that you use in the training code?
Thanks for any help,
Best,
Robin
Hi Robin,
Race labels can be found here, under the core directory, in the admissions dataset. From there you can join the subject_id with the CXR subject_id.
Let us know if we can help with anything else!
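For readers following along, the join described above might look roughly like the sketch below. It is only an illustration, not the maintainers' preprocessing code: the toy values are made up, `attach_ethnicity` is a hypothetical helper name, and `dicom_id` is assumed to be the per-image identifier in the CXR metadata table. Only `subject_id` and `ethnicity` are taken from this thread.

```python
import pandas as pd


def attach_ethnicity(cxr_metadata: pd.DataFrame, admissions: pd.DataFrame) -> pd.DataFrame:
    """Add an `ethnicity` column to per-image CXR rows by joining on subject_id."""
    # Collapse admissions to distinct (subject_id, ethnicity) pairs, since a
    # subject usually has several admissions with the same recorded value.
    pairs = admissions[["subject_id", "ethnicity"]].drop_duplicates()
    # Inner join: images whose subject has no admission record are dropped.
    # A subject with several distinct ethnicity values yields one row per value.
    return cxr_metadata.merge(pairs, on="subject_id", how="inner")


# Toy stand-ins for the real CSV files (values are made up for illustration).
admissions = pd.DataFrame({
    "subject_id": [1, 1, 2],
    "ethnicity": ["WHITE", "WHITE", "ASIAN"],
})
cxr_metadata = pd.DataFrame({
    "subject_id": [1, 2, 2, 3],
    "dicom_id": ["a", "b", "c", "d"],
})

labeled = attach_ethnicity(cxr_metadata, admissions)
print(labeled)
```

With the real data, the two toy frames would be replaced by `pd.read_csv` calls on the admissions table and the CXR metadata file.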
Ah, amazing, thanks, that clears it up! Another question: am I understanding correctly that some of the code that preprocesses MIMIC-CXR is not in this repo? That is, one cannot just follow these steps:
- Fork/Download the GitHub repository.
- Fetch the data from the data URLs for open-source datasets and drop them in the data folder.
- Run the corresponding training code and save the trained model in the models folder.
for MIMIC-CXR, because https://github.com/Emory-HITI/AI-Vengers/blob/cbdf593b0d852e3078abbc72cf92aad03496511d/training_code/CXR_training/MIMIC/MIMIC_resnet34_race_detection_2021_06_29.ipynb starts from some dataframe that you have created with some code that is not in this repo?
That's correct. At the moment you would have to join the csv dataframes and make your own train-val-test splits, like what we did with modified_viewposition_race_4-race-ethnicity_60-10-30_split_with_gender_age_ver_b.csv
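As a rough sketch of the "make your own train-val-test splits" step, the snippet below assigns a 60-10-30 split (the ratio suggested by the CSV filename). It makes two assumptions that are not confirmed in this thread: the split is done per subject, so that all images of one patient land in the same partition, and the partitions are named `train`, `validate` and `test`. The helper name `split_by_subject` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def split_by_subject(df: pd.DataFrame, fractions=(0.6, 0.1, 0.3), seed: int = 0) -> pd.DataFrame:
    """Return a copy of df with a `split` column assigned per subject_id."""
    # Shuffle the unique subjects reproducibly.
    rng = np.random.default_rng(seed)
    subjects = rng.permutation(df["subject_id"].unique()).tolist()

    # Turn the fractions into subject counts; the remainder goes to test.
    n_train = int(round(fractions[0] * len(subjects)))
    n_val = int(round(fractions[1] * len(subjects)))
    n_test = len(subjects) - n_train - n_val
    labels = ["train"] * n_train + ["validate"] * n_val + ["test"] * n_test

    # Map every image row to the partition of its subject.
    split_of = dict(zip(subjects, labels))
    out = df.copy()
    out["split"] = out["subject_id"].map(split_of)
    return out


# Toy data: 10 subjects with 2 images each (made up for illustration).
images = pd.DataFrame({
    "subject_id": [s for s in range(10) for _ in range(2)],
    "dicom_id": [f"img{i}" for i in range(20)],
})
split_df = split_by_subject(images)
print(split_df.groupby("split")["subject_id"].nunique())
```

Splitting by subject rather than by image avoids the same patient appearing in both training and test data, which would otherwise inflate test performance.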
I see.
One more question that came up:
Did you try to handle subjects with multiple values for ethnicity in any way? For example, the following code shows that 168 subjects were entered as both BLACK/AFRICAN AMERICAN and WHITE, and 2489 subjects as both OTHER and WHITE:
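import os
import pandas as pd

# mimic_folder is assumed to be the path of the directory containing admissions.csv
admissions_df = pd.read_csv(os.path.join(mimic_folder, 'admissions.csv'))
# One row per distinct (subject_id, ethnicity) pair
ethnicity_df = admissions_df.loc[:, ['subject_id', 'ethnicity']].drop_duplicates()
# Subjects that still appear more than once have conflicting ethnicity values
v = ethnicity_df.subject_id.value_counts()
subject_id_more_than_once = v.index[v.gt(1)]
ambiguous_ethnicity_df = ethnicity_df[ethnicity_df.subject_id.isin(subject_id_more_than_once)]
# Count each combination of values, e.g. "BLACK/AFRICAN AMERICAN_WHITE"
grouped = ambiguous_ethnicity_df.groupby('subject_id')
grouped.aggregate(lambda x: "_".join(sorted(x))).ethnicity.value_counts()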
Wow! Great catch! As far as I know, we were unaware of this multiple-ethnicity problem. I will look into this and test using these changes. I suspect it could improve performance by reducing noise from mislabeled patients.
Thank you!
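One possible way to handle the multiple-ethnicity subjects discussed above is simply to exclude them before joining with the CXR data. The sketch below shows that policy only as an assumption; the thread does not say which approach the maintainers ended up using, and alternatives (for example, keeping the most frequent value) are equally plausible. The helper name `drop_ambiguous_subjects` and the toy rows are hypothetical.

```python
import pandas as pd


def drop_ambiguous_subjects(admissions: pd.DataFrame) -> pd.DataFrame:
    """Keep one (subject_id, ethnicity) row per subject with a single recorded value."""
    # Distinct pairs, ignoring rows where either field is missing.
    pairs = admissions[["subject_id", "ethnicity"]].dropna().drop_duplicates()
    # Number of distinct ethnicity values recorded for each subject.
    n_values = pairs.groupby("subject_id")["ethnicity"].nunique()
    unambiguous_ids = n_values.index[n_values.eq(1)]
    return pairs[pairs["subject_id"].isin(unambiguous_ids)].reset_index(drop=True)


# Toy admissions table (made up): subject 2 has two conflicting values.
admissions = pd.DataFrame({
    "subject_id": [1, 1, 2, 2, 3],
    "ethnicity": ["WHITE", "WHITE", "WHITE", "BLACK/AFRICAN AMERICAN", "ASIAN"],
})
clean = drop_ambiguous_subjects(admissions)
print(clean)
```

The resulting table has exactly one row per subject, so a later join on `subject_id` cannot duplicate image rows.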