Metadata format for dataloader
Hello,
I intend to use a JSON file to load my data. The loading part lacks information about sample_metadata. I looked through nlst.py in the loading folder, but I would appreciate it if you could share a sample_metadata.json file with placeholder values that shows the expected format.
Best Regards
Hi,
Thanks for reaching out!
The file should be a list of dictionaries, one per patient. The dictionaries are organized hierarchically:
patient metadata > list of exams > dictionary per series > series information
Here is what it looks like in more detail. I've also included an example JSON with all the relevant dictionary keys and random/empty values: nlst_sample.json
{
  "pid": "XYZ", # PATIENT ID
  "split": "test", # SPLIT
  "accessions": [ # LIST of EXAMS
    { # DICT for EXAM 1
      "exam": "exam_id_timepoint", # EXAM ID + TIMEPOINT
      "accession_number": "exam_id", # EXAM ID
      "screen_timepoint": "timepoint", # TIMEPOINT
      "date": "YYYYMMDD", # EXAM DATE
      "image_series": { # DICT of SERIES
        "series_id1": { # DICT for SERIES 1
          "paths": ["/path/to/slice1.png", "/path/to/slice2.png", "/path/to/slice3.png"], # LIST of PATHS to DICOMs/PNGs
          "slice_location": [3,1,2], # SLICE LOCATIONS from DICOM METADATA
          "slice_number": [3,1,2], # SLICE NUMBERS from DICOM METADATA
          "img_position": [3,1,2], # IMAGE POSITION from DICOM METADATA
          "pixel_spacing": [0.703125, 0.703125], # PIXEL SPACING from DICOM METADATA
          "slice_thickness": 2.5, # SLICE THICKNESS from DICOM METADATA
          "series_data": { # DICT of SERIES METADATA
            "reconfilter": ["STANDARD"], # RECONSTRUCTION FILTER from NLST
            "reconthickness": [2.5], # RECONSTRUCTION THICKNESS from NLST
            "manufacturer": [1], # MANUFACTURER from NLST
            ... # OTHER METADATA from NLST
          },
        },
        "series_id2": {} # DICT for SERIES 2
      },
      "abnormalities": { # DICT of ABNORMALITIES from NLST
        "sct_ab_desc": [51], # SCT ABNORMALITY DESCRIPTION from NLST
        "sct_ab_num": [1], # SCT ABNORMALITY NUMBER from NLST
        ... # OTHER ABNORMALITY DATA
      },
    },
    {}, # DICT for EXAM 2
    {} # DICT for EXAM 3
  ],
  "pt_metadata": { # DICT of PATIENT METADATA from NLST
    "race": [X], # RACE
    "cigsmok": [0], # CIGARETTE SMOKING
    "candx_days": [45], # DAYS TO CANCER DIAGNOSIS
    ...
  }
}
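Note that the `#` annotations and `...` lines above are for explanation only; a real JSON file cannot contain them. As a rough sketch, one way to produce a comment-free file with the same hierarchy is to build the record in Python and serialize it. All values here are placeholders, and `find_problems` and `SERIES_KEYS` are hypothetical helpers (not part of the repository) that only check the keys shown in the example; which keys the loader actually requires is an assumption.

```python
import json

# Placeholder record mirroring the annotated example (made-up values).
patient = {
    "pid": "XYZ",
    "split": "test",
    "accessions": [
        {
            "exam": "exam_id_timepoint",
            "accession_number": "exam_id",
            "screen_timepoint": "timepoint",
            "date": "YYYYMMDD",
            "image_series": {
                "series_id1": {
                    "paths": ["/path/to/slice1.png", "/path/to/slice2.png", "/path/to/slice3.png"],
                    "slice_location": [3, 1, 2],
                    "slice_number": [3, 1, 2],
                    "img_position": [3, 1, 2],
                    "pixel_spacing": [0.703125, 0.703125],
                    "slice_thickness": 2.5,
                    "series_data": {"reconfilter": ["STANDARD"], "reconthickness": [2.5]},
                }
            },
            "abnormalities": {"sct_ab_desc": [51], "sct_ab_num": [1]},
        }
    ],
    "pt_metadata": {"cigsmok": [0], "candx_days": [45]},
}

# Assumption: these are the per-series keys shown in the example above.
SERIES_KEYS = ("paths", "slice_location", "slice_number",
               "img_position", "pixel_spacing", "slice_thickness")


def find_problems(record):
    """Return a list of human-readable problems found in one patient record."""
    problems = []
    for key in ("pid", "split", "accessions", "pt_metadata"):
        if key not in record:
            problems.append(f"missing patient key: {key}")
    for exam in record.get("accessions", []):
        for series_id, series in exam.get("image_series", {}).items():
            if not series:  # empty placeholder series, nothing to check
                continue
            for key in SERIES_KEYS:
                if key not in series:
                    problems.append(f"{series_id}: missing key {key}")
            n_paths = len(series.get("paths", []))
            if len(series.get("slice_location", [])) != n_paths:
                problems.append(f"{series_id}: slice_location length != number of paths")
    return problems


# The metadata file is a list of patient records.
serialized = json.dumps([patient], indent=2)
```

Writing `serialized` to a file (for example with `open(path, "w").write(serialized)`) would then give a file in the format described above.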
Thank you very much for the detailed explanation. That will work for us. I would also like to ask about the file and folder layout of your NLST dataset. If you could share that too, I would greatly appreciate it.
Hi,
I'm unsure what is meant by file and folder order.
Let me clarify. I want to try the train.py code, and I need to load the dataset. NLST datasets usually have a very complicated file layout. Below is one dataset folder structure that I found on the Internet.
Here I think the first folder is the PID number, the second is the Exam ID, and the last one contains the CT scan results. What do these folders look like in your NLST dataset?
Hi,
It follows a similar structure, but the directory structure shouldn't matter if the JSON is configured as above. What matters is that every series has the list of paths to the PNG/DICOM images, and those are then loaded during training. In the most simplified setting, a sample in a training batch just requires the image paths and the label.
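To illustrate the point that only the paths and a label matter, here is a minimal sketch of flattening the metadata list into one sample per series. This is not the repository's loader: `build_samples`, `toy_label`, and the output keys are hypothetical names, ordering slices by `slice_location` is an assumption, and the label rule is a toy placeholder rather than the actual label definition.

```python
def build_samples(metadata, label_fn, split=None):
    """Flatten patient > exams > series into a list of {paths, label} samples."""
    samples = []
    for patient in metadata:
        if split is not None and patient.get("split") != split:
            continue
        for exam in patient.get("accessions", []):
            # Empty exam dicts simply have no "image_series" entry.
            for series_id, series in exam.get("image_series", {}).items():
                paths = series.get("paths", [])
                if not paths:  # skip empty placeholder series
                    continue
                locations = series.get("slice_location", [])
                if len(locations) == len(paths):
                    # Assumption: order slices by their slice location.
                    order = sorted(range(len(paths)), key=lambda i: locations[i])
                    paths = [paths[i] for i in order]
                samples.append({
                    "pid": patient.get("pid"),
                    "exam": exam.get("exam"),
                    "series": series_id,
                    "paths": paths,
                    "label": label_fn(patient, exam),
                })
    return samples


def toy_label(patient, exam):
    """Toy rule for illustration only: 1 if any candx_days value is recorded."""
    return 1 if patient.get("pt_metadata", {}).get("candx_days") else 0


# Tiny in-memory example using the placeholder values from the JSON above.
metadata = [{
    "pid": "XYZ",
    "split": "test",
    "accessions": [
        {
            "exam": "exam_id_timepoint",
            "image_series": {
                "series_id1": {
                    "paths": ["/path/to/slice1.png", "/path/to/slice2.png", "/path/to/slice3.png"],
                    "slice_location": [3, 1, 2],
                },
                "series_id2": {},
            },
        },
        {},
    ],
    "pt_metadata": {"candx_days": [45]},
}]

samples = build_samples(metadata, toy_label, split="test")
```

Because the sketch reads every path from the JSON, the on-disk folder layout never enters the picture; only the stored paths have to resolve to real files.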
Thank you for your help, Peter! This info will work for me.