ETL tool for the National Sleep Research Resource SHHS dataset
To use, simply pip install -r requirements.txt
and you may then use this Python script!
To get configurable parameters available for you when using the script, simply run python shhs-prepper.py --help
usage: shhs-prepper.py [-h] [--batch_size BATCH_SIZE]
[--population_size POPULATION_SIZE] [--interspersed]
[--patient_ids PATIENT_IDS [PATIENT_IDS ...]]
[--path PATH] [--dump_path DUMP_PATH] [--profile]
Turns an shhs dataset into a useful dataframe/CSV set
optional arguments:
-h, --help show this help message and exit
--batch_size BATCH_SIZE
How many EDF samples to batch together. Affects column
size
--population_size POPULATION_SIZE
How many patients should be used?
--interspersed Should different EEG sensors be interleaved into the
same/different row?
--path PATH Path to SHHS folder
--dump_path DUMP_PATH
Path to CSV result dump
--profile Profile mode?
Upon running the script, it creates a CSV file and dumps it to a path (by default, set to results.csv
on the current working directory.
batch_size
controls how many EDF samplings are recorded together as one CSV row. This means that given an EEG signal sampled at 125Hz (125 signal samplings per second), abatch_size
of 125 would store 1 second of EEG data samples per row, with 125 columns representing each sampled signal for a given EEG sensor (there are 2 on the SHHS dataset)interspersed
controls how many different EEG sensors should be interleaves into the same row. If passed (by running the program with the--interspersed
option), the 2 EEG sensors used in the SHHS dataset would be represented in the same row. If not, EEG sensors would be represented into their own individual rowspath
controls the path to the SHHS dataset folder. By default, it assumes the current working directory is the SHHS folderdump_path
controls the path and name of the resulting CSV file
- With
batch_size
set to 2 and--interspersed
, you would have the following columns per each row on your CSV:subject_id
- Patient Identifierin_cohort1
- Did the patient have a Cohort 1 EDFin_cohort2
- Did the patient have a Cohort 2 EDFcohort
- What cohort (1 or 2) was this EEG sampling drawn fromsleep_stage
- What was the patient's sleep stage at this time?eeg1_0
,eeg1_1
,eeg2_0
,eeg2_1
- With--interleaved
, we store botheeg1
andeeg2
sensor data in the same row. Withbatch_size
set to 2, 2 EDF samplings foreeg1
andeeg2
are stored per row.
- With
batch_size
set to 50, you would have the following columns per each row on your CSV:subject_id
- Patient Identifierin_cohort1
- Did the patient have a Cohort 1 EDFin_cohort2
- Did the patient have a Cohort 2 EDFcohort
- What cohort (1 or 2) was this EEG sampling drawn fromsleep_stage
- What was the patient's sleep stage at this time?eeg_0
,eeg_1
...eeg_49
- Withbatch_size
set to 50, 50 EDF EEG samplings are stored per row.eeg_signal
- Tells us whether this represents the 1st or 2nd EEG signal (1
or2
), considering the--interspersed
option was not used.
You most likely won't need to use it, but it runs cProfiler
and to be honest, was an added option as I worked to improve the
profiling of this script to run through across the SHHS monstrosity of a dataset.