This repository contains routines to download and preprocess data from the EEGBCI dataset [1], available from PhysioNet [2]. It also contains a PyTorch Dataset container for iterating over records, and a LightningDataModule for interfacing with PyTorch Lightning.
Run the following to install the package into an environment:
git clone https://github.com/neergaard/eegbci-data.git
cd eegbci-data
pip install -e .
The data can be downloaded either from the command line:
# -o: save directory
# -n: number of subjects to download (if None, all)
python -m eegbci.fetch_data -o data/ -n 10
or from within Python:
from eegbci.fetch_data import download_dataset
download_dataset('data/', 10)
The underlying download mechanisms come from the MNE toolbox, which will ask you to confirm the default location of the dataset.
The downloaded data can then be preprocessed, again either from the command line:
# -d: save directory from above; -o: output directory for processed files
# --fs: sampling frequency (Hz)
# --tmin / --tmax: start / end time (s) of extracted epochs relative to cue onset
# --subjects: select specific subjects or select all (if None)
# --freq_band: lower and upper frequencies (Hz) for the passband filter
python -m eegbci.preprocess_data -d data/ \
    -o data/processed/ \
    --fs 128 \
    --tmin -1 \
    --tmax 4 \
    --subjects None \
    --freq_band 0.3 35.
or from within Python:
from eegbci.preprocess_data import process_data

process_data(
    data_dir='data/',
    output_dir='data/processed',
    fs=128,
    tmin=-1,
    tmax=4,
    subjects=None,
    freq_band=[0.3, 35.],
)
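To make the roles of fs, tmin and tmax concrete, here is a minimal sketch of cutting fixed-length epochs around cue onsets. It is not the package's implementation: the helper name extract_epochs, the single-channel list input, and the choice to drop epochs that run past the recording are all assumptions made for illustration.

```python
def extract_epochs(signal, cue_onsets_s, fs=128, tmin=-1.0, tmax=4.0):
    """Cut one epoch per cue onset from a single-channel signal.

    signal       : sequence of samples recorded at `fs` Hz
    cue_onsets_s : cue onset times in seconds
    tmin, tmax   : epoch start and end relative to the cue, in seconds
    Epochs that would start before or end after the recording are skipped.
    """
    epochs = []
    for onset in cue_onsets_s:
        start = round((onset + tmin) * fs)  # first sample of the epoch
        stop = round((onset + tmax) * fs)   # one past the last sample
        if start < 0 or stop > len(signal):
            continue  # incomplete epoch, skip it
        epochs.append(signal[start:stop])
    return epochs


# 10 s of dummy data at 128 Hz, with cues at 2 s, 5 s and 8 s
signal = list(range(10 * 128))
epochs = extract_epochs(signal, cue_onsets_s=[2.0, 5.0, 8.0])
```

With tmin=-1 and tmax=4 at 128 Hz, every complete epoch spans (4 - (-1)) * 128 = 640 samples. The cue at 8 s is dropped in this sketch because its epoch would end at 12 s, beyond the 10 s recording.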
The provided LightningDataModule can handle both download and preprocessing of data:
import argparse

from eegbci.data_containers import EEGBCIDataModule

# Collect the dataset options from the command line
parser = argparse.ArgumentParser()
parser = EEGBCIDataModule.add_argparse_args(parser)
args = parser.parse_args()

# Settings for downloading and preprocessing the raw data
processing_kwargs = dict(
    raw_dir="./data",
    fs=128.0,
    tmin=-1.0,
    tmax=4.0,
    freq_band=[0.5, 35.0],
)

dm = EEGBCIDataModule(
    processing_kwargs=processing_kwargs,
    dataset_kwargs=vars(args),  # parsed options as a plain dict
)
The supplied arguments also let you define the number of cross-validation (CV) folds and the current fold index, and control whether samples are balanced within each batch.
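As an illustration of how a fold count and a fold index can select a subject-wise split, consider the following sketch. It does not reproduce the data module's internals: the function name cv_split, the seed parameter and the round-robin fold assignment are assumptions chosen to keep the example short.

```python
import random


def cv_split(subjects, cv, cv_idx, seed=0):
    """Split subjects into `cv` folds and hold out fold number `cv_idx`.

    Splitting is done per subject, so all records from one subject
    end up on the same side of the split.
    """
    if not 0 <= cv_idx < cv:
        raise ValueError("cv_idx must satisfy 0 <= cv_idx < cv")
    shuffled = sorted(subjects)
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    folds = [shuffled[i::cv] for i in range(cv)]  # round-robin assignment
    held_out = folds[cv_idx]
    train = [s for i, fold in enumerate(folds) if i != cv_idx for s in fold]
    return train, held_out


subjects = list(range(1, 11))  # ten hypothetical subject IDs
train, held_out = cv_split(subjects, cv=5, cv_idx=0)
```

Because the shuffle is seeded, calling cv_split with the same seed and a different cv_idx yields the other folds of the same partition, so each subject is held out exactly once across the five indices.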
A full list of arguments is available via --help:
python -m eegbci.datamodule.eegbci --help
usage: eegbci.py [-h] [--raw_dir RAW_DIR] [--overwrite] [-d DATA_DIR] [--balanced_sampling] [--cv CV] [--cv_idx CV_IDX] [--eval_ratio EVAL_RATIO] [--max_eval_records MAX_EVAL_RECORDS] [--n_channels N_CHANNELS] [--n_jobs N_JOBS] [--n_records N_RECORDS] [--scaling {robust,standard}]
[--sequence_length SEQUENCE_LENGTH]

optional arguments:
  -h, --help            show this help message and exit
  --raw_dir RAW_DIR     Location of raw EDF files.
  --overwrite           Overwrite H5 files in directory.

dataset:
  -d DATA_DIR, --data_dir DATA_DIR
                        Location of H5 dataset files.
  --balanced_sampling   Whether to balance batches.
  --cv CV               Number of CV folds.
  --cv_idx CV_IDX       Specific CV fold index.
  --eval_ratio EVAL_RATIO
                        Ratio of validation subjects.
  --max_eval_records MAX_EVAL_RECORDS
                        Max. number of subjects to use for eval.
  --n_channels N_CHANNELS
                        Number of applied EEG channels.
  --n_jobs N_JOBS       Number of parallel jobs to run for prefetching.
  --n_records N_RECORDS
                        Total number of records to use.
  --scaling {robust,standard}
                        How to scale EEG data.
  --sequence_length SEQUENCE_LENGTH
                        Number of sequences in each batch element.
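The help text does not say how the two --scaling choices are computed. The sketch below shows one common reading, assumed here rather than taken from the package: "standard" subtracts the mean and divides by the standard deviation, while "robust" subtracts the median and divides by the interquartile range. The function name scale is hypothetical.

```python
import statistics


def scale(x, method="robust"):
    """Scale a 1-D sequence of samples with the chosen method."""
    if method == "standard":
        center = statistics.fmean(x)
        spread = statistics.pstdev(x)  # population standard deviation
    elif method == "robust":
        center = statistics.median(x)
        q1, _, q3 = statistics.quantiles(x, n=4, method="inclusive")
        spread = q3 - q1  # interquartile range
    else:
        raise ValueError(f"unknown scaling method: {method}")
    if spread == 0:
        spread = 1.0  # avoid division by zero on constant input
    return [(v - center) / spread for v in x]


# One large artifact among otherwise small values
x = [1.0, 2.0, 3.0, 4.0, 100.0]
standard = scale(x, "standard")
robust = scale(x, "robust")
```

Under this reading, robust scaling keeps the four ordinary samples spread around zero despite the artifact, whereas standard scaling compresses them because the artifact inflates both the mean and the standard deviation.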
[1] Schalk, G., McFarland, D.J., Hinterberger, T., Birbaumer, N., Wolpaw, J.R. BCI2000: A General-Purpose Brain-Computer Interface (BCI) System. IEEE Transactions on Biomedical Engineering 51(6):1034-1043, 2004.
[2] Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P.C., Mark, R., ... & Stanley, H.E. PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals. Circulation 101(23):e215-e220, 2000.