This work presents the solution proposed by Universidad CEU Cardenal Herrera (ESAI-CEU-UCH) for the Kaggle American Epilepsy Society Seizure Prediction Challenge. The proposed solution ranked 4th in the competition.
Different kinds of input features (produced by different preprocessing pipelines) and different statistical models are proposed. This diversity is intended to improve the result of the model combination.
It is important to note that none of the proposed systems uses the test set for calibration. The competition allowed this kind of model calibration using the test set, but doing so would reduce the reproducibility of the results in a real-world implementation.
This solution uses the following open source software:
- APRIL-ANN toolkit v0.4.0, a pattern recognition toolkit with a Lua scripting layer and a C/C++ core. Because this tool is very new, its installation and configuration are included in the pipeline.
- R project v3.0.2, for statistical computing, a widespread tool in Kaggle competitions. The packages R.matlab, MASS, fda.usc, fastICA, stringr and plyr are required to run the solution.
- GNU BASH v4.3.11, with cp, mv, find, mktemp, sort and tr command line tools.
The solution is prepared to run on a Linux platform with Ubuntu 14.04 LTS. It may also run on other Debian-based distributions, but this has not been tested.
Additionally, the APRIL-ANN toolkit has been compiled using the Intel MKL library, which is needed to ensure reproducibility of the ANN models in the solution. However, the delivered code revision uses the ATLAS library by default, which is open source and standard on Linux systems; it can also be configured to use the Intel MKL library.
The minimum requirements for the correct execution of this software are:
- 6GB of RAM.
- 1.5GB of disk space.
The experimentation has been performed on a cluster of three computers with the same hardware configuration:
- Server rack Dell PowerEdge 210 II.
- Intel Xeon E3-1220 v2 at 3.10GHz with 16GB RAM (4 cores).
- 2.6TB of NFS storage.
The solution can be generated by executing the BASH scripts located at the repository root folder. In order to understand in depth which features and models are used, we recommend reading the report.
The configuration of the input data, subjects, and other settings is in the `settings.sh` script. The following environment variables indicate the location of data and result folders:

- `DATA_PATH=DATA` indicates where the original data is. It is organized in subfolders, one for each available subject, and these subfolders contain the corresponding MAT files.
- `TMP_PATH=TMP` indicates the folder for intermediate results (feature extraction).
- `MODELS_PATH=MODELS` indicates the folder where trained models and their results are stored.
- `SUBMISSIONS_PATH=SUBMISSIONS` indicates where test results will be generated.
- `USE_MKL=0` change this flag if you want to compile APRIL-ANN using the Intel MKL library.
All other environment variables are computed depending on these root paths.
Note that the whole solution must be executed from the root path of the git repository. The list of available subjects depends on the subfolders of `$DATA_PATH`.
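As an illustration only, a minimal `settings.sh` might look like the sketch below; the variable names are those listed above, while the comments and the final export line are assumptions about how the script is written.

```bash
# Minimal sketch of settings.sh (illustrative only; the actual script may
# differ and computes further variables from these root paths).
DATA_PATH=DATA                 # original MAT files, one subfolder per subject
TMP_PATH=TMP                   # intermediate results (feature extraction)
MODELS_PATH=MODELS             # trained models and their results
SUBMISSIONS_PATH=SUBMISSIONS   # generated test submissions
USE_MKL=0                      # flag to compile APRIL-ANN with Intel MKL
export DATA_PATH TMP_PATH MODELS_PATH SUBMISSIONS_PATH USE_MKL  # assumed
```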
It is possible to train and test the two selected submissions by executing the `train.sh` script:

$ ./train.sh
It generates intermediate files in the `$TMP_PATH` folder. First, all the proposed features are written to disk:

1. `$TMP_PATH/FFT_60s_30s_BFPLOS/` contains FFT filtered features using 1 min. windows.
2. `$TMP_PATH/FFT_60s_30s_BFPLOS_PCA/` contains the PCA transformation of (1).
3. `$TMP_PATH/FFT_60s_30s_BFPLOS_ICA/` contains the ICA transformation of (1).
4. `$TMP_PATH/COR_60s_30s/` contains eigenvalues of windowed correlation matrices, using 1 min. windows over segments.
5. `$TMP_PATH/CORG/` contains eigenvalues of correlation matrices computed over the whole segment.
6. `$TMP_PATH/COVRED/` contains different global statistics computed for each segment.
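As an optional sanity check (not part of the repository scripts, and assuming `settings.sh` can be sourced to define `$TMP_PATH`), one can verify that all six feature folders were generated:

```bash
# Illustrative check only: confirm that each expected feature folder exists
# and is non-empty after feature extraction.
source settings.sh
for d in FFT_60s_30s_BFPLOS FFT_60s_30s_BFPLOS_PCA FFT_60s_30s_BFPLOS_ICA \
         COR_60s_30s CORG COVRED; do
  if [ -d "$TMP_PATH/$d" ] && [ -n "$(ls -A "$TMP_PATH/$d")" ]; then
    echo "OK      $TMP_PATH/$d"
  else
    echo "MISSING $TMP_PATH/$d"
  fi
done
```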
Besides the features, PCA and ICA transformations are computed for each subject, and the transformation matrices are stored at:
$MODELS_PATH/PCA_TRANS
$MODELS_PATH/ICA_TRANS
This preprocessing can be executed, without training, by using the `preprocess.sh` script.
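For example, following the same usage pattern as `train.sh`:

$ ./preprocess.sh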
Once the preprocessing step is done, training of the proposed models starts. The model results are stored in subfolders of `$MODELS_PATH`. These subfolders contain similar data:

- `validation_SUBJECT.txt` is the concatenation of cross-validation outputs, used to optimize the final ensemble.
- `validation_SUBJECT.test.txt` contains the test results corresponding to the indicated subject (without CSV header).
- `test.txt` is the concatenation of all test results with the CSV header needed to submit it to Kaggle.
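As a minimal sketch of how such a `test.txt` could be assembled by hand (the repository scripts already do this; the `clip,preictal` header and the exact file layout are assumptions, and the result folder used here is one of those listed below):

```bash
# Illustrative sketch only: build test.txt by concatenating per-subject test
# outputs under a single CSV header (header and layout are assumptions).
source settings.sh
RESULT_DIR="$MODELS_PATH/ANN5_PCA_CORW_RESULT"   # one of the result folders
{
  echo "clip,preictal"                      # assumed Kaggle submission header
  cat "$RESULT_DIR"/validation_*.test.txt   # per-subject results, no header
} > "$RESULT_DIR/test.txt"
```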
The trained systems are stored in the following folders:
$MODELS_PATH/ANN2P_ICA_CORW_RESULT/
$MODELS_PATH/ANN5_PCA_CORW_RESULT/
$MODELS_PATH/ANN2_ICA_CORW_RESULT/
$MODELS_PATH/KNN_PCA_CORW_RESULT/
$MODELS_PATH/KNN_ICA_CORW_RESULT/
$MODELS_PATH/KNN_CORG_RESULT/
$MODELS_PATH/KNN_COVRED_RESULT/
The final submission is computed by using Bayesian Model Combination (BMC), and the resulting files are located at:

- `$SUBMISSIONS_PATH/best_ensemble.txt` is our best result, the BMC ensemble of the seven trained systems.
- `$SUBMISSIONS_PATH/best_simple_model.txt` is our best simple model result, `ANN5_PCA_CORW_RESULT`.
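As an optional, quick check before uploading (not part of the pipeline), the header and first rows of the generated submission can be previewed:

```bash
# Preview the header and first rows of the final BMC ensemble submission.
source settings.sh
head -n 5 "$SUBMISSIONS_PATH/best_ensemble.txt"
```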
The testing procedure is incorporated in the training scripts, but it is also possible to run it independently using the `test.sh` script.
In order to train a new subject, you can use the `train_subject.sh` script, which receives the name of the subject as an argument:
$ ./train_subject.sh SUBJECT_NAME
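For instance, assuming a subject folder named `Dog_1` exists under `$DATA_PATH` (the subject name here is only an example and must match one of the available subfolders):

$ ./train_subject.sh Dog_1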
It is possible to use the solution with new test data. You just need to deploy the new test files into the corresponding `$DATA_PATH/SUBJECT` folders and run the `test.sh` script:
$ ./test.sh
This script generates a random output filename using the `mktemp` command, with the pattern `test.XXXXXX.txt`, in the `$SUBMISSIONS_PATH/` folder.
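A minimal sketch of this workflow, where the subject name `Dog_1` and the MAT file pattern are only illustrative assumptions:

```bash
# Illustrative workflow sketch: deploy new test MAT files for one subject and
# run the test script (subject and file names are hypothetical).
source settings.sh
cp /path/to/new/Dog_1_test_segment_*.mat "$DATA_PATH/Dog_1/"
./test.sh
# The result is written to $SUBMISSIONS_PATH/test.XXXXXX.txt (random suffix).
```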
- Contextualized windows for ANNs have a minor bug.
- By mistake, ICA uses the test centers for the test data, instead of the training centers.
- The ICA seed was fixed after the competition, so the results are not exactly the same as the competition submission.

Other bugs have been solved in v1.0.