Liverpool-Ion-Switching

Third place solution to the Kaggle "Liverpool Ion Switching" competition.

By team Gilles & Kha Vo & Zidmie.

Getting the data

We use Chris Deotte's excellent data-without-drift dataset. Download it from Kaggle and put train_clean.csv and test_clean.csv in a data/ folder. We also require the sample_submission.csv file, which can be downloaded from the Kaggle competition page. Your directory structure should look as follows:

.
├── data
│   ├── sample_submission.csv
│   ├── test_clean.csv
│   └── train_clean.csv
├── hmm.py
├── main.py
├── processing.py
├── notebooks
│   ├── 1 - Align Channels and Signal.ipynb
│   ├── 2 - Remove Power Line Interference.ipynb
│   ├── 3 - Fit 4-state HMM (Cat 3).ipynb
│   ├── 4 - Setting the Transition Matrix.ipynb
│   ├── 5 - Fit 20-state HMM (Cat 3).ipynb
│   ├── 6 - Custom Forward-Backward (Cat 3).ipynb
│   └── 7 - Prediction Post-Processing (Cat 3).ipynb
├── output
├── LICENSE
├── README.md
└── requirements.txt
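
Before running anything, it can help to confirm that the three CSV files are where the scripts expect them. The following is a minimal sketch; the helper name `missing_data_files` is hypothetical and not part of the repository.

```python
from pathlib import Path

# File names taken from the directory structure above.
REQUIRED_FILES = ("sample_submission.csv", "test_clean.csv", "train_clean.csv")


def missing_data_files(repo_root="."):
    """Return the required CSV files that are absent from <repo_root>/data."""
    data_dir = Path(repo_root) / "data"
    return [name for name in REQUIRED_FILES if not (data_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_data_files(".")
    if missing:
        print("Missing from data/:", ", ".join(missing))
    else:
        print("All required data files are present.")
```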

Required hardware

The code ran without issues in Kaggle Notebooks, which provide:

  • Ubuntu 18.04.4 LTS (Bionic Beaver)
  • 4 cores: Intel(R) Xeon(R) CPU @ 2.30GHz
  • around 16 GB available RAM

Installing requirements

We provide a requirements.txt file to install the dependencies through pip. The minimal requirements are pandas, numpy, scipy and scikit-learn (the last one only for its f1_score function). All other packages are needed only for the notebooks.

Producing our submission

We provide a main.py script. It iteratively improves the submission and writes the results to the output/ directory. In each iteration, it removes the power line interference (for which it uses "out-of-fold" predictions) and fits an HMM on the train and test set (in batches of 100K). In the first iteration, the power line interference removal is skipped, since no predictions are available yet.
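
The control flow described above can be sketched roughly as follows. This is not the actual main.py: `run_iterations`, `fit_predict` and `remove_interference` are hypothetical names, the two callables stand in for the HMM fit and the interference removal, and the out-of-fold logic is left out for brevity.

```python
import numpy as np


def run_iterations(signal, fit_predict, remove_interference,
                   n_iterations=3, batch_size=100_000):
    """Alternate between cleaning the signal and re-predicting open channels."""
    predictions = None
    for _ in range(n_iterations):
        if predictions is None:
            # First pass: no predictions yet, so interference removal is skipped.
            cleaned = signal
        else:
            # Later passes: use the previous predictions to clean the raw signal.
            cleaned = remove_interference(signal, predictions)
        # Fit and predict on each batch independently, then stitch the results.
        batches = [
            fit_predict(cleaned[start:start + batch_size])
            for start in range(0, len(cleaned), batch_size)
        ]
        predictions = np.concatenate(batches)
    return predictions
```

Passing the two steps as callables keeps the sketch independent of how the HMM and the interference removal are actually implemented.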

The following can be copy-pasted to a Kaggle notebook (or Google Colab):

!git clone https://github.com/GillesVandewiele/Liverpool-Ion-Switching.git
!mkdir Liverpool-Ion-Switching/data
!cp ../input/data-without-drift/train_clean.csv Liverpool-Ion-Switching/data/train_clean.csv
!cp ../input/data-without-drift/test_clean.csv Liverpool-Ion-Switching/data/test_clean.csv
!cp ../input/liverpool-ion-switching/sample_submission.csv Liverpool-Ion-Switching/data/sample_submission.csv
!cd Liverpool-Ion-Switching; python3 main.py

Notebooks

We provide notebooks (numbered 1 to 7 in the notebooks/ folder) that elaborate on each of the 7 significant steps in our approach, in order:

We align the signal and channels. A simple baseline that just rounds the signal values scores an F1 of 0.9211271639823664.
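
A minimal sketch of such a rounding baseline is shown below. It assumes the signal has already been aligned so that its value is close to the number of open channels, that the number of open channels lies between 0 and 10, and that the reported F1 is the macro-averaged F1 used by the competition. The function names are hypothetical.

```python
import numpy as np


def round_baseline(signal, max_channels=10):
    """Predict open channels by rounding the aligned signal to the nearest integer."""
    return np.clip(np.rint(signal), 0, max_channels).astype(int)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores over all classes that occur."""
    scores = []
    for label in np.union1d(y_true, y_pred):
        tp = np.sum((y_true == label) & (y_pred == label))
        fp = np.sum((y_true != label) & (y_pred == label))
        fn = np.sum((y_true == label) & (y_pred != label))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))


if __name__ == "__main__":
    truth = np.array([0, 1, 1, 2])
    aligned_signal = np.array([0.2, 0.8, 1.6, 2.1])
    print(macro_f1(truth, round_baseline(aligned_signal)))
```

scikit-learn's `f1_score(y_true, y_pred, average="macro")` should give the same value and is the more convenient choice when scikit-learn is installed.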

We remove 50 Hz power line interference. This slightly improves the F1 of our baseline to 0.9250468415867467.
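
One simple way to remove a fixed-frequency hum is to fit a 50 Hz sinusoid to the residual (signal minus predicted open channels) by least squares and subtract it. The sketch below illustrates that idea only; it is not necessarily the method used in the notebook. The 10 kHz sample rate and the name `remove_power_line` are assumptions.

```python
import numpy as np


def remove_power_line(signal, predicted_channels, freq=50.0, sample_rate=10_000.0):
    """Subtract the best-fitting sinusoid of frequency `freq` from the signal.

    The sinusoid is fitted to the residual, so that the open-channel steps
    do not disturb the fit. `sample_rate` is an assumed value in Hz.
    """
    t = np.arange(len(signal)) / sample_rate
    residual = signal - predicted_channels
    # A sine with unknown phase is a linear combination of sin and cos.
    basis = np.column_stack([
        np.sin(2 * np.pi * freq * t),
        np.cos(2 * np.pi * freq * t),
    ])
    coef, *_ = np.linalg.lstsq(basis, residual, rcond=None)
    return signal - basis @ coef
```

In practice the amplitude and phase of the interference may drift, so a fit like this would typically be applied per batch rather than once over the whole recording.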

We show how a naive Hidden Markov Model approach already increases the F1 significantly. The category 3 F1 score of our baseline approach is 0.9738199736256037, while a Hidden Markov Model with 4 hidden states scores an F1 of 0.9840563515575094.
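
To illustrate how an HMM decodes such a signal, here is a minimal Viterbi sketch with Gaussian emissions. It assumes one hidden state per open-channel count (so 4 states for 0 to 3 open channels) and a known transition matrix and noise level; the notebook may use a different decoder and parameterisation. All names are hypothetical.

```python
import numpy as np


def viterbi(signal, p_tran, means, sigma, p_init=None):
    """Return the most likely hidden-state path for a Gaussian-emission HMM."""
    n_states = len(means)
    if p_init is None:
        p_init = np.full(n_states, 1.0 / n_states)
    # Gaussian log-likelihoods, up to a constant shared by all states.
    log_emit = -0.5 * ((signal[:, None] - means[None, :]) / sigma) ** 2
    with np.errstate(divide="ignore"):  # log(0) = -inf is fine here
        log_tran = np.log(p_tran)
        score = np.log(p_init) + log_emit[0]
    n_steps = len(signal)
    back = np.zeros((n_steps, n_states), dtype=int)
    for t in range(1, n_steps):
        # cand[i, j]: best score of a path ending in state i, then moving to j.
        cand = score[:, None] + log_tran
        back[t] = np.argmax(cand, axis=0)
        score = cand[back[t], np.arange(n_states)] + log_emit[t]
    # Trace the best path backwards from the best final state.
    path = np.empty(n_steps, dtype=int)
    path[-1] = np.argmax(score)
    for t in range(n_steps - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path


if __name__ == "__main__":
    p_tran = np.full((4, 4), 0.03) + np.eye(4) * 0.88  # sticky, rows sum to 1
    means = np.array([0.0, 1.0, 2.0, 3.0])
    signal = np.array([0.1, 0.0, 1.1, 0.9, 2.9, 3.1, 2.0])
    print(viterbi(signal, p_tran, means, sigma=0.2))
```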

We show how you could go about tuning the transition matrix Ptran, which is needed for the further steps.
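
As a possible starting point for such tuning, a transition matrix can be estimated by counting transitions in a labelled state sequence. This is the standard maximum-likelihood estimate for a fully observed Markov chain, shown here as an assumption-laden sketch and not as the procedure followed in the notebook; it only applies when each sample can be assigned to a state.

```python
import numpy as np


def estimate_transition_matrix(states, n_states):
    """Row-normalised counts of transitions between consecutive states."""
    states = np.asarray(states)
    counts = np.zeros((n_states, n_states))
    # np.add.at accumulates correctly when the same (from, to) pair repeats.
    np.add.at(counts, (states[:-1], states[1:]), 1)
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows of states that never occur as a source are left as all zeros.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)


if __name__ == "__main__":
    print(estimate_transition_matrix([0, 0, 1, 1, 1, 0], n_states=2))
```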

We show that, by assuming K independent binary Markov processes for data that goes up to K open channels, we can significantly increase the F1. We show this for category 3 of our data, where we expand the 4x4 transition matrix used to model category 2 to a 20x20 matrix. The achieved F1 score, using only category 3 data, is 0.9866748988341756.
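
The 20 states arise because K = 3 interchangeable channels with 4 hidden states each can be described by an unordered combination of 3 states, and there are 20 such combinations. A minimal sketch of this expansion follows; the function names and the choice of which hidden states count as open are assumptions made for illustration.

```python
import itertools

import numpy as np


def expand_transition_matrix(p_single, k):
    """Combine k independent, identical chains into one chain over multisets.

    Each combined state is a sorted tuple holding the hidden state of
    every channel; the order of the channels is ignored.
    """
    n = p_single.shape[0]
    states = list(itertools.combinations_with_replacement(range(n), k))
    index = {state: i for i, state in enumerate(states)}
    p_big = np.zeros((len(states), len(states)))
    for src in states:
        # Enumerate every joint move of the k channels and add its probability
        # to the multiset it ends up in.
        for dst in itertools.product(range(n), repeat=k):
            prob = np.prod([p_single[a, b] for a, b in zip(src, dst)])
            p_big[index[src], index[tuple(sorted(dst))]] += prob
    return states, p_big


def channels_per_state(states, open_states):
    """Number of open channels for each combined state."""
    return np.array([sum(s in open_states for s in state) for state in states])


if __name__ == "__main__":
    p_single = np.array([
        [0.90, 0.10, 0.00, 0.00],
        [0.05, 0.90, 0.05, 0.00],
        [0.00, 0.05, 0.90, 0.05],
        [0.00, 0.00, 0.10, 0.90],
    ])  # made-up values, for illustration only
    states, p_big = expand_transition_matrix(p_single, k=3)
    print(p_big.shape)
    print(channels_per_state(states, open_states={3}))
```

Because the channels are assumed identical and independent, every row of the expanded matrix still sums to 1.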

We adapt the forward-backward algorithm to work both faster and slightly more accurately. The F1 score on category 3 data is 0.986794704167445. The impact, in terms of F1 score, is more significant for categories 4 and 5 of the data (which were the most important ones).
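
For reference, the textbook forward-backward recursion with per-step normalisation can be sketched as below. This is the standard algorithm with Gaussian emissions, not the team's adapted variant, and the function name and parameters are hypothetical.

```python
import numpy as np


def forward_backward(signal, p_tran, means, sigma, p_init=None):
    """Return posterior state probabilities, one row per sample."""
    n_steps, n_states = len(signal), len(means)
    if p_init is None:
        p_init = np.full(n_states, 1.0 / n_states)
    # Gaussian likelihoods, up to a constant shared by all states.
    emit = np.exp(-0.5 * ((signal[:, None] - means[None, :]) / sigma) ** 2)
    alpha = np.empty((n_steps, n_states))
    beta = np.empty((n_steps, n_states))
    # Forward pass; normalising each step avoids numerical underflow.
    a = p_init * emit[0]
    alpha[0] = a / a.sum()
    for t in range(1, n_steps):
        a = (alpha[t - 1] @ p_tran) * emit[t]
        alpha[t] = a / a.sum()
    # Backward pass, normalised in the same way.
    beta[-1] = 1.0
    for t in range(n_steps - 2, -1, -1):
        b = p_tran @ (emit[t + 1] * beta[t + 1])
        beta[t] = b / b.sum()
    # The scaling constants cancel when each row is normalised.
    posterior = alpha * beta
    return posterior / posterior.sum(axis=1, keepdims=True)
```

A sample that lies very far from every state mean can still make all likelihoods underflow to zero; working in log space is the usual remedy for that case.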

We convert the posterior probabilities (there are more probabilities than classes) to a continuous value by taking the dot product between the probabilities and the number of open channels to which each hidden state corresponds. We then learn thresholds, again in an unsupervised manner, to convert these continuous values to a discrete number of open channels. This increases the F1 for our category 3 data to 0.9869704508621362.
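
The conversion from posteriors to a continuous value, and from there to a discrete prediction, can be sketched as follows. The state-to-channel mapping and the midpoint thresholds in the example are placeholders; how the thresholds are learned is covered in the notebook and not reproduced here.

```python
import numpy as np


def expected_open_channels(posterior, channels_per_state):
    """Dot product of each posterior row with the open-channel count per state."""
    return posterior @ channels_per_state


def apply_thresholds(values, thresholds):
    """Map continuous values to classes; thresholds must be sorted ascending."""
    return np.digitize(values, thresholds)


if __name__ == "__main__":
    # Five hidden states that map to 0, 0, 1, 1 and 2 open channels (made up).
    channels_per_state = np.array([0, 0, 1, 1, 2])
    posterior = np.array([
        [0.6, 0.3, 0.1, 0.0, 0.0],
        [0.0, 0.1, 0.2, 0.3, 0.4],
        [0.0, 0.0, 0.0, 0.1, 0.9],
    ])
    values = expected_open_channels(posterior, channels_per_state)
    print(values)
    print(apply_thresholds(values, thresholds=[0.5, 1.5]))
```

With midpoint thresholds this reduces to rounding the expected value; learned thresholds can shift the class boundaries away from the midpoints.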

Contributing

We welcome any kind of contribution, whether that is cleaning up some of the code, extra documentation, or anything else. Please feel free to open a pull request!