MSc_bioinformatics_thesis

Implementation of deep clustering models for metabolomics data.

License: GPLv3 | Made with LaTeX | Made with Python 3.9 | TensorFlow 2.9.1 | Keras 2.9.0

This repository contains my implementation of some deep clustering models I wrote for my MSc thesis. It also contains the code written to train and evaluate the models on multiple datasets. The original thesis report is in the thesis_report folder; it is written in Catalan (I plan to translate it into English, but I have not set a deadline).

The original objective of the thesis was to implement a VAE-based deep clustering model, apply it to metabolomics data, and compare the results with more established techniques. I expected that the resulting clusters would lend themselves to some biological interpretation.

The VAE-based model did not perform well, which prompted me to try other models, also based on the autoencoder architecture. The deep learning models I implemented are variations of the VAE, DEC and VaDE architectures (see the Abstract).

All the models were implemented in Python, using Keras with TensorFlow. For the training process, I used the virtual machines provided by Paperspace Gradient (paid subscription).

File structure

  • models.py: Python module that contains my implementation of all the deep clustering models.
  • draw_embeddings.py: Python module that contains functions to draw graphical representations of the embeddings and cluster assignments.
  • clustering_metrics.py: Python module that contains functions to evaluate the performance of the models.
  • thesis_report: Folder that contains my full thesis report, both the PDF file and the LaTeX source.
  • MNIST: Folder that contains the Jupyter Notebooks I wrote to train and evaluate the models on the MNIST data set. It also contains the metrics and cluster assignments as CSV files.
  • ExposomeChallenge: Same as above, for the Exposome Data Challenge Event data set.
  • PrivateDataset: Same as above, for the DCH-NG data set.
  • _learning_keras: Folder that contains some Jupyter Notebooks I wrote while learning to use Keras and TensorFlow.

Required software

The DNNs provided here are implemented in Python, using Keras on top of TensorFlow.

The models defined in the module models.py require the following Python version and packages:

  • python >= 3.9
  • tensorflow >= 2.9.1
  • keras >= 2.9.0
  • numpy

To reproduce the provided notebooks, you will also need:

  • matplotlib
  • numpy
  • pandas
  • scikit-learn
  • scipy
  • seaborn

Abstract

I implemented several deep clustering models based on the autoencoder architecture with the aim of evaluating their performance on metabolomics datasets. Using the MNIST dataset and two metabolomics datasets, I evaluated the performance of several variations of the VAE, DEC and VaDE architectures, using internal and external validation metrics to measure clustering quality. I compared the results with more established methods such as K-means, GMM and agglomerative clustering. I found that the VAE architecture is not conducive to good clustering quality. The clusters obtained with the DEC, VaDE and established techniques show a high level of overlap with each other, but score poorly on the validation metrics. The DEC model outperforms the rest on the internal validation metric, but is very sensitive to the initialization parameters. The VaDE model achieves results similar to those of the other techniques, and has the added value of generative capacity, which could be used for artificial data augmentation. The multivariate distribution of the covariates (as well as that of the most variable metabolites) differs across the clusters obtained, although the pattern is not clear-cut. This suggests a possible biological interpretation of the clusters, but further study is needed to draw conclusions.
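
The overlap between the cluster assignments of two techniques, and the agreement with known labels where they exist (as in MNIST), can be measured with an external index. The sketch below uses the adjusted Rand index as one possible choice; this is an assumption for illustration, since the abstract does not name the metrics used, and the function `adjusted_rand_index` is hypothetical rather than taken from clustering_metrics.py. It is written in plain Python so that it runs without extra packages; scikit-learn offers an equivalent in `sklearn.metrics.adjusted_rand_score`.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Agreement between two labelings of the same samples.

    1.0 means identical partitions (label names do not matter),
    values near 0.0 mean chance-level agreement.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same samples")

    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two samples: nothing to disagree on

    # Pairs of samples placed together by both labelings (contingency cells).
    together_in_both = sum(
        comb(count, 2) for count in Counter(zip(labels_a, labels_b)).values()
    )
    # Pairs placed together by each labeling on its own (row/column sums).
    together_in_a = sum(comb(count, 2) for count in Counter(labels_a).values())
    together_in_b = sum(comb(count, 2) for count in Counter(labels_b).values())

    expected = together_in_a * together_in_b / total_pairs
    maximum = (together_in_a + together_in_b) / 2
    if maximum == expected:
        # Both labelings are trivial (one cluster, or all singletons).
        return 1.0
    return (together_in_both - expected) / (maximum - expected)


if __name__ == "__main__":
    # Hypothetical assignments from two methods on six samples.
    method_one = [0, 0, 0, 1, 1, 1]
    method_two = [1, 1, 0, 0, 0, 0]
    print(adjusted_rand_index(method_one, method_two))
```

Because the index is corrected for chance and ignores how clusters are numbered, it can compare assignments from any two techniques directly.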