ML project that aims to predict the oxygen level condition of cancer cells based on gene expression patterns.

Hypoxia Prediction from scRNA-seq Data

The project was conducted by Group 3: Beatrice Citterio, Irene Colombo, Giovanni De Muri, Mattia Martino, and Sandro Mikautadze.

Introduction

Motivation

Cancer is a complex and heterogeneous disease: different cells can show different profiles, ranging from deformations in cellular morphology to varying gene expression patterns. Although treatments are available for different types of mutations, we still need a better understanding of the environment of cancer cells, which can significantly influence their behavior and response to treatment. Oxygen availability plays a crucial role, because exposure to different oxygen levels can change gene expression patterns. For example, cells exposed to low oxygen levels, known as hypoxic cells, can exhibit altered gene expression that contributes to tumor progression and treatment resistance. In this stressed environment, cells activate genes that may help them survive and become unresponsive to treatment.

This environmental condition further complicates the diagnosis and treatment of the disease, as it can lead to different treatment outcomes for the same type of cancer. In this context, being able to analyze the gene expression of cancer cells to learn about their environment can greatly help in identifying the most aggressive and resistant cells. Hence, developing methods that accurately identify and characterize the environment of cancer cells is critical to improving cancer diagnosis and treatment.

The Research Question

The research question we aim to address is the following: can we determine from a cell's gene expression whether it was exposed to an environment with low or high oxygen levels?

In particular, our goal is to predict hypoxia or normoxia status from single-cancer-cell RNA-seq data.

Materials

To answer this question, we work in a simplified setting. We have at our disposal single-cancer-cell RNA-sequencing (scRNA-seq) data from breast cancer. The cells we analyze were originally derived from tumors of female patients and were then grown in two cultures: MCF7 and HCC1806. The culture corresponding to the MCF7 cell line was given estrogen, while the one corresponding to HCC1806 was given other growth factors. We assume that the resulting cell lines are a good representation of the tumor cells.

In each dataset, the samples correspond to cells and the features to sequenced genes. We work with two sequencing techniques: SmartSeq and DropSeq. For each technique and each cell line we have the following data:

  • SmartSeq for MCF7 and HCC1806:
    • Metadata, with information about the samples
    • Unfiltered data: the raw sequencing data, with neither filtering nor normalization
    • Filtered and normalized data: the preprocessed data, ready for training
  • DropSeq for MCF7 and HCC1806:
    • Filtered and normalized data

In addition, there is a test set for each culture and each technique, which is used for evaluation.

How To Read the Project

This section lists the order in which the files should be read to understand our choices, so please read it carefully. What follows is the exact file order, with a brief explanation of the content of each file:

  • 00-SmartSeq folder:

    1. HCC1806_unfiltered.ipynb: it contains the exploratory data analysis, data preprocessing, and unsupervised learning of the unfiltered data of the HCC1806 cell line.

    2. HCC1806_train.ipynb: it contains a brief EDA and unsupervised learning section, supervised learning for classification, and analysis of the results for the HCC1806 cell line.

    3. MCF7_unfiltered.ipynb: same as HCC1806_unfiltered.ipynb, but for the MCF7 cell line.

    4. MCF7_train.ipynb: same as HCC1806_train.ipynb, but for the MCF7 cell line.

All the techniques and choices are carefully motivated and explained in the two HCC1806 notebooks. Therefore, for the MCF7 cell line, we kept our comments much more compact and did not explain our decisions again.

  • 01-DropSeq folder:

    1. HCC1806_dropseq.ipynb: EDA and training for the HCC1806 cell line.
    2. MCF7_dropseq.ipynb: same as HCC1806_dropseq.ipynb, but for the MCF7 cell line.

The additional-data folder contains data and pictures used during the project.

The original data used for the project are private and will not be published.

We stress once again the importance of reading the files in the order presented above. Reading them in a different order will very likely confuse the reader.

Why Such a Structure?

On the one hand, we understand that this structure might make it slower to read and understand the project as a whole. On the other hand, we believe it is the setup that best reflects our work: we put a lot of effort into doing the best we could despite the time and domain-knowledge constraints, while keeping the comments descriptive but not excessively long.

In addition, note that we did not train on the data we processed ourselves from the unfiltered files; all supervised learning was done on the given training set. We made this choice to avoid lengthening the project and to keep the results consistent across both sequencing techniques.

Brief Comments on the Results

Before reading this, we advise you to see the rest of the project first.

First of all, let's compare the cell lines. In general, after analyzing and training on all our datasets, we conclude that supervised methods perform better than unsupervised ones.

For SmartSeq, the best-performing methods for HCC1806 are Random Forest (98.9% accuracy) and Logistic Regression (98.4% accuracy), while for MCF7 they are Logistic Regression (100% accuracy) and Linear SVM (100% accuracy). Overall, classification on MCF7 has higher accuracy with all methods. This could be because the MCF7 dataset is bigger, more balanced between hypoxic and normoxic cells, and less sparse.

If we compare the features that we found to be most relevant in the two cell lines (76 for MCF7 and 98 for HCC1806), there is little similarity in the genes themselves: only 16 genes are shared. However, pathway analysis shows that glycolysis, hypoxia, and mTORC1 are over-represented pathways in both cell lines.
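To put the gene overlap above in perspective, the counts from the text can be turned into a Jaccard index (shared genes divided by the size of the union). This is only an illustrative sketch: the function name `jaccard_from_counts` is our own, and the Jaccard index is not a measure reported in the notebooks.

```python
def jaccard_from_counts(size_a: int, size_b: int, shared: int) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| computed from set sizes alone."""
    union = size_a + size_b - shared
    if union == 0:
        raise ValueError("both sets are empty")
    return shared / union


# Counts taken from the text: 76 relevant genes for MCF7,
# 98 for HCC1806, and 16 genes shared between the two lists.
overlap = jaccard_from_counts(76, 98, 16)

print(f"Jaccard index: {overlap:.3f}")
print(f"Share of MCF7 genes also found in HCC1806: {16 / 76:.3f}")
print(f"Share of HCC1806 genes also found in MCF7: {16 / 98:.3f}")
```

With these counts the index is roughly 0.10, which is consistent with the observation that the two gene lists have little in common even though the same pathways are over-represented.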

For DropSeq, Linear SVM performs best on MCF7 (97.7% accuracy), while on HCC1806 Linear SVM and SVM perform best (95.4% accuracy).

Notice that we have decided not to perform pathway analysis for the DropSeq dataset, since as a sequencing technology it is less precise than SmartSeq.

When comparing the two sequencing technologies, we see that, overall, performance on DropSeq is lower than on SmartSeq. This may be due to sparsity and to the noise that may still be present in the dataset. Moreover, note that hypoxic and normoxic cells are not balanced in the DropSeq data.
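Two of the dataset properties mentioned above, sparsity and class balance, are easy to quantify. The following is a minimal sketch on hypothetical toy data (the matrix, the labels, and the function names are invented for illustration and are not taken from the project), assuming a cells-by-genes count matrix stored as a list of rows.

```python
from collections import Counter


def sparsity(matrix):
    """Fraction of entries equal to zero in a cells-by-genes count matrix."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total


def class_balance(labels):
    """Fraction of cells belonging to each class."""
    counts = Counter(labels)
    n = len(labels)
    return {label: count / n for label, count in counts.items()}


# Hypothetical toy data: 4 cells x 5 genes, NOT from the project datasets.
toy_counts = [
    [0, 3, 0, 0, 1],
    [0, 0, 0, 2, 0],
    [5, 0, 0, 0, 0],
    [0, 1, 0, 0, 4],
]
toy_labels = ["hypoxia", "hypoxia", "hypoxia", "normoxia"]

print("sparsity:", sparsity(toy_counts))
print("class balance:", class_balance(toy_labels))
```

A high sparsity value means most gene counts are zero, and a class balance far from an even split means that accuracy alone can be misleading, which is one reason to also look at precision, recall, and F1.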

Overall, Linear SVM seems to give the best classification.

Test Results

After evaluating performance and interpretability, we selected the following models: Linear SVM for MCF7 (both SmartSeq and DropSeq) and for HCC1806 DropSeq, and Random Forest for HCC1806 SmartSeq.

Using the aforementioned models, we checked our performance on the test sets. The results are reported as confusion matrices with the following layout, together with the associated scores:

Actual \ Predicted    Negative          Positive
Negative              True Negative     False Positive
Positive              False Negative    True Positive

HCC1806 SmartSeq:

Actual \ Predicted    0       1
0                     25      0
1                     1       19

  • Accuracy: 0.98
  • Precision: 1.0
  • Recall: 0.95
  • F1: 0.97

HCC1806 DropSeq:

Actual \ Predicted    0       1
0                     1339    115
1                     115     2102

  • Accuracy: 0.94
  • Precision: 0.95
  • Recall: 0.95
  • F1: 0.95

MCF7 SmartSeq:

Actual \ Predicted    0       1
0                     32      0
1                     0       31

  • Accuracy: 1.0
  • Precision: 1.0
  • Recall: 1.0
  • F1: 1.0

MCF7 DropSeq:

Actual \ Predicted    0       1
0                     3146    74
1                     69      2117

  • Accuracy: 0.97
  • Precision: 0.97
  • Recall: 0.97
  • F1: 0.97
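
The scores listed above follow directly from the four confusion matrices. As a check, the sketch below recomputes them with the standard definitions, treating class 1 as the positive class (as in the layout shown earlier). The function name `scores` and the dictionary layout are our own choices for illustration, not code from the notebooks.

```python
def scores(tn, fp, fn, tp):
    """Accuracy, precision, recall and F1 from binary confusion-matrix cells."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Cells in the order (TN, FP, FN, TP), copied from the matrices above.
matrices = {
    "HCC1806 SmartSeq": (25, 0, 1, 19),
    "HCC1806 DropSeq": (1339, 115, 115, 2102),
    "MCF7 SmartSeq": (32, 0, 0, 31),
    "MCF7 DropSeq": (3146, 74, 69, 2117),
}

results = {name: scores(*cells) for name, cells in matrices.items()}

for name, result in results.items():
    rounded = {metric: round(value, 2) for metric, value in result.items()}
    print(name, rounded)
```

Rounded to two decimals, the recomputed values agree with the scores reported for each dataset.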

In conclusion, all of our models performed well on the test sets, with accuracy of at least 0.94, and the classifier for the MCF7 SmartSeq dataset made no errors.