/latent-space-discovery

A novel self-supervised adversarial autoencoder is proposed to extract biologically relevant genes from cancer transcriptomes.

Primary language: Python. License: GNU Affero General Public License v3.0 (AGPL-3.0)

Extracting Biologically Relevant Genes using AFExNet from Cancer Transcriptomes [Paper]

License: CC BY 4.0

In this project, we introduce a neural-network-based adversarial autoencoder (AAE) model to extract biologically relevant features from RNA-Seq data. We also developed a method named TopGene to find highly interactive genes from the latent space. AFExNet, in combination with the TopGene method, finds important genes that could be useful for identifying cancer biomarkers.


Getting Started

The following instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

The following libraries are required to reproduce this project:

  1. Keras (2.0.6)

  2. Keras-adversarial (0.0.3)

  3. Tensorflow (1.13.1)

  4. Scikit-Learn (0.20.3)

  5. Numpy (1.16.3)

  6. Imbalanced-Learn (0.4.3)

Supports both Python 2.5.0 and Python 3.5.6
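
Because the versions above are pinned exactly, it can help to compare them against what is actually installed before running anything. The following is a minimal sketch, not part of the repository: the helper names (`parse_freeze`, `find_mismatches`) are hypothetical, and the distribution names used as keys are assumptions about how each library appears in `pip freeze` output.

```python
# Pinned versions copied from the prerequisites list above.
# The keys are ASSUMED distribution names; adjust them if yours differ.
REQUIRED = {
    "Keras": "2.0.6",
    "keras-adversarial": "0.0.3",
    "tensorflow": "1.13.1",
    "scikit-learn": "0.20.3",
    "numpy": "1.16.3",
    "imbalanced-learn": "0.4.3",
}


def _normalize(name):
    """Compare names case-insensitively and treat '_' and '-' as equal."""
    return name.strip().lower().replace("_", "-")


def parse_freeze(text):
    """Parse `pip freeze` style lines ("name==version") into a dict."""
    installed = {}
    for line in text.splitlines():
        line = line.strip()
        if "==" in line and not line.startswith("#"):
            name, version = line.split("==", 1)
            installed[_normalize(name)] = version.strip()
    return installed


def find_mismatches(required, installed):
    """Return {name: (wanted, found)} for missing or differing packages.

    A package that is not installed is reported with found=None.
    """
    mismatches = {}
    for name, wanted in required.items():
        found = installed.get(_normalize(name))
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches
```

One way to use it is to save the output of `pip freeze` to a file, read that file as text, and pass it through `parse_freeze` and then `find_mismatches(REQUIRED, ...)`; an empty result means every pinned version matches.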

Directory Layout

├── results
│   ├── saved_results
│   │   ├── Gene_Analysis_Breast_Cancer.xlsx
│   │   ├── Gene_Analysis_UCEC.xlsx
│   ├── AAE
│   │   ├── aae_encoded.tsv
│   │   ├── aae_sorted_gene.tsv
│   │   ├── aae_weight_distribution.png
│   │   ├── aae_weight_matrix
│   ├── PCA
│   ├── ... # add LDA, SVD etc
├── data
│   ├── data will be stored here
├── feature_extraction
│   ├── AAE
│   │   ├── aae_encoder.h5
│   │   ├── aae_decoder.h5
│   │   ├── aae_discriminator.h5
│   │   ├── aae_history.csv
│   ├── PCA
│   ├── VAE
│   ├── ...
├── README.md
├── figures
│   ├── saved_figures
│   │   ├── Olfactory__Transduction_pathway.png
└── .gitignore
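
If the output folders are missing from a fresh checkout, the layout above can be recreated with a short script. This is only a sketch under assumptions: `create_layout` is a hypothetical helper, it creates just the directories the tree names explicitly (the `...` entries are left out), and whether `main.py` creates these folders itself is not stated here.

```python
import os

# Directories named explicitly in the layout above.
LAYOUT = [
    "results/saved_results",
    "results/AAE",
    "results/PCA",
    "data",
    "feature_extraction/AAE",
    "feature_extraction/PCA",
    "feature_extraction/VAE",
    "figures/saved_figures",
]


def create_layout(root):
    """Create any missing layout directories under `root`.

    Returns the list of directories that were newly created, so a second
    call on the same root returns an empty list.
    """
    created = []
    for rel in LAYOUT:
        path = os.path.join(root, *rel.split("/"))
        if not os.path.isdir(path):
            os.makedirs(path)
            created.append(path)
    return created
```

Calling `create_layout(".")` from the repository root leaves existing folders untouched and only adds the ones that are absent.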

Usage

Run the following to extract features using the different autoencoders:

main.py

To extract features with PCA, NMF, FastICA, ICA, RBM, etc., run:

main_pca.py

Gene ontology analysis of molecular function was performed using DAVID 6.7 (https://david-d.ncifcrf.gov/).

For more on gene ontology, see http://geneontology.org/docs/ontology-documentation/

Proposed Architecture

[Figure: weight_analysis_aae]

Datasets

Breast Invasive Carcinoma (BRCA)

| Molecular Subtype | Number of Patients | Label |
|---|---|---|
| Luminal A | 304 | 0 |
| Luminal B | 121 | 1 |
| Basal & Triple Negative | 137 | 2 |
| HER2-Enriched | 43 | 3 |

| Total Number of Samples (Patients) | Total Number of Features (Genes) |
|---|---|
| 605 | 20439 |

Validation Data

Uterine Corpus Endometrial Carcinoma (UCEC)

| Molecular Subtype | Number of Patients | Label |
|---|---|---|
| Copy Number High | 60 | 0 |
| Copy Number Low | 90 | 1 |
| Hyper Mutated (MSI) | 64 | 2 |
| Ultra Mutated (POLE) | 16 | 3 |

| Total Number of Samples (Patients) | Total Number of Features (Genes) |
|---|---|
| 230 | 20482 |
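
The subtype counts in both tables are uneven, which is worth quantifying before training a classifier on the extracted features. The sketch below simply recomputes the totals and class shares from the numbers listed above. `class_summary` is a hypothetical helper, not code from this repository, and it says nothing about how the project itself handles the imbalance.

```python
from __future__ import division  # true division on Python 2 as well

# label -> (subtype name, number of patients), copied from the tables above.
BRCA = {
    0: ("Luminal A", 304),
    1: ("Luminal B", 121),
    2: ("Basal & Triple Negative", 137),
    3: ("HER2-Enriched", 43),
}
UCEC = {
    0: ("Copy Number High", 60),
    1: ("Copy Number Low", 90),
    2: ("Hyper Mutated (MSI)", 64),
    3: ("Ultra Mutated (POLE)", 16),
}


def class_summary(dataset):
    """Return (total patients, {label: share}, largest/smallest class ratio)."""
    counts = {label: count for label, (_, count) in dataset.items()}
    total = sum(counts.values())
    shares = {label: count / total for label, count in counts.items()}
    imbalance_ratio = max(counts.values()) / min(counts.values())
    return total, shares, imbalance_ratio


if __name__ == "__main__":
    for name, dataset in (("BRCA", BRCA), ("UCEC", UCEC)):
        total, shares, ratio = class_summary(dataset)
        print(name, total, shares, round(ratio, 2))
```

The totals returned for each dataset match the sample counts reported in the tables (605 and 230), which is a quick consistency check on the per-subtype numbers.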

Contribution

If you want to contribute to this project and make it better, your help is very welcome. When contributing to this repository, please make a clean pull request.

Acknowledgments