This repository contains code and data for a deep learning (DL)-based computational framework designed to discover and validate prognostic biomarkers for renal cell carcinoma (RCC). Using this framework, we identified Phytanoyl-CoA Dioxygenase Domain Containing 1 (PHYHD1) as a novel tumor suppressor and potential biomarker for RCC, offering insights into RCC growth mechanisms and potential therapeutic targets.
Renal cell carcinoma (RCC) presents significant challenges in early diagnosis and has limited response to traditional treatments. This study explores the role of PHYHD1, identified as a promising prognostic biomarker via a DL-based framework that interprets complex tissue patterns. PHYHD1, predominantly expressed in normal tissues, was found to be downregulated in RCC tumors, suggesting its function as a tumor suppressor. Restoring PHYHD1 expression in RCC cells reduced tumor growth, revealing its potential as a therapeutic target. Further analyses identified Ceramide Transport Protein 1 (CERT1) as a downstream target of PHYHD1, forming the PHYHD1-CERT1 regulatory axis involved in RCC cell proliferation.
pan_cancer_α_KG_dependent_analysis/
├── datasets/
│ └── preprocess/
│ └── CPTAC_data_preprocess.py
│ ├── CPTAC_cancer_type_mapping.json
│ ├── CPTAC_data_raw.csv
│ ├── CPTAC_dataset.csv
│ ├── TCGA_cancer_type_mapping.json
│ └── TCGA_dataset.csv
├── notebooks/
│ ├── TCGA_analysis.ipynb
│ └── CPTAC_analysis.ipynb
└── saves/
├── TCGA_best_model.pth
└── CPTAC_best_model.pth
-
datasets/preprocess: Contains raw datasets and preprocessing scripts, including:
CPTAC_data_preprocess.py
: Preprocesses CPTAC data for DL model input.- JSON files: Mapping files for cancer types across datasets.
- CSV files: Raw and processed datasets for RCC analysis from CPTAC and TCGA datasets.
-
notebooks: Main analysis code in Jupyter notebooks.
TCGA_analysis.ipynb
: Notebook focusing on TCGA dataset for model training and evaluation.CPTAC_analysis.ipynb
: Notebook focusing on CPTAC dataset for model training and evaluation.
-
saves: Directory containing trained model files in
.pth
format for reproducibility and further analysis.
To replicate this study, install Anaconda and set up an environment with Python 3.9.
-
Create and activate the environment:
conda create -n cancer_analysis python=3.9 conda activate cancer_analysis
-
Install required packages:
# PyTorch and dependencies conda install pytorch torchvision torchaudio -c pytorch # Other required packages conda install shap matplotlib pandas numpy scikit-learn # Install Captum (for model interpretability) and pyensembl using pip pip install captum pyensembl # Jupyter for running notebooks conda install jupyter
-
Set up pyensembl (if genomic data is analyzed):
import pyensembl data = pyensembl.EnsemblRelease(104, species="homo_sapiens") data.download() data.index()
Navigate to the project directory and start Jupyter Notebook:
jupyter notebook
Open TCGA_analysis.ipynb
and CPTAC_analysis.ipynb
in the notebooks
folder to begin analysis.
-
Data Preprocessing:
- Ensure that CPTAC and TCGA dataset in the datasets folder.
-
Model Training and Analysis:
- Open
TCGA_analysis.ipynb
andCPTAC_analysis.ipynb
innotebooks
to perform pan-cancer analysis, including Caputum and SHAP analysis. - These notebooks will load or train models, saving the results in the
saves
directory.
- Open
Please credit this work if it assists in your research or academic projects.