DGP-AMIO: Disease Gene Predictor based on Attention mechanism and integration of Multi-source gene Interaction networks and Omics
This repository contains the code for 'Integration of multi-source gene interaction networks and omics data with graph attention networks to identify novel disease genes' (paper can be accessed here). DGP-AMIO is a general disease gene predictor based on graph attention networks (GAT) that integrates gene interaction networks from multiple databases with multi-omics data.
Our code environment is Python 3.6 on Ubuntu 20.04 with CUDA 11.5. The packages required to run DGP-AMIO are listed below:
- numpy==1.19.5
- pandas==1.1.5
- scikit-learn==0.24.2
- tqdm==4.63.1
- networkx==2.5.1
- torch==1.10.2+cu113
- torch_geometric==2.0.3
All of the packages can be installed through pip. Although not necessary, we strongly recommend GPU acceleration and conda for package management.
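To confirm that an existing environment matches these pins, a check along the following lines may help. This is a minimal sketch and not part of DGP-AMIO: the helper names `installed_version`, `parse_pin` and `check_pins` are hypothetical, and only a few pins from the list above are shown. It uses `importlib.metadata` where available and falls back to `pkg_resources` on Python 3.6.

```python
try:
    # Python 3.8+ ships importlib.metadata in the standard library.
    from importlib.metadata import PackageNotFoundError, version as _dist_version

    def installed_version(name):
        """Return the installed version of a distribution, or None if absent."""
        try:
            return _dist_version(name)
        except PackageNotFoundError:
            return None

except ImportError:
    # Python 3.6/3.7 fallback: pkg_resources is provided by setuptools.
    import pkg_resources

    def installed_version(name):
        """Return the installed version of a distribution, or None if absent."""
        try:
            return pkg_resources.get_distribution(name).version
        except pkg_resources.DistributionNotFound:
            return None


def parse_pin(pin):
    """Split a 'name==version' pin into (name, version)."""
    name, _, pinned = pin.partition("==")
    return name.strip(), pinned.strip()


def check_pins(pins):
    """Return (name, pinned, installed) for every pin that is not satisfied."""
    mismatches = []
    for pin in pins:
        name, pinned = parse_pin(pin)
        found = installed_version(name)
        if found != pinned:
            mismatches.append((name, pinned, found))
    return mismatches


if __name__ == "__main__":
    # A few pins copied from the list above; extend with the remaining ones.
    pins = ["numpy==1.19.5", "scikit-learn==0.24.2", "networkx==2.5.1"]
    for name, pinned, found in check_pins(pins):
        print(f"{name}: expected {pinned}, found {found or 'not installed'}")
```

An empty result from `check_pins` means every listed package is installed at exactly the pinned version.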
DGP-AMIO integrates different gene interaction networks, all of which can be downloaded from public databases. We provide the 10 preprocessed gene interaction networks we used in the graph folder. The preprocessing method is described in our paper.
Omics data as node (gene) features and known disease genes as labels are required to train DGP-AMIO. We mainly used gene expression data from GEO. Other omics data, such as DNA methylation and gene mutation, can also be integrated as long as the data is provided. Known disease genes were collected from MalaCards. We provide the preprocessed gene expression data (GSE143303 and GSE184942) and the disease genes for asthma and Alzheimer's disease in the diseases folder, so that you can run DGP-AMIO directly.
If you want to run DGP-AMIO for disease gene prediction, simply run:
python DGP-AMIO.py <disease name>
DGP-AMIO.py does the following:
- Preprocesses the data and integrates the gene interaction networks with omics data and labels
- Trains DGP-AMIO with 5-fold cross-validation and saves the trained models
- Evaluates DGP-AMIO's performance on the test set
- Loads the trained models to predict on all unknown genes and saves the results
<disease name>
is the disease for which you want to train DGP-AMIO and make predictions. Note that <disease name> must match the name of the folder that stores the corresponding disease data in the diseases directory, so that the program can read the data correctly.
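Because the argument has to match a folder name exactly, it can be worth checking it before starting a long training run. The following is a minimal sketch under stated assumptions: `available_diseases` and `check_disease_name` are hypothetical helpers that are not part of DGP-AMIO.py, and the default path `diseases` assumes the script is launched from the repository root.

```python
from pathlib import Path


def available_diseases(diseases_dir="diseases"):
    """Return the sorted names of the sub-folders of the diseases directory."""
    root = Path(diseases_dir)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir() if p.is_dir())


def check_disease_name(name, diseases_dir="diseases"):
    """Return the disease folder, or raise ValueError if no folder matches exactly."""
    names = available_diseases(diseases_dir)
    if name not in names:
        raise ValueError(
            f"No folder {diseases_dir}/{name}; available folders: {names}"
        )
    return Path(diseases_dir) / name
```

The comparison is an exact, case-sensitive string match, so a name that differs from the folder name only in capitalization is rejected.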
- If you want to integrate other gene interaction networks, organize them as edge lists of size n×2, where n is the number of directed edges (following the format of the provided gene network files), save the edge lists in CSV files named edges_<database name>.csv, and place them in the graphs folder. After this, remember to add <database name> to the list variable databases in DGP-AMIO.py.
- If you want to use your own expression data, organize the data in a CSV file named <disease name>_expression.csv, where the first column contains the gene names, and place it in the folder of the corresponding disease. The code can then run without any modification. If you want to use other omics data or integrate multiple omics simultaneously, replace the expression data with your own data and make the necessary modifications to the data-reading part of DGP-AMIO.py so that it runs properly.
- If you want to use other labeled data, just replace <disease name>_gene.csv with your own labeled data.
- If you want to train DGP-AMIO and make predictions for a new disease, create a diseases/<disease name> folder and place the prepared omics data and known disease genes in it.
- Note that when preparing your own data, genes in all your data files must be represented by gene symbols.
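The file-naming conventions above can be sketched with the standard library. This is an illustration under assumptions, not code from DGP-AMIO.py: the helpers `write_edge_list`, `expected_disease_files` and `missing_disease_files` are hypothetical, the output folder is passed as a parameter, and the edge list is written without a header row. Compare the result with one of the provided gene network files and adjust it if their layout differs.

```python
import csv
from pathlib import Path


def write_edge_list(edges, database_name, graphs_dir="graphs"):
    """Write directed (source, target) gene-symbol pairs to edges_<database name>.csv."""
    out_dir = Path(graphs_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"edges_{database_name}.csv"
    with out_path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        for source, target in edges:
            # One row per directed edge, giving an n x 2 table.
            writer.writerow([source, target])
    return out_path


def expected_disease_files(disease_name, diseases_dir="diseases"):
    """Return the two input files named in the text for one disease."""
    folder = Path(diseases_dir) / disease_name
    return [
        folder / f"{disease_name}_expression.csv",
        folder / f"{disease_name}_gene.csv",
    ]


def missing_disease_files(disease_name, diseases_dir="diseases"):
    """Return the expected input files that do not exist yet."""
    return [
        path
        for path in expected_disease_files(disease_name, diseases_dir)
        if not path.is_file()
    ]
```

For example, `write_edge_list([("TP53", "MDM2")], "mydb")` would create graphs/edges_mydb.csv with a single row, where `mydb` is a placeholder database name that would then need to be added to the databases list in DGP-AMIO.py. `missing_disease_files` returns an empty list once both input files for a disease are in place.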