BioNE: Integration of network embeddings for supervised learning

Overview

A network embedding approach reduces the complexity of analyzing large biological networks by converting high-dimensional networks to low-dimensional vector representations. These representations can then be used in machine learning prediction tasks such as link/association prediction. Several network embedding methods have been proposed, each taking a different approach to obtaining network features. We believe that, rather than developing a new network embedding method, integrating existing ones could offer complementary information about the network and, consequently, better performance in prediction tasks. BioNE is a pipeline that applies a range of network embedding methods after a network preparation step and integrates the resulting vector representations using three different techniques. In this framework we focus on the link prediction task.

The BioNE pipeline is divided into three steps:

1. Network Preparation
1.1. Convert Adjacency Matrix to Edge List
1.2. Heterogeneous Network Preparation
2. Network Embedding
3. Predictions Using the Integration of Embeddings

To install packages and create the necessary virtual environment, see the section Virtual Environment and Installing Packages.
The pipeline is tested on Drug-Target Interaction (DTI) data as a link prediction task. The scripts for this test are in the Example section.

Virtual Environment and Installing Packages

All analyses were written and tested in a virtual environment using Python 3.7. The software versions are listed below:

Python 3.7
virtualenv 20.4.0
ubuntu 20.04
nvidia-driver 460
cuda 10.0
cuDNN 7.4.2

To create the virtual environment:

cd BioNE-main
virtualenv --python=/usr/bin/python3.7 BioNEvenv

To activate and install required packages:

source BioNEvenv/bin/activate
pip install -r requirements.txt

Input file formats

The input to the Network Embedding step is a space-delimited edge list file. If the edge list is already in this format, users can start from the Network Embedding step. If the networks are in adjacency matrix format, section 1.1. Convert Adjacency Matrix to Edge List provides a command to convert them to edge lists. Adjacency matrices should be space-delimited and contain column names and row names. Click here to see a sample adjacency matrix.
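
As an illustration of the edge list format, the sketch below reads space-delimited lines into tuples. It is not part of BioNE: the function name `parse_edge_list`, the node names, and the assumption that an optional third column holds the edge attribute are all hypothetical.

```python
def parse_edge_list(text):
    """Parse space-delimited edge list text into (source, target, weight) tuples.

    Assumption: each non-empty line is "source target" or
    "source target weight"; a missing weight is reported as None.
    """
    edges = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) not in (2, 3):
            raise ValueError(f"expected 2 or 3 fields, got: {line!r}")
        weight = float(fields[2]) if len(fields) == 3 else None
        edges.append((fields[0], fields[1], weight))
    return edges


# Hypothetical node names, for illustration only.
sample = "drugA drugB 1\ndrugA drugC 1\n"
print(parse_edge_list(sample))
```

A check like this can catch malformed lines before an edge list is passed to the embedding step.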

1. Network Preparation

This part consists of two sections. Users can convert adjacency matrices to edge list files in section 1.1. Convert Adjacency Matrix to Edge List. When required, users can also combine two edge list files into a heterogeneous network using the command provided in section 1.2. Heterogeneous Network Preparation.

1.1. Convert Adjacency Matrix to Edge List

In order to conduct network embedding, adjacency matrices should be converted to an edge list file format.

python3 scripts/mat2edgelist.py --input input.txt --directed --keepzero --attribute --output output.txt

Arguments:

input The filepath of the adjacency matrix
The input adjacency matrix file should be space-delimited and contain row and column index labels.
Click here to see a sample file.

directed Treat the graph as directed
When directed, row indexes are source nodes and column indexes are target nodes.

keepzero Add negative associations (0s) to the output

attribute Include the edge attributes in the output file
If edge attributes are not going to be used as weights in network embedding, omitting this flag is recommended to save memory.

output The filepath for the output edge list file
The file will be saved as a space-delimited file. Click here to see a sample edge list file.
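
The effect of the flags above can be pictured with a small pure-Python sketch. This is not the code in scripts/mat2edgelist.py: the function `matrix_to_edges` and the sample matrix are hypothetical, and the handling of undirected graphs (keeping each node pair once, in first-seen order) is an assumption.

```python
def matrix_to_edges(text, directed=False, keepzero=False, attribute=False):
    """Convert a labelled, space-delimited adjacency matrix to an edge list."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if not rows:
        return []
    header = rows[0]
    # Assumption: the header holds column labels only; drop a corner label if present.
    if len(rows) > 1 and len(header) == len(rows[1]):
        header = header[1:]
    edges, seen = [], set()
    for row in rows[1:]:
        source, values = row[0], row[1:]  # rows are sources, columns are targets
        for target, value in zip(header, values):
            weight = float(value)
            if weight == 0 and not keepzero:
                continue  # drop negative associations unless keepzero is set
            if not directed:
                pair = frozenset((source, target))
                if pair in seen:
                    continue  # undirected: keep each node pair only once
                seen.add(pair)
            edges.append((source, target, weight) if attribute else (source, target))
    return edges


sample = "\n".join(["d1 d2 d3", "d1 0 1 0", "d2 1 0 1", "d3 0 1 0"])
print(matrix_to_edges(sample))
print(matrix_to_edges(sample, directed=True, attribute=True))
```

With `keepzero` and `attribute` both set, every row/column pair is emitted together with its 0 or 1 value, which is the shape used for the annotation file in the Example section.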

1.2. Heterogeneous Network Preparation

When required, users can combine two edge lists (e.g. drug-drug and drug-disease networks) to construct a heterogeneous network using the command below. The command can be run repeatedly to combine more than two edge list files.

python3 scripts/merge_edgelist.py --input1 input1.txt --input2 input2.txt --rmduplicate --output output.txt

Arguments:

input1 The filepath of the first edge list file
This file should be a space-delimited edge list file. Click here to see a sample input file.

input2 The filepath of the second edge list file
This file should be a space-delimited edge list file. Click here to see a sample input file.

rmduplicate Remove duplicated edges

output The filepath for the output combined edge list file
The file will be saved as a space-delimited file. Click here to see a sample output file.
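
Conceptually, merging amounts to concatenating the two edge lists and optionally dropping repeats. The sketch below is a hypothetical illustration, not the code in scripts/merge_edgelist.py; in particular, it assumes that two edges are duplicates only when their fields match exactly and in the same order.

```python
def merge_edge_lists(edges1, edges2, rmduplicate=False):
    """Concatenate two edge lists, optionally keeping only the first copy of each edge."""
    merged = list(edges1) + list(edges2)
    if not rmduplicate:
        return merged
    seen, unique = set(), []
    for edge in merged:
        if edge not in seen:  # assumption: exact tuple match defines a duplicate
            seen.add(edge)
            unique.append(edge)
    return unique


# Hypothetical drug-drug and drug-disease edges.
drug_drug = [("d1", "d2"), ("d2", "d3")]
drug_disease = [("d1", "dis1"), ("d2", "d3")]
print(merge_edge_lists(drug_drug, drug_disease, rmduplicate=True))
```

Because the output has the same format as the inputs, the result can be merged again with a third edge list.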

2. Network Embedding

Network embedding methods convert high-dimensional data to low-dimensional vector representations. BioNE supports the following embedding methods:
LINE, GraRep, SDNE, LLE, HOPE, LaplacianEigenmaps (Lap), node2vec, DeepWalk and GF.

python3 scripts/embedding.py --method lle --input input.txt --directed --weighted --representation_size 128 --output output.txt

Arguments:

method: Network embedding method
Choices are:
line (parameters: epochs, order, negative_ratio)
grarep (parameters: kstep)
sdne (parameters: alpha, beta, nu1, nu2, bs, lr, epochs, encoder-list)
lle
hope
lap
node2vec (parameters: walk-length, number-walks, workers, p, q, window-size)
deepwalk (parameters: walk-length, number-walks, workers, window-size)
gf (parameters: epochs, lr, weight-decay)

Note: input, directed, weighted, random_state and representation_size are shared among all methods.

input: The filepath of the edge list file
This file should be a space-delimited edge list. Click here to see a sample input file.

directed: Treat the network as directed
There is no need to use this if you already specified this in section 1.1.

weighted: Treat the network as weighted
To use this, edge attributes should be included in the edge list file. Check attribute argument in section 1.1.

random_state: Seed used to fix the randomization
The default value is 1.

epochs: The number of times that the learning algorithm will work through the entire training data set
This parameter is used in line, sdne and gf. The default value is 5.

representation_size: Dimensionality of the output data
The default value is 128.

order: Choose the order of line
1 means first order, 2 means second order, 3 means first order + second order. The default value is 2.

negative_ratio: Negative sampling ratio
This parameter is used in line. The default is 5.

kstep: Use k-step transition probability matrix
This parameter is used in grarep. The default value is 2.

encoder-list: List of the number of neurons in each encoder layer of sdne
The last number is the dimension of the output embeddings. The default is [1000,128].

alpha: alpha is a hyperparameter in sdne
The default value is 1e-6.

beta: beta is a hyperparameter in sdne
The default value is 1e-5.

nu1: nu1 is a hyperparameter in sdne
The default value is 1e-5.

nu2: nu2 is a hyperparameter in sdne
The default value is 1e-4.

bs: batch size in sdne
Number of training samples utilized in one iteration. The default is 200.

lr: Learning rate in sdne and gf
The learning rate controls how quickly the model adapts to the problem. The default is 0.001.

walk-length: Length of the random walk started at each node
This parameter is used in node2vec and deepwalk. The default value is 20.

number-walks: Number of random walks to start at each node
This parameter is used in node2vec and deepwalk. The default value is 80.

workers: Number of parallel processes
This parameter is used in node2vec and deepwalk. The default value is 8.

p: Return hyperparameter in node2vec
The default value is 1.

q: In-out hyperparameter in node2vec
The default value is 1.

window-size: Window size of skipgram model in node2vec and deepwalk
The default value is 10.

weight-decay: Weight for L2 loss on embedding matrix in gf
The default value is 5e-4.

output: The filepath for the embedding results
The file is saved as a space-delimited file. Click here to see a sample output file.
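
To run several embedding methods over the same edge list, the command above can be generated in a loop. The sketch below only builds and prints the commands. The helper `embedding_command` and the file names are hypothetical, and it uses only flags shown in this section.

```python
import shlex


def embedding_command(method, edge_list, output, representation_size=128,
                      directed=False, weighted=False):
    """Build the argument list for one run of scripts/embedding.py."""
    args = ["python3", "scripts/embedding.py",
            "--method", method,
            "--input", edge_list,
            "--representation_size", str(representation_size)]
    if directed:
        args.append("--directed")
    if weighted:
        args.append("--weighted")
    args += ["--output", output]
    return args


for method in ("hope", "lap"):
    cmd = embedding_command(method, "edgelist.txt", f"{method}_20.txt",
                            representation_size=20)
    # shlex.quote keeps the printed command safe to paste into a shell.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Each returned list could also be passed to `subprocess.run(cmd, check=True)` from the repository root to execute the run directly.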

3. Predictions Using the Integration of Embeddings

For this section, we developed three integration methods (late fusion, early fusion and mixed fusion) to combine the embedding results from different methods. The aim is a more comprehensive representation of the networks and, in turn, better prediction performance.

python3 scripts/integration.py --fusion late --annotation annotation.txt --entity1-embeddings '["hope_x.txt","lap_x.txt"]' --entity2-embeddings '["hope_y.txt","lap_y.txt"]' --cv-type stratified --cv 10 --imbalance ADASYN --model '["RF"]' --output ./output
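
The list-valued arguments in the command above are written as bracketed lists inside single quotes. A hedged sketch for producing such strings from Python lists follows. The helpers `list_argument` and `embedding_arguments` are hypothetical, and treating the value as a JSON-style list is an assumption based only on the format shown here.

```python
import json
import shlex


def list_argument(paths):
    """Format a list of strings as '["a","b"]', quoted for a POSIX shell."""
    return shlex.quote(json.dumps(list(paths), separators=(",", ":")))


def embedding_arguments(entity1_paths, entity2_paths, fusion="late"):
    """Return both list arguments, checking the late-fusion length rule first."""
    if fusion == "late" and len(entity1_paths) != len(entity2_paths):
        raise ValueError("late fusion needs the same number of embeddings per entity")
    return list_argument(entity1_paths), list_argument(entity2_paths)


entity1, entity2 = embedding_arguments(["hope_x.txt", "lap_x.txt"],
                                       ["hope_y.txt", "lap_y.txt"])
print("--entity1-embeddings", entity1)
print("--entity2-embeddings", entity2)
```

The same formatting applies to the model argument, for example `list_argument(["RF"])`.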

Arguments:

fusion The integration type
Choices are:
early: Merge all embedding results before passing them to the prediction model.
late (default): Pass each embedding result to the prediction model separately, then sum the resulting prediction probabilities.
mix: Merge all embedding results, then sum the prediction probabilities obtained from different prediction models.

annotation The filepath of the annotation file
This file should contain two columns. The first and second columns of the annotation file hold entity1 and entity2, respectively. Click here to see a sample annotation file.

entity1-embeddings Filepaths of the embeddings containing the entities of the first column (entity1) in the annotation file
The file paths should be given in this format: '["deepwalk_drug.txt", "gf_drug.txt"]'.
When late fusion is applied, entity1-embeddings and entity2-embeddings must have the same length and list the embedding methods in the same order.

entity2-embeddings Filepaths of the embeddings containing the entities of the second column (entity2) in the annotation file
The file paths should be given in this format: '["deepwalk_protein.txt", "gf_protein.txt"]'.
When late fusion is applied, entity1-embeddings and entity2-embeddings must have the same length and list the embedding methods in the same order.

cv-type Cross-validation method
Choices are 'kfold', 'stratified' and 'split' (default). 'split' divides the data according to the test-size argument.

cv Number of folds
This argument is used when the cv-type is either 'kfold' or 'stratified'.
Default value is 5.

cv-shuffle Whether to shuffle each class's samples before splitting into batches
This argument is used when the cv-type is either 'kfold' or 'stratified'.

test-size Proportion of the data to use as the test set
The value of this argument must be between 0 and 1. This can be used when cv-type is 'split'.
Default value is 0.2.

imbalance Method for handling imbalanced classes
Choices are:
'equalize': reduces the majority class to the size of the minority class.
'SMOTE': an oversampling method.
'None' (default): does not deal with imbalanced classes.

fselection Feature selection method
Choices are: 'fvalue', 'qvalue', 'MI' or None.
'fvalue' and 'qvalue' are based on ANOVA, which analyses the differences between class means and yields an F-value and a p-value for each feature.
With 'fvalue', the ktop argument selects the features with the K highest F-values.
With 'qvalue', the p-values are Bonferroni-corrected and features with corrected values lower than 0.1 are kept.
'MI' is based on mutual information. Here ktop selects the features with the K highest MI values.

ktop Select K highest value features
Select features according to the k highest scores if feature selection is either fvalue or MI.
Default value is 10.

model Machine Learning models
Choices are 'SVM' (default), 'RF','NB' and 'XGBoost'.
The models should be given in this format: '["SVM"]'
In the case where mixed fusion is applied, the models should be given in this format: '["SVM","RF", "NB", "XGBoost"]'

random_state Seed used to fix the randomization
Default value is None.

kernel Specifies the kernel type to be used in the algorithm
This can be used when model is 'SVM'.
Default is 'linear'.

C Regularization parameter
The default value is 1. This can be used when model is SVM.

ntree The number of trees in the random forest
Default value is 100.

criterion The function to measure the quality of a split in random forest
Choices are 'gini' (default) and 'entropy'

njob The number of parallel jobs to run in random forest.

output The filepath for the predictions and evaluation results
Only provide directory and file prefix. e.g. ./Desktop/DTI_prediction
Click here to see a sample prediction output and here for ROC and PR curves.
In ROC and PR, the label of the positive class is fixed to 1.
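
The difference between the early and late options of the fusion argument can be sketched in a few lines. This is an illustration, not the code in scripts/integration.py: the function names and numbers are hypothetical, and building a pair's features by concatenating the entity1 and entity2 vectors is an assumption.

```python
def early_fusion_features(entity1_vectors, entity2_vectors):
    """Early fusion: merge every method's vector for both entities into one row.

    The merged row is what a single prediction model would be trained on.
    """
    features = []
    for vector in list(entity1_vectors) + list(entity2_vectors):
        features.extend(vector)
    return features


def summed_probability(probabilities):
    """Late fusion: sum the positive-class probabilities from per-method models.

    Mixed fusion sums in the same way, but over different models that were
    all trained on the merged (early fusion) features.
    """
    return sum(probabilities)


# Hypothetical 2-dimensional embeddings from two methods for one drug-protein pair.
hope_drug, lap_drug = [0.5, 0.25], [1.0, 2.0]
hope_protein, lap_protein = [0.125, 4.0], [8.0, 16.0]

print(early_fusion_features([hope_drug, lap_drug], [hope_protein, lap_protein]))
# Hypothetical probabilities from a HOPE-based and a Lap-based model.
print(summed_probability([0.25, 0.5]))
```

In the late case each probability would come from a model trained on one method's embeddings only, which is why the two embedding lists have to follow the same method order.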

Example

The commands below run the Drug-Target Interaction (DTI) link prediction task from start to finish.

# 1) Network Preparation

# Convert drug-drug and drug-disease adjacency matrices to edge lists
python3 scripts/mat2edgelist.py --input ./data/mat_drug_drug.txt --output ./output/edgelist/edgelist_drug_drug.txt
python3 scripts/mat2edgelist.py --input ./data/mat_drug_disease.txt --output ./output/edgelist/edgelist_drug_disease.txt
# Drugs heterogeneous network preparation
python3 scripts/merge_edgelist.py --input1 ./output/edgelist/edgelist_drug_drug.txt --input2 ./output/edgelist/edgelist_drug_disease.txt --rmduplicate --output ./output/edgelist/edgelist_hetero_drugs.txt


# Convert protein-protein adjacency matrix to the edge list
python3 scripts/mat2edgelist.py --input ./data/mat_protein_protein.txt --output ./output/edgelist/edgelist_protein_protein.txt


# 2) Embedding
# Embedding on the drugs heterogeneous network, using the hope and lap methods
python3 scripts/embedding.py --method hope --input ./output/edgelist/edgelist_hetero_drugs.txt --representation_size 20 --output ./output/embedding/hope_20_hetero_drugs.txt
python3 scripts/embedding.py --method lap  --input ./output/edgelist/edgelist_hetero_drugs.txt --representation_size 20 --output ./output/embedding/lap_20_hetero_drugs.txt

# Embedding on the protein-protein edge list, using the hope and lap methods
python3 scripts/embedding.py --method hope --input ./output/edgelist/edgelist_protein_protein.txt --representation_size 20 --output ./output/embedding/hope_20_protein.txt
python3 scripts/embedding.py --method lap  --input ./output/edgelist/edgelist_protein_protein.txt --representation_size 20 --output ./output/embedding/lap_20_protein.txt


# 3) Predictions using the integration of embeddings
# Create annotation file
python3 scripts/mat2edgelist.py --input ./data/mat_drug_protein_remove_homo.txt --directed --keepzero --attribute --output ./output/edgelist/edgelist_drug_protein.txt
# late fusion
python3 scripts/integration.py --fusion late --annotation ./output/edgelist/edgelist_drug_protein.txt --entity1-embeddings '["./output/embedding/hope_20_hetero_drugs.txt","./output/embedding/lap_20_hetero_drugs.txt"]' --entity2-embeddings '["./output/embedding/hope_20_protein.txt","./output/embedding/lap_20_protein.txt"]' --cv-type kfold --cv 10 --imbalance equalize --model '["SVM"]' --random_state 11 --output ./output/prediction/DTI_prediction
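
The late fusion command above passes --imbalance equalize. One plausible reading of that option is random undersampling of the larger class, sketched below. The function `equalize_classes` is hypothetical and BioNE's own implementation may differ.

```python
import random


def equalize_classes(samples, labels, random_state=None):
    """Randomly undersample every class down to the size of the smallest class."""
    rng = random.Random(random_state)  # a fixed seed makes the draw repeatable
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    smallest = min(len(group) for group in by_class.values())
    balanced = []
    for label, group in by_class.items():
        for sample in rng.sample(group, smallest):
            balanced.append((sample, label))
    return balanced


# Hypothetical drug-protein pairs: two positives (1) and four negatives (0).
pairs = ["p1", "p2", "p3", "p4", "p5", "p6"]
labels = [1, 0, 0, 0, 0, 1]
print(equalize_classes(pairs, labels, random_state=11))
```

Undersampling discards data from the majority class, whereas an oversampling option such as SMOTE adds synthetic minority samples instead.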

Citation

Please consider citing the following publication if you found BioNE beneficial in your research:

@article{BioNE,
author = {Parvizi, Poorya and Azuaje, Francisco and Theodoratou, Evropi and Luz, Saturnino},
doi = {10.1101/2022.04.26.489560},
journal = {bioRxiv},
publisher = {Cold Spring Harbor Laboratory},
title = {{BioNE: Integration of network embeddings for supervised learning}},
url = {https://www.biorxiv.org/content/early/2022/04/27/2022.04.26.489560},
year = {2022}
}

BioNE is also archived at Zenodo (https://doi.org/10.5281/zenodo.5500712).

Contact

If you have any questions, please submit an issue on GitHub or send an email to poorya.parvizi@ed.ac.uk.

License

Licensed under the GPLv3 license.
