MAPPIN (Multiple Alignment for Protein Protein Interactions Networks) computes a global many-to-many alignment of multiple PPINs (protein-protein interaction networks) from different species.
Written By: Warith Eddine DJEDDI (waritheddine@yahoo.fr), Sadok BEN YAHIA (sadok.benyahia@fst.rnu.tn) and Engelbert MEPHU NGUIFO (mephu@isima.fr)
This README describes the usage of the command-line interface of MAPPIN. The approach is based on NetCoffee.
The MAPPIN executable is compiled for the Linux x86_64 platform.
The program MAPPIN finds a global alignment of multiple input networks. Given multiple networks with N1, N2,..., Nk nodes each, it returns a matching between the input networks, each match corresponding to best-matching nodes from the multiple networks.
To understand how to use the algorithm, let's start with an example of pairwise alignment of two input networks. The multiple case is similar.
(1) Suppose the species are named 'A' and 'B'
(2) Create the graph files and results from the BLAST runs (between the nodes) in a single directory. You'll need 5 files:
(2.1) Network files: You'll need A.pin and B.pin, tab-separated files in which each line contains one interaction. For example, the first 5 lines of A.pin are:
====== BEGIN ========
INTERACTOR_A INTERACTOR_B
a0 a1
a0 a2
a0 a3
a0 a4
====== END ========
Columns are separated by tabs. The first line is a header line of the
form shown above. All other lines describe an interaction, one per
line.
There may be a third column containing edge weights (0 < wt <= 1).
In that case the header line should have a third column titled Weight_VAL.
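The network file format described above can be checked before running MAPPIN. The following Python sketch is not part of MAPPIN; the function name read_pin is hypothetical, and treating a missing weight as 1.0 is an assumption made only for this illustration.

```python
import io

def read_pin(handle):
    """Parse a .pin network and return (node_a, node_b, weight) tuples."""
    edges = []
    # The first line is the header: INTERACTOR_A, INTERACTOR_B and, optionally, Weight_VAL.
    header = next(handle).rstrip("\n").split("\t")
    weighted = len(header) >= 3
    for line_no, line in enumerate(handle, start=2):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            raise ValueError(f"line {line_no}: expected at least 2 tab-separated columns")
        if weighted and len(fields) >= 3:
            weight = float(fields[2])
            if not 0 < weight <= 1:
                raise ValueError(f"line {line_no}: weight {weight} is outside (0, 1]")
        else:
            weight = 1.0  # assumption: unweighted edges are given weight 1.0 here
        edges.append((fields[0], fields[1], weight))
    return edges

# The first 5 lines of A.pin from the example above.
sample = "INTERACTOR_A\tINTERACTOR_B\na0\ta1\na0\ta2\na0\ta3\na0\ta4\n"
edges = read_pin(io.StringIO(sample))
print(len(edges), edges[0])
```

For a real file, pass an open file object instead, for example read_pin(open("A.pin")).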
(2.2) Results of BLAST runs. You'll need to perform an all-against-all run of BLAST between all the nodes of the two networks. You should store the results in 3 files:
A-B.sim, A-A.sim, B-B.sim
The files contain, as their names indicate, the results of BLAST runs
between species A & B, A & A, and B & B, respectively. IMPORTANT: for
files containing BLAST scores between two species, the filename should
have the species names in lexicographic order, i.e., A-B.sim is
expected, not B-A.sim.
The first 5 lines of the A-B.sim file are:
====== BEGIN ========
a0 b0 1
a1 b1 1
a2 b2 1
a3 b3 1
a4 b4 1
====== END ========
Each line is of the form:
<id1> <id2> <Bit-Score>
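The naming rule and the line format above can be sketched in Python as follows. This is not part of MAPPIN: the function names are hypothetical, and the sketch assumes that "lexicographic order" corresponds to Python's default string ordering and that the three fields are separated by whitespace.

```python
from itertools import combinations_with_replacement

def expected_sim_files(species):
    """List the .sim files needed for the given species names."""
    names = sorted(species)  # assumption: Python string order matches the expected order
    # Every unordered pair, including a species with itself, gets one file.
    return [f"{a}-{b}.sim" for a, b in combinations_with_replacement(names, 2)]

def parse_sim_line(line):
    """Split one '<id1> <id2> <score>' line into its three fields."""
    id1, id2, score = line.split()
    return id1, id2, float(score)

print(expected_sim_files(["B", "A"]))
print(parse_sim_line("a0\tb0\t1"))
```

For k species this yields k(k+1)/2 files, which is 3 files for two species and 6 files for three.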
(2.3) Gene ontology (GO) File
IMPORTANT: Download the gene ontology file from the website "http://www.geneontology.org/" and uncompress it
under the "MAPPIN/data" folder. The file should be named "gene_ontology.1_2.obo".
(2.4) GO annotation files, which contain the GO annotations for the proteins in the input networks. These files should follow the annotation file format (GAF) of the GO Consortium.
A) File: goa_uniprot_gcrp.gaf → This set contains all GO annotations for canonical accessions
from the UniProt reference proteomes for all species, which provide one protein per gene.
B) Files: goa_<species>.gaf for each species → This set contains all GO annotations for canonical accessions
from the UniProt reference proteome for the species, which provides one protein per gene.
IMPORTANT: Download the gene annotation files for each species from the UniProt website "http://www.ebi.ac.uk/goa/downloads" and uncompress them under the "MAPPIN/data/goa" folder.
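To check that a downloaded annotation file is readable, one can count its annotations per GO aspect. The Python sketch below is not part of MAPPIN. It relies on the standard GAF layout (tab-separated columns, comment lines starting with "!", GO ID in column 5, aspect letter F, P or C in column 9). The function name and the accessions in the sample are made up for illustration.

```python
import io
from collections import Counter

# GAF aspect letters mapped to the usual abbreviations.
ASPECTS = {"F": "MF", "P": "BP", "C": "CC"}

def count_aspects(handle):
    """Count GAF annotation lines per GO aspect (MF, BP, CC)."""
    counts = Counter()
    for line in handle:
        if line.startswith("!") or not line.strip():
            continue  # skip comment and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete annotation line
        counts[ASPECTS.get(cols[8], "unknown")] += 1
    return counts

# Made-up rows, truncated after the aspect column, for illustration only.
rows = [
    ["UniProtKB", "X00001", "geneA", "", "GO:0003674", "REF:0", "ND", "", "F"],
    ["UniProtKB", "X00001", "geneA", "", "GO:0008150", "REF:0", "ND", "", "P"],
    ["UniProtKB", "X00002", "geneB", "", "GO:0005575", "REF:0", "ND", "", "C"],
]
sample = "!gaf-version: 2.2\n" + "".join("\t".join(r) + "\n" for r in rows)
counts = count_aspects(io.StringIO(sample))
print(dict(counts))
```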
(3) Create a file named "policy.input" in the folder config/ that specifies the file locations, species names, etc.
For example, suppose we have three PPI networks (a.pin, b.pin, c.pin), the GO association files for the 3 species (a.gaf, b.gaf, c.gaf) and six BLAST e-value (or bitscore) files (a-a.sim, a-b.sim, ..., c-c.sim). The policy file should then look like:
--------------BEGIN OF POLICY--------------
PARAMETER VALUE
network dataset/a.pin
network dataset/b.pin
network dataset/c.pin
goafile data/goa/a.gaf
goafile data/goa/b.gaf
goafile data/goa/c.gaf
efile dataset/a-a.sim
efile dataset/a-b.sim
efile dataset/a-c.sim
efile dataset/b-b.sim
efile dataset/b-c.sim
efile dataset/c-c.sim
scorefile dataset/score_composit.model
alignmentfile ./result/alignment_mappin.data
logfile ./result/measure_time.txt
--------------END OF POLICY--------------
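For more than a few species, writing the policy file by hand becomes error-prone. The Python sketch below generates the same lines as the example policy above from a list of species names. It is not part of MAPPIN: the function name is hypothetical, the folder names are copied from the example, and the single-space column separator is an assumption, so adjust the sep argument if your copy of policy.input uses tabs.

```python
from itertools import combinations_with_replacement

def policy_lines(species, sep=" "):
    """Build the lines of a policy file for the given species names."""
    names = sorted(species)
    rows = [("PARAMETER", "VALUE")]
    rows += [("network", f"dataset/{s}.pin") for s in names]
    rows += [("goafile", f"data/goa/{s}.gaf") for s in names]
    # One similarity file per unordered species pair, names in sorted order.
    rows += [("efile", f"dataset/{a}-{b}.sim")
             for a, b in combinations_with_replacement(names, 2)]
    rows += [
        ("scorefile", "dataset/score_composit.model"),
        ("alignmentfile", "./result/alignment_mappin.data"),
        ("logfile", "./result/measure_time.txt"),
    ]
    return [sep.join(row) for row in rows]

lines = policy_lines(["a", "b", "c"])
print("\n".join(lines))
```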
(4) Call the code. Here are samples:
(4.1) Network files using BLAST e-values:
./mappin -alignment -alpha 0.3 -nmax 1000 -temp 50 -thr 0.3 -numspecies 3 -numthreads 8 -alignmentfile ./result/alignment_mappin.data -resultfolder ./result/
(4.2) Network files using BLAST bitscores:
./mappin -alignment -alpha 0.3 -nmax 1000 -temp 50 -thr 0.3 -numspecies 3 -bscore true -numthreads 8 -alignmentfile ./result/alignment_mappin.data -resultfolder ./result/
The options are as follows (you can also use the "-h" or "--help" flag):
Usage:
./mappin -version|-alignment [--help|-h|-help] [-alignmentfile str]
[-alpha num] [-bscore] [-edgefactor num] [-nmax int] [-numspecies int]
[-numthreads int] [-resultfolder str] [-temp int] [-thr num]
Where:
--help|-h|-help
Print a short help message
-alignment
Execute the alignment algorithm.
-alignmentfile str
The name of the file for the alignment of the protein-protein interaction networks.
-alpha num
Parameter controlling how much the biological score contributes to the alignment score. Default is 0.3.
-bscore
Define bitscore as the similarity for the edges.
-edgefactor num
The factor of the power law normalization. Default is 0.1.
-nmax int
The parameter N for the Simulated Annealing algorithm.
-numspecies int
Number of the species used in the aligning process. Default is 3.
-numthreads int
Specifies the number of threads used to run MAPPIN.
The recommended number of threads is the number of cores available in the computer.
-resultfolder str
The folder in which the alignment results are stored.
-temp int
The number of iterations for the Simulated Annealing algorithm.
-thr num
The threshold for the alignment of protein-protein interaction networks.
-version
Show the version number of MAPPIN.
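When MAPPIN is launched from a script, the command line can be assembled from the options listed above. The Python sketch below only builds and prints the command and does not run the binary. The function name and its keyword arguments are hypothetical; the flags and default values are those of the sample calls in this README, including "-bscore true" as written in sample (4.2).

```python
import shlex

def mappin_command(num_species, num_threads, alpha=0.3, nmax=1000, temp=50,
                   thr=0.3, bitscore=False,
                   alignment_file="./result/alignment_mappin.data",
                   result_folder="./result/"):
    """Return the MAPPIN alignment command as an argument list."""
    args = ["./mappin", "-alignment",
            "-alpha", str(alpha), "-nmax", str(nmax),
            "-temp", str(temp), "-thr", str(thr),
            "-numspecies", str(num_species)]
    if bitscore:
        args += ["-bscore", "true"]  # same form as sample (4.2)
    args += ["-numthreads", str(num_threads),
             "-alignmentfile", alignment_file,
             "-resultfolder", result_folder]
    return args

command = mappin_command(num_species=3, num_threads=8)
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) from the MAPPIN folder. os.cpu_count() is one way to obtain a thread count that matches the recommendation above.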
(5) The output
The main files are:
(5.1) The output of the alignment is located in the file "alignment_mappin.data". Each protein is represented
by a string, proteins are separated by tabs, and each interaction is on a single line.
(5.2) The assignment of gene annotations to each protein of the N input networks is located
in the file "affectProtToGene.txt".
(5.3) The number of proteins of unknown function is located in the file "alignUnknown.result".
(5.4) The time required to align the input networks is located in the file "measure_time.txt".
(5.5) The sequence, topological and functional similarities for each pair of compared proteins
are located in the file "scoreRecords.txt".
(5.6) The conserved alignment edges for each species, the number of alignment nodes,
the alpha parameter, nmax and the alignment score are located in the file "alignment_statistics.data".
(5.7) The number of unknown alignment records and the number of proteins annotated with MF, BP and CC in the alignment graph
are located in the file "statistics.result".
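The alignment file from (5.1) can be loaded for further analysis with a few lines of Python. This sketch is not part of MAPPIN and assumes only what is stated in (5.1): one record per line, with protein identifiers separated by tabs. The function name and the sample identifiers are made up for illustration.

```python
import io

def read_alignment(handle):
    """Return one list of protein identifiers per non-empty line."""
    records = []
    for line in handle:
        line = line.rstrip("\n")
        if line.strip():
            records.append(line.split("\t"))
    return records

# Made-up content standing in for ./result/alignment_mappin.data.
sample = "a0\tb0\na1\tb1\tb7\n"
records = read_alignment(io.StringIO(sample))
print(len(records), records[1])
```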
(6) EXAMPLE
(6.1) Download MAPPIN, freely available on GitHub: https://github.com/waritheddine/MAPPIN
(6.2) Run MAPPIN on our test dataset with command:
./mappin -alignment -alpha 0.3 -nmax 1000 -temp 50 -thr 0.3 -numspecies 8 -bscore true -numthreads 8 -alignmentfile ./result/alignment_mappin.data -resultfolder ./result/
(6.3) You can then find all the output files in ./result/ . Further options can be listed with the "-help" option.
(6.4) After reading the instructions above, it may help to check the format of the files in the data folder.
(6.5) For this example, download the gene annotation files (goa_arabidopsis.gaf.gz, goa_worm.gaf.gz, goa_fly.gaf.gz, goa_ecoli.gaf.gz, goa_human.gaf.gz, goa_mouse.gaf.gz, goa_rat.gaf.gz, goa_yeast.gaf.gz) for the eight species
(Arabidopsis, Worm, Fly, Ecoli, Human, Mouse, Rat and Yeast) from the UniProt website "http://www.ebi.ac.uk/GOA/downloads" and uncompress them under the "MAPPIN/data/goa" folder. In addition, download the compressed file "goa_uniprot_gcrp.gaf.tar.gz" from the same URL and uncompress it under the same folder.
Finally, download the gene ontology file from the website "http://www.geneontology.org/" and uncompress it under the "MAPPIN/data" folder.