PIGNON is a protein-protein interaction (PPI)-guided functional enrichment analysis for quantitative proteomics. This algorithm measures the clustering of proteins with a shared Gene Ontology (GO) annotation within the provided PPI network weighted with quantitative proteomics data. The significance of this clustering measure is then estimated from a normal distribution approximated from a Monte Carlo Sampling Distribution. To correct for multiple hypothesis testing, we assess the false discovery rate at various thresholds against a null model. We tested PIGNON using a breast cancer dataset generated by Tyanova et al.
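The significance step described above can be sketched as follows. This is a minimal illustration, not PIGNON's actual code: it assumes a distance-like clustering score where smaller values mean tighter clustering, and the class and method names are hypothetical. The normal CDF uses a standard polynomial approximation so the sketch needs no external library.

```java
import java.util.Random;

// Illustrative sketch only: names and the scoring direction are assumptions.
class MonteCarloSignificanceSketch {

    /** Mean of the sampled clustering scores. */
    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    /** Sample standard deviation (n - 1 denominator). */
    static double stdDev(double[] xs) {
        double m = mean(xs);
        double ss = 0.0;
        for (double x : xs) ss += (x - m) * (x - m);
        return Math.sqrt(ss / (xs.length - 1));
    }

    /** Standard normal CDF, Abramowitz and Stegun formula 26.2.17 (absolute error below 7.5e-8). */
    static double standardNormalCdf(double z) {
        double t = 1.0 / (1.0 + 0.2316419 * Math.abs(z));
        double poly = t * (0.319381530 + t * (-0.356563782
                + t * (1.781477937 + t * (-1.821255978 + t * 1.330274429))));
        double upperTail = Math.exp(-0.5 * z * z) / Math.sqrt(2.0 * Math.PI) * poly;
        return z >= 0 ? 1.0 - upperTail : upperTail;
    }

    /** Lower-tail p-value of the observed score under a normal fit to the Monte Carlo scores. */
    static double lowerTailPValue(double observed, double[] nullScores) {
        double z = (observed - mean(nullScores)) / stdDev(nullScores);
        return standardNormalCdf(z);
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        double[] nullScores = new double[1000];
        for (int i = 0; i < nullScores.length; i++) {
            // Stand-in for the scores of randomly sampled protein sets.
            nullScores[i] = 50.0 + 5.0 * rng.nextGaussian();
        }
        double observed = 38.0; // made-up score for one annotation
        System.out.printf("p = %.3g%n", lowerTailPValue(observed, nullScores));
    }
}
```

Fitting a normal distribution to the sampled scores lets very small p-values be estimated without drawing an impractically large number of random sets.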
PIGNON is a Java application that can be run from the command line. You will need to download the PIGNON.jar file.
Note: We recommend running a first instance of PIGNON on your chosen PPI network with your quantitative data and running a second instance without the quantitative data in order to eliminate results that are significant only due to the innate network topology.
- Java Version 8+
- Required library: The Apache Commons Mathematics Library (commons-math3-3.6.1.jar)
Example files can be found under: input_files
Example BioGRID repository : BIOGRID-ORGANISM-Homo_sapiens-3.4.161.tab2.txt
PIGNON is currently set up to run on either the BioGRID or STRING networks. In the params file, specify the network type, either BioGRID (0) or STRING (1), and the taxonomy ID of the species, e.g. human (9606).
For an alternative PPI network, you can format your network as a tab-delimited file where each row is an interaction formatted as specified below. In the params file, set the network type to BioGRID (0) and leave the taxonomy ID blank.
##### Required format (note: only the numbered columns below are required, the other columns can be blank)
Column 2: EntrezID 1 | Column 3: EntrezID 2 | Column 8: HGNC symbol 1 | Column 9: HGNC symbol 2 | Column 16: SpeciesID 1 | Column 17: SpeciesID 2 |
---|---|---|---|---|---|
6416 | 2318 | MAP2K4 | FLNC | 9606 | 9606 |
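To check that a custom network file matches this layout, a row can be read as in the sketch below. It is only an illustration of the column positions given above; the class, field and method names are hypothetical and are not part of PIGNON.

```java
import java.util.Arrays;

// Illustrative sketch only: class and field names are hypothetical, not PIGNON's API.
class NetworkRowSketch {

    /** The six required fields of one interaction row. */
    static final class Interaction {
        final String entrezId1, entrezId2, symbol1, symbol2, speciesId1, speciesId2;

        Interaction(String e1, String e2, String s1, String s2, String t1, String t2) {
            entrezId1 = e1; entrezId2 = e2;
            symbol1 = s1; symbol2 = s2;
            speciesId1 = t1; speciesId2 = t2;
        }
    }

    /** Parses one tab-delimited row; columns are 1-based in the text and 0-based here. */
    static Interaction parseRow(String line) {
        String[] f = line.split("\t", -1); // -1 keeps trailing blank columns
        if (f.length < 17) {
            throw new IllegalArgumentException("Expected at least 17 columns, found " + f.length);
        }
        return new Interaction(f[1], f[2], f[7], f[8], f[15], f[16]);
    }

    /** Builds the example row shown above, leaving the unused columns blank. */
    static String exampleRow() {
        String[] cols = new String[17];
        Arrays.fill(cols, "");
        cols[1] = "6416";   cols[2] = "2318";
        cols[7] = "MAP2K4"; cols[8] = "FLNC";
        cols[15] = "9606";  cols[16] = "9606";
        return String.join("\t", cols);
    }

    public static void main(String[] args) {
        Interaction i = parseRow(exampleRow());
        System.out.println(i.symbol1 + " (" + i.entrezId1 + ") - " + i.symbol2 + " (" + i.entrezId2 + ")");
    }
}
```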
Example mapping file : mapStringProteins_9606.v11.tsv
This tab-delimited text file was generated by combining the STRING accessory files 9606.protein.info.v11.0.gz and human.entrez_2_string.2018.tsv.gz. The mapping file is formatted as follows:
hgnc_symbol | protein_external_id | entrezID |
---|---|---|
ARF5 | 9606.ENSP00000000233 | 381 |
3. Propagated Gene Ontology terms
Example functional annotation file : GO_annotations-9606-inferred-allev-2.tsv
PIGNON is currently set up to run using only this type of annotation file.
Alternatively, you can format your annotations as a tab-delimited file where every row is a new annotation. Required information :
- column 1: Annotation ID
- column 2: Annotation Name (can be blank)
- column 3: Annotation descriptor (can be left blank)
- column 7: List of EntrezGene IDs, where elements are separated by a pipe (|)
- column 8: List of HGNC symbols, where elements are separated by a pipe (|)
Column 1: AnnotationID | Column 2: Annotation Name | Column 3: Annotation descriptor | Column 7: EntrezGene IDs | Column 8: hgnc_symbols |
---|---|---|---|---|
GO:0000015 | phosphopyruvate hydratase complex | cellular_component | 2023\|2026\|2027\|387712 | ENO1\|ENO2\|ENO3\|ENO4 |
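The pipe-separated lists in columns 7 and 8 can be read as in the sketch below. It only illustrates the layout described above, with hypothetical class and method names, and is not taken from PIGNON's source.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative sketch only: class and field names are hypothetical, not PIGNON's API.
class AnnotationRowSketch {

    /** The required fields of one annotation row. */
    static final class Annotation {
        final String id, name, descriptor;
        final List<String> entrezIds, symbols;

        Annotation(String id, String name, String descriptor,
                   List<String> entrezIds, List<String> symbols) {
            this.id = id; this.name = name; this.descriptor = descriptor;
            this.entrezIds = entrezIds; this.symbols = symbols;
        }
    }

    /** Splits a pipe-separated field; "|" is a regex metacharacter, so it is escaped. */
    static List<String> splitPipeList(String field) {
        if (field.trim().isEmpty()) return Collections.emptyList();
        return Arrays.asList(field.split("\\|"));
    }

    /** Parses one tab-delimited row; columns 4 to 6 are ignored. */
    static Annotation parseRow(String line) {
        String[] f = line.split("\t", -1); // -1 keeps blank columns
        if (f.length < 8) {
            throw new IllegalArgumentException("Expected at least 8 columns, found " + f.length);
        }
        return new Annotation(f[0], f[1], f[2], splitPipeList(f[6]), splitPipeList(f[7]));
    }

    public static void main(String[] args) {
        String row = "GO:0000015\tphosphopyruvate hydratase complex\tcellular_component"
                + "\t\t\t\t2023|2026|2027|387712\tENO1|ENO2|ENO3|ENO4";
        Annotation a = parseRow(row);
        System.out.println(a.id + " has " + a.symbols.size() + " annotated genes: " + a.symbols);
    }
}
```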
Example quantitative proteomics dataset : formatted-BreastCancerProteinExpression.txt
This is a tab delimited text file where each row represents the quantitative information for a given gene/protein in 2 or more conditions. Required information:
- column 1: HGNC_symbol
- columns 2-n: quantified values in the 2 studied conditions. The column labels for each condition must correspond to the labels specified in the params file (i.e. if condition1 = Her2 in the params file, all columns for condition1 in this file should be labelled Her2.n). The order of the columns is not important.
Any missing values should be represented by NA.
HGNC_symbol | Condition1.1 | Condition1.2 | Condition2.1 | Condition2.2 | ConditionX.n |
---|---|---|---|---|---|
STARD13 | 2.5 | 1.8 | 0.7 | NA | ... |
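The labelling convention above (condition label, a dot, then a replicate number) can be checked with a sketch like the following. It groups header columns by condition and averages a row while skipping NA values. How PIGNON itself treats missing values is not stated here, so the averaging is an assumption, as are the class and method names and the `TN` label used for the second condition.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: names and the NA handling are assumptions, not PIGNON's behaviour.
class ConditionColumnsSketch {

    /** Maps each condition label (header text before the last '.') to its column indices. */
    static Map<String, List<Integer>> groupColumns(String[] header) {
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        for (int i = 1; i < header.length; i++) { // index 0 is HGNC_symbol
            int dot = header[i].lastIndexOf('.');
            String label = dot < 0 ? header[i] : header[i].substring(0, dot);
            groups.computeIfAbsent(label, k -> new ArrayList<>()).add(i);
        }
        return groups;
    }

    /** Mean of the listed columns, skipping NA; NaN if every value is missing. */
    static double meanIgnoringNA(String[] row, List<Integer> columns) {
        double sum = 0.0;
        int n = 0;
        for (int c : columns) {
            if (!"NA".equals(row[c])) {
                sum += Double.parseDouble(row[c]);
                n++;
            }
        }
        return n == 0 ? Double.NaN : sum / n;
    }

    public static void main(String[] args) {
        String[] header = "HGNC_symbol\tHer2.1\tHer2.2\tTN.1\tTN.2".split("\t");
        String[] row = "STARD13\t2.5\t1.8\t0.7\tNA".split("\t");
        Map<String, List<Integer>> groups = groupColumns(header);
        for (Map.Entry<String, List<Integer>> e : groups.entrySet()) {
            System.out.println(e.getKey() + " mean: " + meanIgnoringNA(row, e.getValue()));
        }
    }
}
```

Because columns are grouped by label rather than by position, the order of the columns in the file does not matter.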
This is a tab-delimited text file where each row represents the quantitative information for a given gene/protein across 2 conditions. Required information:
- column 1 : HGNC_symbol
- column 2 : fold-change of the protein across the 2 conditions of interest
HGNC_symbol | Fold-change |
---|---|
STARD13 | 0.432 |
Example params file : params.txt
A template of this text file must be used to run PIGNON. It is passed to the program as a command-line argument.
It is important to specify the working directory, file paths and the proper parameters. A detailed explanation of these parameters can be found here.
Note: These files will be generated in a sub-directory of your working directory labelled `IO_files`, which is created automatically by the program.
- Initial distance matrix
- Final distance matrix of fully connected component
- Monte Carlo distribution (generated in a sub-directory `mcDistribution`, created automatically by the program)
- Normal Distribution parameters calculated from the Monte Carlo Distribution File
- Shuffled Gene Ontology set
Note: These files will be generated in a sub-directory of your working directory labelled `output_files`, which is created automatically by the program.
- False discovery rates mapped to significance thresholds (.tsv): this file should be used to identify an FDR cut-off
- Stats summary of tested GO terms (.tsv)
- Detailed results for every GO annotation (.tsv): this file contains the biological results of interest to users
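To illustrate how an FDR cut-off can be read from p-value thresholds, the sketch below uses one common empirical estimate: the fraction of null (shuffled-annotation) p-values passing a threshold divided by the fraction of observed p-values passing it. This is an assumption made for illustration and is not confirmed to be PIGNON's exact formula. The class name, method names and p-values are made up.

```java
// Illustrative sketch only: a generic empirical FDR estimate, not PIGNON's confirmed method.
class EmpiricalFdrSketch {

    /** Number of p-values at or below the threshold. */
    static int countAtOrBelow(double[] pValues, double threshold) {
        int n = 0;
        for (double p : pValues) {
            if (p <= threshold) n++;
        }
        return n;
    }

    /**
     * Estimated FDR at a threshold: fraction of null p-values passing,
     * divided by the fraction of observed p-values passing, capped at 1.
     */
    static double estimateFdr(double[] observed, double[] nullModel, double threshold) {
        int passingObserved = countAtOrBelow(observed, threshold);
        if (passingObserved == 0) {
            return 0.0; // no discoveries, so by convention no false discoveries
        }
        double nullFraction = (double) countAtOrBelow(nullModel, threshold) / nullModel.length;
        double observedFraction = (double) passingObserved / observed.length;
        return Math.min(1.0, nullFraction / observedFraction);
    }

    public static void main(String[] args) {
        double[] observed = {0.001, 0.004, 0.03, 0.2, 0.6};  // made-up p-values for real annotations
        double[] nullModel = {0.02, 0.3, 0.5, 0.7, 0.9};     // made-up p-values for shuffled annotations
        for (double t : new double[] {0.01, 0.05, 0.5}) {
            System.out.printf("threshold %.2f: estimated FDR %.3f%n", t, estimateFdr(observed, nullModel, t));
        }
    }
}
```

A cut-off is then chosen as the largest threshold whose estimated FDR stays below the level you are willing to accept.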
Note: We recommend running PIGNON on a computer with a minimum of 8 GB of RAM. The program can run for up to 24 hours.
- Download the `PIGNON.jar` file.
- Prepare your input files as specified above.
- Open a terminal (Mac/Linux) or command prompt (Windows) and navigate to where `PIGNON.jar` is stored.
- Enter the command: `java -Xmx8g -jar PIGNON.jar file/path/params.txt`