This is the package of our winning algorithm in the 2017 NCI-CPTAC DREAM Proteogenomics Challenge.
background: Proteogenomics Challenge
see also: Hongyang Li and Yuanfang Guan's 1st Place Solution
Please contact us (hyangl@umich.edu or gyuanfan@umich.edu) if you have any questions or suggestions.
Clone a copy of the code:
git clone https://github.com/GuanLab/phosphoproteome_prediction.git
- R (3.4.3)
- python (3.6.5)
- numpy (1.13.3). It comes pre-packaged in Anaconda.
- scikit-learn (0.19.0). A popular machine learning package. It can be installed with:
pip install -U scikit-learn
All the omic data are 2D matrices, where columns are cancer samples and rows are genes/proteins/phosphorylation sites. The proteomic and phosphoproteomic data originally came from CPTAC-breast and CPTAC-ovary. The genomic data originally came from TCGA-breast and TCGA-ovary.
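As a rough illustration of this layout, the sketch below parses a tiny matrix with features as rows and samples as columns, then transposes it so that samples become rows, which is the orientation scikit-learn expects. The tab delimiter, the header row of sample IDs, the `NA` marker for missing values, and the helper name `read_matrix` are all assumptions made for this example; check the files in data/raw/ for the actual format.

```python
import io
import numpy as np

# Hypothetical toy matrix (assumed format): tab-delimited, the header holds
# sample IDs and the first column holds gene/protein/site IDs.
toy = "ID\tS1\tS2\tS3\nGENE_A\t0.1\tNA\t-0.3\nGENE_B\t1.2\t0.4\t0.0\n"

def read_matrix(handle):
    """Return (feature IDs, sample IDs, features-by-samples array)."""
    header = handle.readline().rstrip("\n").split("\t")
    samples = header[1:]
    features, rows = [], []
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        features.append(fields[0])
        # "NA" is assumed to mark a missing measurement
        rows.append([np.nan if v == "NA" else float(v) for v in fields[1:]])
    return features, samples, np.array(rows, dtype=float)

features, samples, values = read_matrix(io.StringIO(toy))

# Rows are features and columns are samples; transpose for scikit-learn.
X = values.T
```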
During the challenge, we downloaded these data directly from the challenge website. Unfortunately, this download link is no longer available to unregistered users. We therefore provide dummy example data in the directory data/raw/.
- retrospective_breast_CNA_sort_common_gene_16884.txt
- retrospective_breast_phospho_sort_common_gene_31981.txt
- retrospective_breast_proteome_sort_common_gene_10005.txt
- retrospective_breast_RNA_sort_common_gene_15107.txt
- retrospective_ova_CNA_sort_common_gene_11859.txt
- retrospective_ova_JHU_proteome_sort_common_gene_7061.txt
- retrospective_ova_phospho_filtered.txt
- retrospective_ova_phospho_sort_common_gene_10057.txt
- retrospective_ova_PNNL_proteome_sort_common_gene_7061.txt
- retrospective_ova_rna_seq_sort_common_gene_15121.txt
Then preprocess the data and generate the 5-fold cross-validation sets using the code in:
- data/trimmed_set
- data/normalization
- data/cv_set
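For orientation, a 5-fold split over samples could look like the sketch below. It uses scikit-learn's `KFold` on made-up sample IDs; the sample count, the shuffling, and the random seed are assumptions for illustration and are not taken from the scripts in data/cv_set.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical sample IDs; in the real data these would be the column names.
n_samples = 20
sample_ids = np.array(["sample_%d" % i for i in range(n_samples)])

# Split the samples (not the features) into 5 folds.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
folds = [(sample_ids[train], sample_ids[test])
         for train, test in kf.split(sample_ids)]

for k, (train_ids, test_ids) in enumerate(folds):
    print("fold %d: %d training samples, %d held-out samples"
          % (k, len(train_ids), len(test_ids)))
```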
We provide two parallel sets of code, one for each cancer type:
- prediction/breast
- prediction/ova
This model directly approximates the phosphorylation level based on the corresponding parent protein level.
prediction/breast/proteome/
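A minimal sketch of this idea, assuming each phosphorylation site ID starts with its parent protein name followed by a colon (a hypothetical ID format; the real files may differ). The helper `predict_from_parent` is an illustrative name, not a function from this repository, and it simply copies the parent protein's abundance profile as the prediction for every site on that protein.

```python
import numpy as np

# Toy protein matrix: rows are proteins, columns are samples.
protein_ids = ["AKT1", "TP53"]
protein = np.array([[0.5, -0.2, 1.0],
                    [0.1,  0.3, -0.4]])

# Hypothetical site IDs of the form "<parent protein>:<site>".
site_ids = ["AKT1:S473", "TP53:S15", "TP53:S20"]

row_of = {p: i for i, p in enumerate(protein_ids)}

def predict_from_parent(site_ids, protein, row_of):
    """Use the parent protein's level as the predicted phosphorylation level."""
    pred = np.full((len(site_ids), protein.shape[1]), np.nan)
    for k, site in enumerate(site_ids):
        parent = site.split(":")[0]
        if parent in row_of:  # sites without a measured parent stay NaN
            pred[k] = protein[row_of[parent]]
    return pred

pred = predict_from_parent(site_ids, protein, row_of)
```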
This model considers the protein-protein interactions in regulating phosphorylation, in which all protein levels were used as features to make predictions. The base learner is random forest with maximum depth of 3 and 100 trees.
prediction/breast/individual/
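The base learner described above can be configured in scikit-learn roughly as follows. Only `n_estimators=100` and `max_depth=3` come from the description; the synthetic data, the random seed, and training one regressor for a single target site are assumptions made for this sketch.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)

# Synthetic stand-in data: samples are rows, protein levels are columns.
n_train, n_test, n_proteins = 40, 10, 25
X_train = rng.randn(n_train, n_proteins)
X_test = rng.randn(n_test, n_proteins)

# Phosphorylation level of one hypothetical target site in the training samples.
y_train = 0.8 * X_train[:, 0] + 0.1 * rng.randn(n_train)

# Random forest with 100 trees, each limited to a maximum depth of 3.
model = RandomForestRegressor(n_estimators=100, max_depth=3, random_state=0)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)  # one predicted level per held-out sample
```

In practice this would be repeated for each phosphorylation site, with all protein levels as features each time.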
Similar to the "site-specific" model, but this model is trained on the combined breast and ovarian cancer samples.
prediction/breast/individual_transplant/
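One way to pool samples from the two cancer types is to keep only the proteins measured in both and place the sample columns side by side, as sketched below. The toy IDs and values are made up, and restricting to the shared proteins is an assumption for this example; the scripts in this directory may align the data differently.

```python
import numpy as np

# Toy matrices: rows are proteins, columns are samples.
breast_ids = ["A", "B", "C", "D"]
ova_ids = ["B", "D", "E"]
breast = np.arange(12, dtype=float).reshape(4, 3)  # 4 proteins x 3 samples
ova = np.arange(6, dtype=float).reshape(3, 2)      # 3 proteins x 2 samples

# Keep the proteins present in both data sets, in a fixed order.
common = sorted(set(breast_ids) & set(ova_ids))
b_row = {p: i for i, p in enumerate(breast_ids)}
o_row = {p: i for i, p in enumerate(ova_ids)}

# Stack the samples of both cancer types over the shared proteins.
combined = np.hstack([
    breast[[b_row[p] for p in common]],
    ova[[o_row[p] for p in common]],
])
```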
This model considers the associations between phosphorylation sites of the same parent protein.
prediction/breast/multisite/
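To make the grouping concrete, the sketch below finds, for each phosphorylation site, the other sites that share its parent protein. It only illustrates how such "sibling" sites could be collected as candidate features. The `"<protein>:<site>"` ID format and the helper name `sibling_index` are assumptions for this example, and the sketch does not reproduce the model itself.

```python
from collections import defaultdict

# Hypothetical site IDs of the form "<parent protein>:<site>".
site_ids = ["AKT1:S473", "AKT1:T308", "TP53:S15",
            "TP53:S20", "TP53:S392", "EGFR:Y1068"]

def sibling_index(site_ids, sep=":"):
    """Map each site's row index to the row indices of its sibling sites."""
    groups = defaultdict(list)
    for i, site in enumerate(site_ids):
        groups[site.split(sep)[0]].append(i)
    return {
        i: [j for j in groups[site_ids[i].split(sep)[0]] if j != i]
        for i in range(len(site_ids))
    }

siblings = sibling_index(site_ids)
```

A site whose parent protein has no other measured sites (here `EGFR:Y1068`) gets an empty list, so it would need to fall back on one of the other models.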
Our final predictions are an ensemble of the four models described above.
prediction/breast/final/
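The simplest form of such an ensemble is an unweighted average of the per-model predictions, sketched below. The equal weights and the choice to skip missing predictions with `np.nanmean` are assumptions for illustration; the code in this directory defines the weighting actually used.

```python
import numpy as np

# Toy predictions from four models: rows are sites, columns are samples.
m1 = np.array([[0.2, 0.4], [np.nan, 1.0]])  # NaN: no prediction from this model
m2 = np.array([[0.4, 0.0], [0.5, 1.0]])
m3 = np.array([[0.6, 0.2], [0.7, 1.0]])
m4 = np.array([[0.8, 0.2], [0.9, 1.0]])

# Stack to shape (models, sites, samples) and average over the model axis,
# ignoring models that have no prediction for a given entry.
preds = np.stack([m1, m2, m3, m4])
final = np.nanmean(preds, axis=0)
```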
analysis_sub3/