Platelet proteomics to understand the pathophysiology of immune thrombocytopenia: studies in mouse models

Proteomics data analysis (DDA, label-free) performed in the following article:

Patricia Martínez-Botía, Marjolein Meinders, Iris M De Cuyper, Johannes A Eble, John W Semple, Laura Gutiérrez; Platelet Proteomics to Understand the Pathophysiology of Immune Thrombocytopenia: Studies in Mouse Models. Blood Adv 2022; bloodadvances.2021006438. doi: https://doi.org/10.1182/bloodadvances.2021006438

Mass spectrometry data analysis

The RAW MS files were processed with the MaxQuant computational platform (version 1.6.17.0) running on Linux.¹,² Searches were configured for label-free quantification (LFQ), with the LFQ, match between runs and iBAQ options enabled, while the remaining parameters were left at their defaults. Peptides and proteins were identified using the Andromeda³ search engine against the Mus musculus UniProt Swiss-Prot protein database (downloaded November 2020, 17,051 entries). The ‘proteinGroups.txt’ output table was further analyzed within the R environment (version 4.0.3) (R Project for Statistical Computing).⁴ Potential contaminants, proteins only identified by site and reverse hits were filtered out, and iBAQ intensities were log2-transformed. Biological replicates that did not pass quality control (namely, more than 35% missing values) were not considered further, since they would otherwise have confounded the results.
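The filtering and quality-control steps can be illustrated with a minimal Python sketch. It is not the repository's prep_data.R script: the function names, sample names and intensities are hypothetical. It assumes that flagged rows carry a "+" in the MaxQuant flag columns and that the 35% missing-value check is applied per sample after those rows are removed.

```python
import math

# Flag columns as they appear in MaxQuant's proteinGroups.txt; "+" marks a hit.
FLAG_COLUMNS = ("Potential contaminant", "Only identified by site", "Reverse")


def drop_flagged(rows):
    """Keep only protein groups with no '+' in any flag column."""
    return [r for r in rows if not any(r.get(col) == "+" for col in FLAG_COLUMNS)]


def log2_intensities(rows, samples):
    """Log2-transform intensities; zeros (not quantified) become None."""
    return [
        {s: (math.log2(r[s]) if r[s] > 0 else None) for s in samples}
        for r in rows
    ]


def failing_samples(matrix, samples, max_missing=0.35):
    """Return samples whose fraction of missing values exceeds max_missing."""
    n = len(matrix)
    return [
        s for s in samples
        if sum(row[s] is None for row in matrix) / n > max_missing
    ]


# Toy data (hypothetical sample names and intensities).
samples = ["iBAQ A", "iBAQ B"]
rows = [
    {"Potential contaminant": "+", "iBAQ A": 512.0, "iBAQ B": 256.0},
    {"Reverse": "+", "iBAQ A": 64.0, "iBAQ B": 0.0},
    {"iBAQ A": 1024.0, "iBAQ B": 2048.0},
    {"iBAQ A": 4096.0, "iBAQ B": 0.0},
    {"iBAQ A": 256.0, "iBAQ B": 0.0},
]

kept = drop_flagged(rows)                         # 3 protein groups remain
log_matrix = log2_intensities(kept, samples)      # log2 scale, None = missing
excluded = failing_samples(log_matrix, samples)   # samples above 35% missing
print(excluded)
```

In this toy table, sample "iBAQ B" is missing two of the three remaining proteins, so it is the one flagged for exclusion.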

Protein hits were retained when they were present in all samples of at least one group. Intensities were quantile normalized, and missing data were then imputed with the quantile regression imputation of left-censored data (QRILC) approach.⁵ Next, we removed the batch effect introduced by the different mouse models using the conservative ‘removeBatchEffect’ function from the ‘limma’ package.⁶ The resulting dataset was then fitted with a linear model combined with empirical Bayes statistics, as implemented in the same package, to test for differential protein expression in every pairwise comparison. Protein hits with adjusted p-values (Benjamini-Hochberg method) lower than 0.05 in any given comparison were considered differentially expressed and were used for further analysis.
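As an illustration of the first step in this paragraph, the following Python sketch keeps a protein only if it has a value in every sample of at least one group. The group names, sample names and values are hypothetical, and the repository's own implementation is in R.

```python
def keep_if_complete_in_any_group(matrix, groups):
    """Keep proteins quantified in all samples of at least one group.

    matrix: {protein: {sample: value or None}}
    groups: {group name: [sample, ...]}
    """
    kept = {}
    for protein, values in matrix.items():
        complete_somewhere = any(
            all(values[s] is not None for s in members)
            for members in groups.values()
        )
        if complete_somewhere:
            kept[protein] = values
    return kept


# Hypothetical design: two groups with two replicates each.
groups = {"control": ["c1", "c2"], "itp": ["t1", "t2"]}
matrix = {
    "P1": {"c1": 20.1, "c2": 19.8, "t1": None, "t2": None},   # complete in control
    "P2": {"c1": 18.0, "c2": None, "t1": None, "t2": 17.5},   # incomplete everywhere
    "P3": {"c1": 22.3, "c2": 22.0, "t1": 21.9, "t2": 22.4},   # complete in both
}

filtered = keep_if_complete_in_any_group(matrix, groups)
print(list(filtered))
```

A protein such as "P1", absent from one condition but consistently detected in the other, survives this filter. Proteins of this kind are the ones a left-censored imputation method then fills in.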

Data have been deposited to the ProteomeXchange Consortium via the PRIDE partner repository with the dataset identifier PXD028814.

The code for this part is in the prep_data.R and multiple_anova.R scripts; the output of the first is the input to the second. The starting data is ‘proteinGroups_mbr_v1.6.17.txt’, which can also be found in ProteomeXchange, along with the raw data.

Correlation-based network analysis

Weighted gene co-expression network analysis (WGCNA) was performed on the differentially expressed proteins (DEPs) using the ‘WGCNA’ R package⁷ and the tailored workflow designed by Wu et al. (2020) for proteomics data.⁸ A correlation coefficient threshold of 0.85 was chosen, and a soft-thresholding power of 26 was selected for the construction of the weighted correlation and adjacency matrices, based on the approximate scale-free topology criterion. Of note, a signed network was constructed, so that only positively correlated proteins are strongly connected. Next, the topological overlap matrix (TOM) and the corresponding distances (1 – TOM) were calculated from the adjacency matrix in a step-wise fashion. Proteins were hierarchically clustered (average linkage) based on the TOM distance, and an optimal set of modules was determined with a dynamic tree-cutting algorithm, requiring a minimum of 10 proteins per module, followed by merging of close clusters (cut height of 0.1). Additionally, module eigenproteins (ME) were computed as the first principal component of each module, obtained by singular value decomposition, and signed module memberships (kME) were obtained. Finally, module-trait (Pearson) correlations were calculated.
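To make the signed adjacency and the TOM distance concrete, here is a small Python sketch on three toy abundance profiles. It assumes the usual signed adjacency, ((1 + cor) / 2) raised to the soft-thresholding power, and the standard unsigned topological overlap formula. The text does not state which TOM variant was used, so that choice and all names and values below are illustrative assumptions. The actual analysis uses the ‘WGCNA’ R package.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def signed_adjacency(profiles, power):
    """a_ij = ((1 + cor_ij) / 2) ** power, with a zero diagonal."""
    n = len(profiles)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                adj[i][j] = ((1 + pearson(profiles[i], profiles[j])) / 2) ** power
    return adj


def tom_distance(adj):
    """1 - TOM, where TOM_ij = (sum_u a_iu*a_uj + a_ij) / (min(k_i, k_j) + 1 - a_ij)."""
    n = len(adj)
    k = [sum(row) for row in adj]  # connectivity; the diagonal is zero
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            shared = sum(adj[i][u] * adj[u][j] for u in range(n) if u not in (i, j))
            tom = (shared + adj[i][j]) / (min(k[i], k[j]) + 1 - adj[i][j])
            dist[i][j] = 1 - tom
    return dist


# Toy profiles: the first two rise together, the third falls.
profiles = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
]
adjacency = signed_adjacency(profiles, power=26)
distance = tom_distance(adjacency)
print(adjacency[0][1], adjacency[0][2], distance[0][1], distance[0][2])
```

The two perfectly correlated profiles get an adjacency of 1 and a TOM distance of 0. The anti-correlated profile gets an adjacency of 0 and a distance of 1, which is the behaviour expected of a signed network.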

The code for this part can be found in the wgcna_analysis.R script; the accompanying metadata files are ‘groups_wgcna.xlsx’ and ‘groups_traits.xlsx’. The input data come from the output of the multiple_anova.R script.

Annotation, gene ontology (GO) and pathway enrichment

UniProt IDs were mapped to their respective Entrez Gene IDs, symbols and gene names using the ‘AnnotationDbi’⁹ (version 1.52.0) and ‘org.Mm.eg.db’¹⁰ (version 3.12.0) R packages. Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG) and Reactome enrichments on gene lists were obtained using the ‘goseq’¹¹ (version 1.36.0) and ‘msigdbr’¹² (version 7.2.1) R packages. Intensity bias was taken into account, and the enrichment was calculated using the Wallenius approximation. The resulting p-values were adjusted for multiple testing using the Benjamini-Hochberg correction, and an adjusted p-value of <0.05 was considered significant. Bar plots were constructed with the ‘ggplot2’¹³, ‘dplyr’¹⁴, ‘stringr’¹⁵ and ‘wesanderson’¹⁶ R packages.
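The Benjamini-Hochberg adjustment used for the significance cut-off can be sketched in a few lines of Python. This is a generic illustration of the procedure with made-up p-values, not the code in enrichment.R. In R the same adjustment is available through p.adjust with method = "BH".

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up raw p-values for four hypothetical gene sets.
raw = [0.01, 0.04, 0.03, 0.20]
adj = benjamini_hochberg(raw)
significant = [i for i, p in enumerate(adj) if p < 0.05]
print(adj, significant)
```

In this toy input only the first gene set stays below 0.05 after adjustment, even though three of the raw p-values were below that threshold.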

The code for this last part can be found in the enrichment.R and vis_enrich.R scripts. The former performs the enrichment analysis, while the latter generates the figures. Both scripts need to be run once for every module/color.

References

1. Cox J, Mann M. MaxQuant enables high peptide identification rates, individualized ppb-range mass accuracies and proteome-wide protein quantification. Nature Biotechnology. 2008;26(12):1367–1372.

2. Sinitcyn P, Tiwary S, Rudolph J, et al. MaxQuant goes Linux. Nature Methods. 2018;15(6):401–401.

3. Cox J, Neuhauser N, Michalski A, et al. Andromeda: A peptide search engine integrated into the MaxQuant environment. Journal of Proteome Research. 2011;10(4):1794–1805.

4. R Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2020.

5. Lazar C. imputeLCMD: A collection of methods for left-censored missing data imputation. 2015.

6. Ritchie ME, Phipson B, Wu D, et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research. 2015;43(7):e47–e47.

7. Langfelder P, Horvath S. WGCNA: An R package for weighted correlation network analysis. BMC Bioinformatics. 2008;9(1):1–13.

8. Wu JX, Pascovici D, Wu Y, Walker AK, Mirzaei M. Workflow for rapidly extracting biological insights from complex, multicondition proteomics experiments with WGCNA and PloGO2. Journal of Proteome Research. 2020;19(7):2898–2906.

9. Pagès H, Carlson M, Falcon S, Li N. AnnotationDbi: Manipulation of SQLite-based annotations in Bioconductor. 2020.

10. Carlson M. org.Mm.eg.db: Genome wide annotation for mouse. 2020.

11. Young MD, Wakefield MJ, Smyth GK, Oshlack A. Gene ontology analysis for RNA-seq: Accounting for selection bias. Genome Biology. 2010;11:R14.

12. Dolgalev I. msigdbr: MSigDB gene sets for multiple organisms in a tidy data format. 2020.

13. Wickham H. ggplot2: Elegant graphics for data analysis. Springer-Verlag New York; 2016.

14. Wickham H, François R, Henry L, Müller K. dplyr: A grammar of data manipulation. 2022.

15. Wickham H. stringr: Simple, consistent wrappers for common string operations. 2019.

16. Ram K, Wickham H. wesanderson: A Wes Anderson palette generator. 2018.
