MS-EmpiRe

Mass Spectrometry analysis using Empirical and Replicate based statistics.

MS-EmpiRe is an R package for quantitative analysis of mass spectrometry proteomics data. It allows highly sensitive and specific identification of differentially abundant proteins between experimental conditions.

Installation

Dependencies

MS-EmpiRe requires the R package Biobase from Bioconductor. Biobase can be installed from the R command line using the following commands:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("Biobase")

Installing MS-EmpiRe

You can install MS-EmpiRe directly from GitHub using the R package devtools.

Installing devtools:

install.packages("devtools")

Loading devtools:

library(devtools)

Installing MS-EmpiRe:

install_github("zimmerlab/MS-EmpiRe")

Loading MS-EmpiRe:

library(msEmpiRe)

Getting started

Quickstart

The file example.R shows an example analysis workflow for simple table input data. The first column of the table contains the peptide/protein id, which is encoded as follows: proteinID.peptideID. The remaining columns contain the measurements for replicate samples from two conditions.
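To make this layout concrete, the sketch below builds a small table of that shape by hand. The column names (id, c1.rep1, and so on) and all intensities are invented for illustration and are not taken from example.R.

```r
# Hypothetical toy table: one row per peptide, ids encoded as proteinID.peptideID.
# Column names and intensities are invented for illustration only.
toy <- data.frame(
  id      = c("P1.pepA", "P1.pepB", "P2.pepA"),
  c1.rep1 = c(1200, 800, 450),
  c1.rep2 = c(1100, 760, 500),
  c2.rep1 = c(2300, 1500, 470),
  c2.rep2 = c(2500, 1650, 430),
  stringsAsFactors = FALSE
)

# First column: peptide/protein id. Remaining columns: replicate measurements
# for two conditions (here two replicates each).
print(toy)
```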

Reading input

MS-EmpiRe currently offers the following two functions to read data from your quantitative proteomics setup:

read.standard(table, sample.mapping, signal_pattern, prot.id.col, prot.id.generator) for simple tables
read.MaxQuant(peptides, sample.mapping) for output generated by MaxQuant

Both functions return an ExpressionSet object (part of the Biobase package) which can be used for further analysis.

table has to be a table containing one row per peptide. Each row has to contain at least the measured signals for each sample/replicate. Any additional columns will be stored in the feature data slot of the ExpressionSet object.

signal_pattern has to be a regular expression that matches only the columns that contain measurements.

Either prot.id.col or prot.id.generator can be used to determine the peptide to protein mapping. prot.id.col has to be a column that already contains the protein id for each peptide. prot.id.generator should be a function (for example an anonymous function) that extracts the protein id from the peptide id column, e.g. if the peptide ids follow the pattern proteinID.peptideID as in the example.
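The two arguments that are easiest to get wrong can be checked in plain R before calling read.standard. The following is a minimal sketch under stated assumptions: the peptide ids, the column names, the helper name prot_from_pep and the regular expression are all made up for this example, and protein ids are assumed to contain no dots.

```r
# Peptide ids in the proteinID.peptideID layout (invented values).
pep_ids <- c("P1.pepA", "P1.pepB", "P2.pepA")

# Candidate prot.id.generator: keep everything before the first dot.
# Assumes that protein ids themselves contain no dots.
# sub() is vectorized, so this works for a single id or a whole column.
prot_from_pep <- function(pep) sub("\\..*$", "", pep)
print(prot_from_pep(pep_ids))

# Candidate signal_pattern: it should match the measurement columns
# and nothing else (here, it must not match the "id" column).
col_names <- c("id", "c1.rep1", "c1.rep2", "c2.rep1", "c2.rep2")
signal_pattern <- "^c[0-9]+\\.rep[0-9]+$"
print(col_names[grepl(signal_pattern, col_names)])
```

Once both checks look right, the values would be passed on as prot.id.generator = prot_from_pep and signal_pattern = signal_pattern.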

sample.mapping has to be a table containing two columns, named sample and condition. It is used to determine which samples are replicates for which condition.
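A sample mapping of that shape could look as follows. The sample and condition names are hypothetical, and the assumption that sample names correspond to the measurement columns of the input table is ours.

```r
# Hypothetical sample mapping with the two required columns.
# Sample names are assumed to correspond to the measurement columns
# of the input table.
sample_mapping <- data.frame(
  sample    = c("c1.rep1", "c1.rep2", "c2.rep1", "c2.rep2"),
  condition = c("c1", "c1", "c2", "c2"),
  stringsAsFactors = FALSE
)

# Number of replicates per condition.
print(table(sample_mapping$condition))
```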

Note: read.MaxQuant currently does not generate a peptide to protein mapping since we want to use the mapping from the proteinGroups.txt MaxQuant output (see Data filtering).

Data filtering

MS-EmpiRe draws its power from replicate measurements. We therefore suggest removing peptides which were not measured in multiple replicates per condition. With the function filter_detection_rate(data, rate=2) one can remove all peptides which were detected in fewer than rate replicates per condition.
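The criterion itself can be illustrated in base R. This is a sketch of the idea only, not the package's implementation. It assumes that a value of 0 or NA means "not detected" and that a peptide must reach the rate in every condition; the matrix and all names are invented.

```r
# Toy intensity matrix: rows are peptides, columns are samples (invented values).
# Treating 0 and NA as "not detected" is an assumption of this sketch.
x <- matrix(
  c(10, 12,  9, 11,
     0,  8,  7,  0,
     5,  6,  0,  0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("P1.pepA", "P1.pepB", "P2.pepA"),
                  c("c1.rep1", "c1.rep2", "c2.rep1", "c2.rep2"))
)
condition <- c("c1", "c1", "c2", "c2")
rate <- 2

detected <- !is.na(x) & x > 0

# Count detections per peptide within each condition.
counts <- sapply(unique(condition), function(cond) {
  rowSums(detected[, condition == cond, drop = FALSE])
})

# Keep peptides detected in at least `rate` replicates of every condition.
keep <- rowSums(counts >= rate) == ncol(counts)
print(x[keep, , drop = FALSE])
```

In this toy data only P1.pepA is kept: P1.pepB is seen once per condition and P2.pepA is missing entirely from the second condition.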

If the input data was generated by MaxQuant and read by read.MaxQuant, we suggest additionally using the function filter_MaxQuant(data, proteinGroups). It requires the proteinGroups.txt file which is usually generated by MaxQuant. Based on this file, the peptide to protein mapping is created. Furthermore, proteins with undesired features like "reverse" or "contaminant" are removed.
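For MaxQuant data, the reading and filtering steps could be chained roughly as below. The wrapper load_maxquant is hypothetical. The calls use only the signatures given in this README; the order of the two filters is our choice, and whether the inputs are file paths or already loaded tables is not specified here, so they are passed through unchanged.

```r
# Hypothetical wrapper around the MaxQuant route described above.
# Only the function is defined here; nothing is read until it is called.
load_maxquant <- function(peptides, sample.mapping, proteinGroups, rate = 2) {
  if (!requireNamespace("msEmpiRe", quietly = TRUE)) {
    stop("msEmpiRe is not installed")
  }
  data <- msEmpiRe::read.MaxQuant(peptides, sample.mapping)
  # Creates the peptide to protein mapping and removes proteins
  # flagged as "reverse" or "contaminant".
  data <- msEmpiRe::filter_MaxQuant(data, proteinGroups)
  # Removes peptides detected in fewer than `rate` replicates per condition.
  msEmpiRe::filter_detection_rate(data, rate = rate)
}
```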

Normalization

To correct for sample-specific biases, MS-EmpiRe ships with a normalization method that minimizes the changes between replicate measurements for each peptide. A more detailed description of the method can be found in [1]. It can be accessed using the function normalize(data, out.dir=NULL). data has to be an ExpressionSet object (preferably from one of the two input functions, see section Reading input). If out.dir has a value different from NULL, MS-EmpiRe will create detailed plots of the data (before and after normalization) inside out.dir. The returned object is an ExpressionSet that contains the normalized values in the exprs slot.
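A call with plotting enabled might be wrapped as in the sketch below. The helper normalize_with_plots and its default directory are hypothetical. Creating the directory beforehand is a precaution, since it is not stated whether normalize creates it.

```r
# Hypothetical helper: normalize and write the diagnostic plots to plot_dir.
# Only the function is defined here; it needs a filtered ExpressionSet to run.
normalize_with_plots <- function(data,
                                 plot_dir = file.path(tempdir(), "msempire_plots")) {
  if (!requireNamespace("msEmpiRe", quietly = TRUE)) {
    stop("msEmpiRe is not installed")
  }
  # Precaution: make sure the output directory exists.
  dir.create(plot_dir, showWarnings = FALSE, recursive = TRUE)
  msEmpiRe::normalize(data, out.dir = plot_dir)
}

# Usage, given a filtered ExpressionSet `data`:
# normalized <- normalize_with_plots(data)
# head(Biobase::exprs(normalized))  # normalized values from the exprs slot
```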

Differential Analysis

For the detection of differential proteins, run the function de.ana(data), where data is the ExpressionSet after filtering and normalization. de.ana returns a data frame with one row per protein. Protein IDs can be accessed from the column prot.id. The p-value after outlier correction (see [1]) is named p.val; the corresponding value after multiple testing correction (Benjamini-Hochberg) is named p.adj. The columns prot.p.val and prot.p.adj contain the respective values before outlier correction. The protein level (log2) fold change estimate is named log2FC.
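A typical next step is to select significant proteins from this table. The sketch below uses a mock data frame with the column names described above instead of real de.ana output. All values and both cutoffs are invented. The Benjamini-Hochberg step uses base R's p.adjust to show what the correction does; it is not meant as a statement about how de.ana computes p.adj internally.

```r
# Mock result table using the documented column names (values invented).
res <- data.frame(
  prot.id = c("P1", "P2", "P3", "P4"),
  log2FC  = c(1.8, -0.2, -1.1, 0.4),
  p.val   = c(0.001, 0.40, 0.012, 0.03),
  stringsAsFactors = FALSE
)

# Benjamini-Hochberg adjustment as provided by base R.
res$p.adj <- p.adjust(res$p.val, method = "BH")

# Hypothetical cutoffs: 5% FDR and at least a two-fold change.
hits <- res[res$p.adj < 0.05 & abs(res$log2FC) >= 1, ]
print(hits[order(hits$p.adj), c("prot.id", "log2FC", "p.adj")])
```

With these mock values P1 and P3 pass both cutoffs, while P4 is significant after adjustment but falls below the fold change threshold.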

License

MS-EmpiRe is released under the GNU Affero General Public License. See LICENSE for further details.

References

[1] Ammar, C.*, Gruber, M.*, Csaba, G.*, Zimmer, R. (2019).
    MS-EmpiRe Utilizes Peptide-level Noise Distributions for Ultra-sensitive Detection of Differentially Expressed Proteins.
    Mol. Cell Proteomics, 18(9), 1880-92.  doi:10.1074/mcp.RA119.001509
