Mass Spectrometry analysis using Empirical and Replicate based statistics.
MS-EmpiRe is an R package for quantitative analysis of mass spectrometry proteomics data. It allows highly sensitive and specific identification of differentially abundant proteins between experimental conditions.
MS-EmpiRe requires the R package Biobase from Bioconductor. Biobase can be installed from the R command line using the following commands:
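# biocLite() has been retired by Bioconductor; BiocManager is its replacement
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("Biobase")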
You can install MS-EmpiRe directly from GitHub using the R package devtools.
Installing devtools:
install.packages("devtools")
Loading devtools:
library(devtools)
Installing MS-EmpiRe:
install_github("zimmerlab/MS-EmpiRe")
Loading MS-EmpiRe:
library(msEmpiRe)
The file example.R shows an example analysis workflow for simple table input data. The first column of the table contains the peptide/protein id, which is encoded as follows: proteinID.peptideID. The remaining columns contain the measurements for replicate samples from two conditions.
MS-EmpiRe currently offers the following two functions to read data of your quantitative proteomics setup:
read.standard(table, sample.mapping, signal_pattern, prot.id.col, prot.id.generator) for simple tables
read.MaxQuant(peptides, sample.mapping) for output generated by MaxQuant
Both functions return an ExpressionSet object (part of the Biobase package) which can be used for further analysis.
table has to be a table containing one row per peptide. Each row has to contain at least the measured signals for each sample/replicate. Any additional columns will be stored in the feature data slot of the ExpressionSet class.
signal_pattern has to be a regular expression that matches only the columns that contain measurements.
Either prot.id.col or prot.id.generator can be used to determine the peptide to protein mapping. prot.id.generator should be a function that extracts the protein id from the peptide id column (e.g. if the peptide ids follow the pattern proteinID.peptideID, as in the example). prot.id.col has to be a column that already contains the protein id for each peptide.
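As a minimal sketch in base R (it does not call MS-EmpiRe itself), the following shows what a suitable signal_pattern and prot.id.generator could look like. The column names, the toy values and the choice to split at the last dot are assumptions made for illustration; only the proteinID.peptideID scheme comes from the description above.

```r
# Toy peptide table: one row per peptide. Column names and values are invented.
peptides <- data.frame(
  id           = c("P1.pepA", "P1.pepB", "P2.pepA"),
  Intensity.A1 = c(10.1, 11.3, 8.2),
  Intensity.A2 = c(10.4, 11.0, 8.5),
  Intensity.B1 = c(12.0, 13.1, 8.1),
  Intensity.B2 = c(12.2, 12.9, 8.4),
  charge       = c(2, 3, 2),  # an additional, non-measurement column
  stringsAsFactors = FALSE
)

# A regular expression that matches only the measurement columns.
signal_pattern <- "^Intensity\\."
signal_cols <- grep(signal_pattern, colnames(peptides), value = TRUE)

# A function that derives the protein id from a proteinID.peptideID string
# by dropping the last dot and everything after it (an assumed convention).
prot.id.generator <- function(pep) sub("\\.[^.]*$", "", pep)
protein_ids <- prot.id.generator(peptides$id)
```

Checking the pattern with grep before passing it on is a cheap way to confirm that annotation columns such as charge are not picked up as measurements.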
sample.mapping has to be a table containing two columns, named sample and condition. It is used to determine which samples are replicates for which condition.
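A sample mapping with the two required columns might look like the sketch below. The sample and condition names are hypothetical, and it is an assumption here that the sample names correspond to the measurement column names of the peptide table.

```r
# Hypothetical sample mapping: two conditions with two replicates each.
# The sample names are assumed to match the measurement column names.
sample.mapping <- data.frame(
  sample    = c("Intensity.A1", "Intensity.A2", "Intensity.B1", "Intensity.B2"),
  condition = c("A", "A", "B", "B"),
  stringsAsFactors = FALSE
)

# Group the samples by condition to see which ones are treated as replicates.
replicates <- split(sample.mapping$sample, sample.mapping$condition)
```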
Note: read.MaxQuant currently does not generate a peptide to protein mapping since we want to use the mapping from the proteinGroups.txt MaxQuant output (see Data filtering).
MS-EmpiRe draws its power from replicate measurements. We therefore suggest removing peptides which were not measured in multiple replicates per condition. With the function filter_detection_rate(data, rate=2) one can remove all peptides which were detected in fewer than rate replicates per condition.
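The idea behind such a filter can be sketched in base R on a plain matrix. This is an illustration only, not the implementation of filter_detection_rate: the helper name keep_by_detection_rate is hypothetical, and it is an assumption that a missing value is encoded as NA or 0 and that a peptide has to reach the rate in every condition.

```r
# Toy intensity matrix (rows = peptides, columns = samples); values are invented.
signals <- matrix(
  c(10.1, 10.4, 12.0, 12.2,
    11.3,   NA, 13.1,   NA,
     8.2,  8.5,   NA,   NA),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("P1.pepA", "P1.pepB", "P2.pepA"),
                  c("A1", "A2", "B1", "B2"))
)
condition <- c(A1 = "A", A2 = "A", B1 = "B", B2 = "B")

# Hypothetical helper: TRUE for peptides detected in at least `rate`
# replicates of every condition.
keep_by_detection_rate <- function(signals, condition, rate = 2) {
  # Assumption: NA or 0 means "not detected".
  detected <- !is.na(signals) & signals > 0
  # One column of detection counts per condition.
  counts <- do.call(cbind, lapply(unique(condition), function(cond) {
    rowSums(detected[, condition == cond, drop = FALSE])
  }))
  apply(counts >= rate, 1, all)
}

keep <- keep_by_detection_rate(signals, condition, rate = 2)
filtered <- signals[keep, , drop = FALSE]
```

In this toy example only P1.pepA is retained, since the other two peptides have fewer than two detections in at least one condition.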
If the input data was generated by MaxQuant and read by read.MaxQuant, we suggest additionally using the function filter_MaxQuant(data, proteinGroups). It requires the proteinGroups.txt file which is usually generated by MaxQuant. Based on this file, the peptide to protein mapping is created. Furthermore, proteins with undesired features like "reverse" or "contaminant" are removed.
To correct for sample specific biases, MS-EmpiRe ships with a normalization method that minimizes the changes between replicate measurements for each peptide. A more detailed description of the method can be found in [1]. It can be accessed using the function normalize(data, out.dir=NULL). data has to be an ExpressionSet type object (preferably from one of the two input generation functions, see section Reading Input). If out.dir has a value different from NULL, MS-EmpiRe will create detailed plots for the data (before and after normalization) inside out.dir. The returned object is an ExpressionSet that contains the normalized values in the exprs slot.
For the detection of differential proteins, run the function de.ana(data) where data is the ExpressionSet after filtering and normalization. de.ana returns a data frame with one row per protein. Protein IDs can be accessed from the column prot.id. The p-value after outlier correction (see [1]) is named p.val, the corresponding value after multiple testing correction (Benjamini-Hochberg) is named p.adj. The columns prot.p.val and prot.p.adj contain the respective values before outlier correction. The protein level (log2) fold change estimate is named log2FC.
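To illustrate how such a result table might be used downstream, the sketch below works on a mock data frame with the column names described above. The values and the thresholds (0.05 and an absolute log2 fold change of 1) are invented for illustration. The Benjamini-Hochberg step uses p.adjust from base R purely to show what the adjustment does; de.ana already returns p.adj, and its internal computation may differ.

```r
# Mock result table with the documented column names; all values are invented.
result <- data.frame(
  prot.id = c("P1", "P2", "P3", "P4"),
  log2FC  = c(1.8, -0.1, -2.3, 0.4),
  p.val   = c(0.001, 0.60, 0.004, 0.03),
  stringsAsFactors = FALSE
)

# Benjamini-Hochberg adjustment as provided by base R's stats package.
result$p.adj <- p.adjust(result$p.val, method = "BH")

# Hypothetical thresholds: adjusted p-value below 0.05 and |log2FC| of at least 1.
hits <- result[result$p.adj < 0.05 & abs(result$log2FC) >= 1, ]
```

Combining a significance cutoff with a fold change cutoff is a common convention rather than something MS-EmpiRe prescribes; suitable thresholds depend on the experiment.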
MS-EmpiRe is released under the GNU Affero General Public License. See LICENSE for further details.
[1] Ammar, C.*, Gruber, M.*, Csaba, G.*, Zimmer, R. (2019).
MS-EmpiRe Utilizes Peptide-level Noise Distributions for Ultra-sensitive Detection of Differentially Expressed Proteins.
Mol. Cell Proteomics, 18(9), 1880-92. doi:10.1074/mcp.RA119.001509