This R package provides an implementation of the MaxLFQ algorithm by Cox et al. (2014) in a comprehensive pipeline for DIA-MS (Pham et al. 2020). It also offers options for protein quantification using the N most intense fragment ions, using all fragment ions, and the Tukey's median polish algorithm. In general, the tool can be used to integrate multiple proportional observations into a single quantitative value.
Citation
Pham TV, Henneman AA, Jimenez CR. iq: an R package to estimate relative protein abundances from ion quantification in DIA-MS-based proteomics, Bioinformatics 2020 Apr 15;36(8):2611-2613. https://doi.org/10.1093/bioinformatics/btz961
The package is hosted on CRAN. It is best to install from within R.
install.packages("iq")
The package can be loaded in the usual manner
library("iq")
See a recent example for processing a Spectronaut output.
Or an older vignette for processing output from Spectronaut, OpenSWATH and MaxQuant with some visualization.
To process a DIA-NN output
process_long_format("report.tsv",
output_filename = "report-pg-global.tsv",
annotation_col = c("Protein.Names", "Genes"),
filter_double_less = c("Global.Q.Value" = "0.01", "Global.PG.Q.Value" = "0.01"))
To process a Spectronaut output
Use this export schema iq.rs to make a long report, for example "Spectronaut_Report.xls".
process_long_format("Spectronaut_Report.xls",
output_filename = "iq-MaxLFQ.tsv",
sample_id = "R.FileName",
primary_id = "PG.ProteinGroups",
secondary_id = c("EG.Library", "FG.Id", "FG.Charge", "F.FrgIon", "F.Charge", "F.FrgLossType"),
intensity_col = "F.PeakArea",
annotation_col = c("PG.Genes", "PG.ProteinNames", "PG.FastaFiles"),
filter_string_equal = c("F.ExcludedFromQuantification" = "False"),
filter_double_less = c("PG.Qvalue" = "0.01", "EG.Qvalue" = "0.01"),
log2_intensity_cutoff = 0)
Processing a DIA-NN output
We processed the Bruderer15 dataset (Bruderer et al., MCP 2015) using DIA-NN 1.8.1 (Demichev et al., Nature Methods 2020) with the switch --report-lib-info
turned on. The switch gives an extra column Fragment.Info
to test iq's new feature. Usually, the user does not need to turn it on.
Download DIA-NN output report.zip and unzip the file to obtain the long format ouput report.tsv
.
A protein group report for downstream analysis can be obtained with a single statement in R
library(iq)
process_long_format("report.tsv", output_filename = "report-pg.tsv")
As an aside, usually the user wants to include additional annotation columns in the final report. This is possible by specifying the annotation_col
parameter.
process_long_format("report.tsv", output_filename = "report-pg-annotated.tsv",
annotation_col = c("Protein.Names", "Genes"))
Also, by default, we filter the report at 1% run-sepcific FDR at both precursor level and protein group level. To filter global FDR (available with a recent version of DIA-NN), use the parameter filter_double_less
.
process_long_format("report.tsv", output_filename = "report-pg-global.tsv",
annotation_col = c("Protein.Names", "Genes"),
filter_double_less = c("Global.Q.Value" = "0.01", "Global.PG.Q.Value" = "0.01"))
We continue with loading the result file report-pg.tsv
to check the quantitative values of the spike-in proteins. For convenience, we will use the spike-in protein names instead of the protein group names.
pg <- read.delim("report-pg.tsv")
rownames(pg) <- pg$Protein.Group
spike_ins <- c("P02754", "P80025", "P00921", "P00366", "P02662", # mix 1
"P61823", "P02789", "P12799", "P02676", "P02672", # mix 2
"P02666", "P68082") # mix 3
# change rownames to spike_ins names
for (i in 1: length(spike_ins)) {
rownames(pg)[grep(spike_ins[i], rownames(pg))] <- spike_ins[i]
}
Here is the ground truth for proteins in Mix 1 (P02754, P80025, P00921, P00366, P02662). Each sample was measured in triplicate.
Sample | relative | fmol/ul |
---|---|---|
1 | 1 | 1.5 |
2 | 1.1 | 1.65 |
3 | 1.21 | 1.815 |
4 | 1.33 | 1.995 |
5 | 10 | 15 |
6 | 11.01 | 16.515 |
7 | 12.11 | 18.165 |
8 | 13.33 | 19.995 |
matplot(t(pg[spike_ins[1:5], 2:25]), type = 'b', col = 1:5 , pch=19, lwd = 3,
ylab="log2 MaxLFQ", main = "Mix 1", xlab = "8 samples x 3 replicates")
legend("topleft", legend = spike_ins[1:5], col = 1:5, pch=19, bty = "n")
We see that the triplicates are consistent and the spikeins in first 4 samples (12 runs) are about 10 fold down as expected (log2(10) ~ 3.3).
Here is the ground truth of proteins in Mix 2 (P61823, P02789, P12799, P02676, P02672).
Sample | relative | fmol/ul |
---|---|---|
1 | 200 | 100 |
2 | 125.99 | 62.995 |
3 | 79.37 | 39.685 |
4 | 50 | 25 |
5 | 4 | 2 |
6 | 2.52 | 1.26 |
7 | 1.59 | 0.795 |
8 | 1 | 0.5 |
matplot(t(pg[spike_ins[6:10], 2:25]), type = 'b', col = 1:5 , pch=19, lwd = 3,
ylab="log2 MaxLFQ", main = "Mix 2", xlab = "8 samples x 3 replicates")
legend("topright", legend = spike_ins[6:10], col = 1:5, pch=19, bty = "n")
In this mix, proteins in the first 4 samples are higher. Fold change between sample 1 and sample 5 is 50, which is approximately 5.6 in log2 space.
Here is the ground truth of proteins in Mix 3 (P02666, P68082).
Sample | relative | fmol/ul |
---|---|---|
1 | 1 | 0.05 |
2 | 4 | 0.2 |
3 | 16 | 0.8 |
4 | 64 | 3.2 |
5 | 256 | 12.8 |
6 | 1024 | 51.2 |
7 | 4096 | 204.8 |
8 | 16384 | 819.2 |
matplot(t(pg[spike_ins[11:12], 2:25]), type = 'b', col = 1:6 , pch=19, lwd = 3,
ylab="log2 MaxLFQ", main = "Mix 3", xlab = "8 samples x 3 replicates")
legend("topleft", legend = spike_ins[11:12], col = 1:2, pch=19, bty = "n")
In this mix, the protein concentration increases 4 fold from sample 1 to sample 8. Thus, we expect a staircase shape with an increase of 2 in log2 space.
set.seed(0)
matplot(t(pg[sample(1:nrow(pg), 12), 2:25]), type = 'b', col = 1:6 , pch=19, lwd = 3,
ylab="log2 MaxLFQ", main = "Random 12 proteins", xlab = "8 samples x 3 replicates")
This is a set of 12 random proteins in the background. High abundance proteins show a consistent pattern while lower abundance proteins exhibits more variation and missing data, which is expected from mass spectrometry-based proteomics data.
Spectronaut output versus DIA-NN output
The iq package has a dataset of 12 spikein protein from the Bruderer 15 dataset processed by the Spectronaut software. We will turn this dataset into something like DIA-NN output to better understand the parameters. The following will produce a file spikeins-diann.tsv
in your workspace.
library(iq)
data("spikeins")
precursor <- paste(spikeins$EG.ModifiedSequence, spikeins$FG.Charge, sep = "_")
d <- NULL
for (p in unique(precursor)) {
b <- spikeins[precursor == p,]
for (s in unique(as.character(b$R.Condition))) {
ss <- b[b$R.Condition == s,]
d <- rbind(d, list(File.Name = s,
Protein.Group = as.character(ss$PG.ProteinGroups[1]),
Genes = as.character(ss$PG.Genes[1]),
Precursor.Id = p,
Fragment.Quant.Corrected = paste(ss$F.PeakArea, collapse = ";"),
Fragment.Info = paste(paste(ss$F.FrgIon, ss$F.Charge, sep="_"),
collapse = ";")))
}
}
write.table(d, "spikeins-diann.tsv", sep = "\t", row.names = FALSE, quote = FALSE)
Examine the file spikeins-diann.tsv
to see the difference with Spectronaut output, for example by typing head(spikeins)
in your R console.
This is an example code in standard iq processing. The result result$estimate
contains quantitative values of the 12 proteins. Note that the data has been normalized. So in the preprocessing, we set median normalization to FALSE
.
norm_data <- preprocess(spikeins, median_normalization = FALSE, pdf_out = NULL)
protein_list <- create_protein_list(norm_data)
result <- create_protein_table(protein_list)
The standard R processing for DIA-NN output is as follows. Because the default values were set to Spectronaut output, you will need to specify the columns values. Note that intensity_col
and intensity_col_id
contain ;
separated values.
raw <- read.delim("spikeins-diann.tsv")
norm_data2 <- preprocess(raw,
sample_id = "File.Name",
primary_id = "Protein.Group",
secondary_id = "Precursor.Id",
intensity_col = "Fragment.Quant.Corrected",
median_normalization = FALSE,
intensity_col_sep = ";",
intensity_col_id = "Fragment.Info")
protein_list2 <- create_protein_list(norm_data2)
result2 <- create_protein_table(protein_list2)
The result of processing DIA-NN output should be equal to that of processing Spectronaut data (up to machine numerical accuracy).
max(result$estimate - result2$estimate)
min(result$estimate - result2$estimate)
identical(rownames(result$estimate), rownames(result2$estimate))
identical(colnames(result$estimate), colnames(result2$estimate))
You can plot a particular protein to see all its observed fragments.
plot_protein(protein_list2$P00366, main = "Protein P00366")
You can also use the fast implementation of MaxLFQ. Again, the parameters need to be specified because the data has been filtered and normalized.
iq_dat <- fast_read("spikeins-diann.tsv",
sample_id = "File.Name",
primary_id = "Protein.Group",
secondary_id = "Precursor.Id",
intensity_col = "Fragment.Quant.Corrected",
intensity_col_sep = ";",
intensity_col_id = "Fragment.Info",
filter_string_equal = NULL,
filter_double_less = NULL,
annotation_col = NULL)
iq_norm_data <- fast_preprocess(iq_dat$quant_table,
median_normalization = FALSE)
result_fastest <- fast_MaxLFQ(iq_norm_data,
row_names = iq_dat$protein[, 1],
col_names = iq_dat$sample)
max(result$estimate - result_fastest$estimate)
min(result$estimate - result_fastest$estimate)
identical(rownames(result$estimate), rownames(result_fastest$estimate))
identical(colnames(result$estimate), colnames(result_fastest$estimate))
Finally, the data can be processed in one statement. Note that the parameter normalization
is not a logical (TRUE
or FALSE
), but a string "none"
or "median"
(default). This is to support different normalization methods in the future.
process_long_format("spikeins-diann.tsv", output_filename = "spikeins-diann-out.tsv",
filter_double_less = NULL,
normalization = "none")
The output spikeins-diann-out.tsv
should be equal to the existing R processing.
out <- read.delim("spikeins-diann-out.tsv")
rownames(out) <- out$Protein.Group
out <- out[rownames(result$estimate),]
max(result$estimate - out[, colnames(result$estimate)])
min(result$estimate - out[, colnames(result$estimate)])