rformassspectrometry/Spectra

Add a filterUniquePrecursorMz method

jorainer opened this issue · 4 comments

Add a filtering function filterUniquePrecursorMz that reduces a Spectra object: for each group of spectra with similar precursor m/z (given ppm and tolerance), keep only the one with the highest precursor intensity.

Could you elaborate on the difference from filterPrecursorMz()? I understand that with this new method you additionally want to keep the spectrum with the highest precursor intensity, but can't this be done in another way, for example with a function that does only that (say, filterPrecursorMaxIntensity())?

In other words, we already have a function that does something similar. Couldn't we offer a generic tool to use in combination with what we already have?

filterPrecursorMzValues and filterPrecursorMzRange take target m/z values to filter on: the first keeps spectra with matching precursor m/z, the second keeps spectra within a specified m/z range.

What I would like to have here is: assuming that fragment spectra were generated from the same ions (thus the spectra have similar precursor m/z values), select for each set the fragment spectrum with the highest precursor intensity. So, basically, I don't know the precursor m/z beforehand (as I would with filterPrecursorMzValues) but want to reduce the Spectra to one spectrum per unique precursor m/z.

So, the definition of the method would be:

setMethod("filterPrecursorMaxIntensity", "Spectra", function(object, ppm = 10, tolerance = 0))

and the function internally would use the MsCoreUtils::group function to group the spectra based on their precursor m/z values (given ppm and tolerance) and for each group return the one spectrum with the highest precursor intensity.
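
To make the grouping step concrete, here is a minimal base-R sketch of grouping sorted m/z values by their consecutive differences. Note that group_sorted is a hypothetical stand-in written for this illustration; it is not the MsCoreUtils::group implementation. It assumes that a value joins the group of its predecessor as long as the gap between them is at most tolerance plus ppm parts per million of the predecessor.

```r
## Hypothetical helper (illustration only): assign a group ID to each value
## of an increasingly sorted numeric vector. A new group starts whenever the
## gap to the previous value exceeds tolerance + ppm * value / 1e6.
group_sorted <- function(x, tolerance = 0, ppm = 10) {
    stopifnot(is.numeric(x), !anyNA(x), !is.unsorted(x))
    if (length(x) < 2L)
        return(seq_along(x))
    ## Allowed gap after each value except the last one.
    allowed <- tolerance + ppm * x[-length(x)] / 1e6
    cumsum(c(1L, diff(x) > allowed))
}

## Made-up precursor m/z values, already sorted.
pmz <- c(100.0000, 100.0005, 200.0000, 200.0100, 300.0000)
grp <- group_sorted(pmz, tolerance = 0, ppm = 10)
grp
```

With these values only the first two m/z values fall into the same group: at 10 ppm the allowed gap near m/z 100 is about 0.001, while the gap of 0.01 near m/z 200 is larger than the roughly 0.002 allowed there.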

I hope I explained it well enough...

OK, thanks. Now that I realise you don't have the precursor m/z values beforehand, it makes more sense. Please go ahead :-)

Small benchmark of two implementations of the function:

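filterPrecursorMaxIntensity <- function(x, tolerance = 0, ppm = 10) {
    pmz <- precursorMz(x)
    pmi <- precursorIntensity(x)
    ## Indices ordered by precursor m/z; spectra with NA precursor m/z
    ## (e.g. MS1 spectra) are not included in idx.
    idx <- order(pmz, na.last = NA)
    mz_grps <- MsCoreUtils::group(pmz[idx], tolerance = tolerance, ppm = ppm)
    if (any(duplicated(mz_grps))) {
        ## Always keep spectra without a precursor m/z.
        keep <- is.na(pmz)
        ## Per group, keep the spectrum with the highest precursor intensity.
        keep[vapply(split(idx, as.factor(mz_grps)),
                    function(z) {
                        if (length(z) == 1L) z
                        else z[which.max(pmi[z])]
                    }, integer(1L),
                    USE.NAMES = FALSE)] <- TRUE
        ## Logical subsetting preserves the original order of the spectra.
        x <- x[keep]
    }
    x
}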

filterUniquePrecursorMz <- function (querySpectra, minSize = 3)
{
    ## minSize is unused while the peak-count filter below stays commented out
    ## pg_filt = querySpectra[which (sapply (mz (querySpectra),length) >= minSize)]
    pg_filt <- querySpectra
    ## Order by precursor mass
    pg_filt = pg_filt[order (precursorMz (pg_filt))]
    ## Bin by precursor mass and remove duplicated precursor ions per bin (keep the higher intensity)
    pg_mat = cbind ("mz" = precursorMz (pg_filt), "intensity" = precursorIntensity (pg_filt))
    mz_grps = MsCoreUtils::group (pg_mat[,"mz"], tolerance=0, ppm=10)
    if (any (duplicated (mz_grps))) {
        no_dup = tapply (1:length (mz_grps), mz_grps, function (i) {
            if (length (i) == 1) { return (i) }
            else { return (i[which.max (pg_mat[i,"intensity"])]) }
        })
        pg_filt = pg_filt[no_dup]
    }
    return(pg_filt)
}

The second implementation additionally applies an internal filter on the number of fragment peaks. I would, however, suggest doing that outside of the function (for performance reasons and because it is cleaner for the user). This can easily be done with e.g. sps <- sps[lengths(sps) > 3]. Also, the second version silently re-orders the spectra, which can be confusing for users (they might not expect that), and it works only for Spectra objects containing only MS2 spectra.
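
The ordering point can be illustrated on plain vectors, without a Spectra object. The sketch below uses made-up precursor m/z and intensity values and a hand-written group assignment (an assumption standing in for the output of a 10 ppm grouping). It contrasts selecting through a logical mask over the original order with using the per-group winners directly, which follow the m/z-sorted order.

```r
## Made-up values: element 1 stands for an MS1 spectrum (no precursor).
pmz <- c(NA, 200.0000, 100.0000, 200.0010, 100.0005)
pmi <- c(NA, 50, 10, 80, 30)

## Indices sorted by precursor m/z, dropping NA entries.
idx <- order(pmz, na.last = NA)
## Assumed group IDs for pmz[idx] (hand-written for this illustration).
grp <- c(1L, 1L, 2L, 2L)

## Per group, pick the original index with the highest precursor intensity.
best <- vapply(split(idx, grp), function(z) z[which.max(pmi[z])],
               integer(1L), USE.NAMES = FALSE)

## Mask approach: entries without precursor m/z are kept, order is unchanged.
keep <- is.na(pmz)
keep[best] <- TRUE
which(keep)

## Using the winners directly yields them in ascending precursor m/z order.
best
```

Here the mask keeps elements 1, 4 and 5 in their original order, whereas best lists the winners as 5, 4, i.e. sorted by precursor m/z and without the element that lacks a precursor.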

library(Spectra)
library(microbenchmark)
fl <- system.file("TripleTOF-SWATH", "PestMix1_DDA.mzML",
                  package = "msdata")
sps_dda <- Spectra(fl)

## filter to MS2 data
tmp <- filterMsLevel(sps_dda, 2L)

microbenchmark(filterPrecursorMaxIntensity(tmp), filterUniquePrecursorMz(tmp))
Unit: milliseconds
                             expr      min       lq     mean   median       uq      max neval cld
 filterPrecursorMaxIntensity(tmp) 3.174594 3.261636 3.589510 3.408911 3.519393 15.88249   100  a 
     filterUniquePrecursorMz(tmp) 4.685477 4.848344 5.230798 5.089792 5.221355 12.24396   100   b