bigbio/quantms

proteomicsLFQ with new SVM results in UPS1 dataset

Closed this issue · 23 comments

Description of feature

@timosachsenberg @jpfeuffer @daichengxin we have also run it with these parameters:

feature_with_id_min_score = 0.25
feature_without_id_min_score = 0.75

Dataset PXD001819, Files: http://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomes/quantms-benchmark/PXD001819NEWSAGE/

Results Table:

image (7)

Previous UPS detected:

image (8)

Current UPS detected:

image (9)

This issue needs a discussion about the default parameters for MBR. It is related to #287.

@daichengxin can you provide a plot similar to the last one for MaxQuant?

I think the numbers don't tell you that much here. Can we somehow see how well the quantities match? e.g., if we only picked up noise in the old version the new one would be better. (and the other way around)

You are right, this is why I asked for the MQ version of the last plot. Our first version of the plot had a lot of noise quantities, and we have solved that, but we now have to find the right parameters for both thresholds. Judging by the results in #287, the current 0.25 and 0.75 look too stringent.

I'm now running both datasets with 0.10 and 0.90.

I agree. Can we do an automated evaluation maybe? We could even do a little step in nextflow and then launch the process multiple times with different parameters.
How long is the runtime with -resume in the last step?
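
A minimal sketch of what such a sweep could enumerate, assuming we only vary the two thresholds named at the top of this issue. The helper name `threshold_grid` and the candidate values are illustrative; how each combination gets passed to the pipeline is left open here.

```python
from itertools import product


def threshold_grid(with_id_scores, without_id_scores):
    """Return one parameter dict per combination of the two MBR thresholds."""
    return [
        {
            "feature_with_id_min_score": with_id,
            "feature_without_id_min_score": without_id,
        }
        for with_id, without_id in product(with_id_scores, without_id_scores)
    ]


# Candidate values taken from the combinations discussed in this thread.
grid = threshold_grid([0.10, 0.25], [0.75, 0.90])
for params in grid:
    print(params)
```

Each dict would correspond to one launch of the last step, so the evaluation script can be run once per result and the outputs compared side by side.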

But from my memory this actually looks a bit like the MQ results. 2500 amol was the turning point.

Except for the lowest concentration of course. In theory it does not make sense that you find more features in lower concentrations, if you didn't even find them in higher ones.
Unless the higher number of MS2 IDs in the higher concentrations, and the features consequently extracted from them, leads to increased pressure to link untargeted features in the lowest concentration. And the more runs at higher concentrations (relative to the concentration you are currently looking at) you have, the more features you try to link. Therefore the lowest concentration gets more "chances". Does this make sense?
I think in this case an FDR approach vs the same probability cutoff for every run would bring additional value.
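
To illustrate the difference, here is a rough sketch of a per-run FDR-style cutoff. It assumes each feature carries a score that can be read as the probability of being correct, which is an assumption about the SVM output and not something verified here. The function name `fdr_cutoff` and the example scores are made up.

```python
def fdr_cutoff(probabilities, alpha=0.05):
    """Pick the largest top-ranked set whose estimated FDR stays <= alpha.

    The estimated FDR of the top n features is the mean of (1 - p) over
    those n features. Returns (lowest accepted probability, n accepted),
    or (None, 0) if not even the best feature passes.
    """
    ranked = sorted(probabilities, reverse=True)
    expected_false = 0.0
    cutoff, accepted = None, 0
    for n, p in enumerate(ranked, start=1):
        expected_false += 1.0 - p
        if expected_false / n <= alpha:
            cutoff, accepted = p, n
    return cutoff, accepted


# Hypothetical scores for one run; a different run would get its own cutoff.
run_scores = [0.99, 0.98, 0.95, 0.60, 0.30]
print(fdr_cutoff(run_scores, alpha=0.05))
```

With a fixed probability cutoff every run is treated the same, whereas here a run with many borderline candidates ends up with a stricter effective threshold.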

I also agree with @timosachsenberg that we should do plots that look at the relative differences between each concentration, instead of just found features/proteins.
E.g. check the number of significantly deregulated proteins between samples.
(Although the higher number of found features in the lowest concentration remains a bit weird to me).

MaxQuant results provided by original paper :
PalombaA_pubmed_34038140_2.xlsx

edit: replaced P1-P9 with concentrations. @jpfeuffer

The same plot as given by the authors: https://europepmc.org/articles/PMC8280745/figure/fig2/. 2500 amol was the turning point. And our results look better than MQ at low concentrations?

image

Can you replace P1 and P9 with something reasonable that includes the amol concentration in the header of the MQ xlsx?

Test results. Results folder: http://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomes/quantms-benchmark/PXD001819NEWSAGE/proteomicslfq/

image

image

image

@timosachsenberg @jpfeuffer for the 0.10 and 0.75 run I used an intensity threshold of 1000. It actually looks much better than 0.10 and 0.90 with an intensity threshold of 10'000. What do you think?

Yes looks better imo. But we should really check expected fold changes.
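
For reference, the check meant here can be sketched as follows: for a spiked-in protein, the expected log2 fold change between two samples is the log2 ratio of the spiked amounts, and the error is observed minus expected. The function names and the intensities below are hypothetical; only the 2500 and 500 amol levels come from this thread.

```python
import math


def expected_log2fc(conc_a, conc_b):
    """Expected log2 fold change between two spike-in amounts (same unit)."""
    return math.log2(conc_a / conc_b)


def log2fc_error(intensity_a, intensity_b, conc_a, conc_b):
    """Observed minus expected log2 fold change for one protein."""
    observed = math.log2(intensity_a / intensity_b)
    return observed - expected_log2fc(conc_a, conc_b)


# Hypothetical intensities for one UPS protein at 2500 vs 500 amol.
error = log2fc_error(4.0e6, 1.0e6, 2500, 500)
print(round(expected_log2fc(2500, 500), 3), round(error, 3))
```

A negative error means the measured ratio is compressed relative to the spiked ratio, which is the pattern to look for if low-concentration quantities are overestimated.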

upsReval.zip

Here is my old script. In the beginning it includes some fixes for MSstats loading and plotting which are probably not necessary anymore.
I am not sure when I will get to it. Maybe Friday.

I will try to make it run today and let you know. Do you have the original figures from MQ?

It can read MQ files and PD files, too. I use the results from https://github.com/wombat-p

Can you provide me a direct link to those PD and MQ outputs from wombat-p? I'm now configuring your script.

Here are the results of your script with this data, @jpfeuffer:

plot_zoom_png

Looks good but I think not a large improvement to fold changes for comparisons with <=500 amol.

We are fine with that. I have the feeling we have greatly reduced the false-positive signals in MBR, and that is the major advantage. But as you said, nothing has happened in terms of the feature detection itself in the low concentrations.

I agree that it might be enough, since the fold changes did not get worse but the number of found features looks a bit better. I think MQ identified way fewer in the lower concentrations (which doesn't mean much, since they might actually be close to unquantifiable and, looking at the quants, maybe should stay unreported). We might actually be able to set a higher min threshold for unidentified features.
I am not sure if my MQ files are the best to use (old version and I don't know the settings used) but I can send them tomorrow.

It would be great if we could have some debug output on all traces of all features across a consensusFeature. Or the extracted areas.
@timosachsenberg we could also check the interpolation of quantities for traces that could not be fit (in the OpenSwath algos). I think you fixed something there but maybe this approach in general is not so great for our purpose?

I will close this issue in favor of #303. Please move the discussions about future improvements to MBR LFQ to that issue.

@ypriverol @timosachsenberg This was my old plot for MQ. I don't remember the version. But you can clearly see that if MQ finds a feature, it is usually correct in rel. quants. We have more proteins basically everywhere, but our quants are very off in the low concentrations. My interpretation is a significant overestimation of quants for features in lower concentrations. Basically, from 2500 amol and below there is no longer any difference in quants for a linked feature.
FDR might help, but maybe a more sensitive quantification is also necessary for those low-concentration features (since, as you can see, MQ is able to recover 3-4 proteins with sometimes close to correct rel. quants).
MQ_final

@jpfeuffer do you think the interpolation you mentioned could be an issue? It could be easy to check...