Nesvilab/FragPipe

Questions about diaTracer and endogenous peptides search

Closed this issue · 23 comments

- Describe the issue or question:
Hello,
I'm attempting to use the new diaTracer function in FragPipe with diaPASEF data for a no-enzyme peptidomics search; I'm attaching the log file here. diaTracer seemed to finish and mzML files were generated, and work then progressed to the MSFragger step, where it threw an OutOfMemoryError. Since this is DIA data, I believe I cannot split the data to resolve this. Is there a way to assign more memory or otherwise resolve this issue?

How do I restart this workflow without having to rerun the completed steps? Do I have to restart FragPipe and just load the created mzML files, or can I resume even further into the workflow somehow?

Thank you,
Andrew

log_2024-06-22_03-52-47.txt

fcyu commented

Is there a way to assign more memory or otherwise resolve this issue?

For the current version, unfortunately, no.

How do I restart this workflow without having to rerun the completed steps? Do I have to restart FragPipe again and can just load the created MzML files, or can I restart even further into the workflow somehow?

You could load diaTracer's mzML files as the DDA data type and the original .d folders as the DIA-Quant data type, then run FragPipe from scratch. It will skip the diaTracer spectral deconvolution.

Best,

Fengchao

So if I cannot give it more memory, I expect I'll keep running into the OutOfMemoryError. Would you recommend trying different MSFragger settings to reduce the memory load, or should I just not run so many data files in my study (79)?

Thank you again,
Andrew

fcyu commented

Your configuration gives a large search space:

num_enzyme_termini = 0
variable_mod_01 = 15.9949 M 2
variable_mod_02 = 42.0106 [^ 1
variable_mod_04 = -17.0265 nQ 1
variable_mod_05 = -18.0106 nE 1
variable_mod_07 = 0.98402 N 1
digest_min_length = 8
digest_max_length = 45

Maybe you could tighten some of these settings. I'm not sure what your sample is, so I can't tell what the best strategy would be.
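To see roughly why these settings blow up the search space, here is a back-of-envelope sketch (an editor's illustration; the protein length and proteome size below are generic assumptions, not values from this particular search):

```python
# Rough illustration of nonspecific (num_enzyme_termini = 0) search
# space growth. The protein length and proteome size are generic
# assumptions, not taken from this particular search.

def nonspecific_peptides(protein_length, min_len=8, max_len=45):
    """Count all substrings (candidate peptides) of one protein."""
    total = 0
    for length in range(min_len, min(max_len, protein_length) + 1):
        total += protein_length - length + 1
    return total

per_protein = nonspecific_peptides(400)  # one average ~400-residue protein
print(per_protein)                       # 14231 candidates from one protein

# Across a ~20,000-protein proteome that is on the order of 3e8
# candidates before variable modifications; each variable mod then
# multiplies the number of modified peptide forms further.
print(f"{per_protein * 20_000:,}")
```

Narrowing the digest length range or dropping variable mods shrinks this multiplicatively, which is why small parameter changes can make an infeasible search fit in memory.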

Best,

Fengchao

Samples are of endogenous serum peptides, hence the need for a no-enzyme search in MSFragger. So it is running out of memory because of the search space size rather than the number of ions? I'm thinking then to run the workflow twice but change the min-max length range for each (e.g., run 1 with lengths 8-19 and run 2 with lengths 20-45). I could then combine the DIA-NN results - do you think this would get me past the OutOfMemoryError here?

Many thanks,
Andrew

fcyu commented

So it is running out of memory because of the search space size rather than the number of ions?

Yes, it is mostly due to the large search space, not your LC-MS data size.

I'm thinking then to run the workflow twice but change the min-max length range for each (e.g., run 1 with lengths 8-19 and run 2 with lengths 20-45). I could then combine the DIA-NN results - do you think this would get me past the OutOfMemoryError here?

Yes, this should work. You could also remove

variable_mod_04 = -17.0265 nQ 1
variable_mod_05 = -18.0106 nE 1
variable_mod_07 = 0.98402 N 1
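If you do split the search by peptide length, the per-run DIA-NN reports can be concatenated afterwards. A minimal Python sketch (the file names and the merge-by-concatenation approach are my assumptions; real DIA-NN reports have many more columns, and both runs should have been filtered to the same FDR):

```python
import csv

def merge_reports(paths, out_path):
    """Concatenate TSV reports that share a header, keeping one header."""
    header = None
    writer = None
    with open(out_path, "w", newline="") as out:
        for path in paths:
            with open(path, newline="") as f:
                reader = csv.reader(f, delimiter="\t")
                file_header = next(reader)
                if header is None:
                    header = file_header
                    writer = csv.writer(out, delimiter="\t")
                    writer.writerow(header)
                elif file_header != header:
                    raise ValueError(f"Header mismatch in {path}")
                writer.writerows(reader)

# Hypothetical usage with the two length-range runs:
# merge_reports(["run_8to19/diann-output/report.tsv",
#                "run_20to45/diann-output/report.tsv"],
#               "combined_report.tsv")
```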

Best,

Fengchao

Hi Fengchao,

Looking at the end of the log (just before it crashed), I'm perplexed by the "Number of unique peptides" table shown... for a no-enzyme search, I would expect the largest number of unique peptides at a length of 8 and then fewer and fewer as length increases when parsing through a FASTA. But it is reporting around the same number (17 million) of peptides for lengths 8 through 40, after which it rolls off. Am I hitting a limitation (some maximum) in how MSFragger is parsing the FASTA here? I want to make sure it is making it through the entire FASTA and not arbitrarily stopping when it reaches some maximum at each length. Is there some sort of maximum FASTA file size that can be used in no-enzyme searches?

Thanks again,
Andrew

fcyu commented

What you describe holds for enzymatic digestion; with non-enzymatic digestion the trend is much flatter. For example, a protein of length > 9 yields only one fewer peptide of length 9 than of length 8. Also note that it is the number of unique peptides: peptides with the same sequence are collapsed.
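A toy illustration of this point (an editor's sketch; the sequences are made up):

```python
# For nonspecific digestion, a protein of length N has N - k + 1
# peptides of length k, so consecutive lengths give nearly equal
# counts; collapsing duplicate sequences flattens the curve further.

def unique_peptides(proteins, length):
    """Unique substrings of a given length across all sequences."""
    peps = set()
    for seq in proteins:
        for i in range(len(seq) - length + 1):
            peps.add(seq[i:i + length])
    return len(peps)

toy = "ABCDEFGHIJKLMNOP"  # 16 distinct "residues"
print(unique_peptides([toy], 8), unique_peptides([toy], 9))  # 9 8

# Repeated stretches collapse to a single unique peptide:
print(unique_peptides(["AAAAAAAAAA"], 8))  # 1
```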

Is there some sort of maximum FASTA file size that can be used in no-enzyme searches?

No.

Best,

Fengchao

Hi Fengchao,

Sharing back good news: by limiting the search length to the 8-27 range, I was able to finish the nonspecific-HLA-diaPASEF workflow. I'm curious (it wasn't discussed in the paper) why ProteinProphet was selected for FDR (validation) with the flags --sequential --prot 1?

It seems that the data are not being filtered at the protein level anyway, so why have --prot 1 at all? If I want each output file to be filtered at the 5% level, how would I make that change, since --sequential does not take a float value?

My project has endogenous peptides that generally result in a single peptide per protein ID. In this case, would PeptideProphet be a more effective tool, or is it really only good for low-res data (not Bruker diaPASEF data)?

Regards,
Andrew

fcyu commented

Glad to hear that it finally works for your data.

why ProteinProphet was selected for FDR (validation) with the flags --sequential --prot 1?

That's because for endogenous peptides, people normally don't care about the protein-level results, and the peptides have already been filtered at 1% peptide-level FDR. But yes, we might change it back to --prot 0.01 in the next release to make the result more conservative.

It seems that the data are not being filtered at the protein level anyway, so why have --prot 1 at all?

Because without --prot 1, it would use the default setting: --prot 0.01.

If I want each output file to be filtered at the 5% level, how would I make that change, since --sequential does not take a float value?

There are several FDR levels: --psm, --ion, --pep, and --prot. Adjust them as you want; see https://github.com/Nesvilab/philosopher/wiki/Filter
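For example, a standalone Philosopher invocation along these lines would filter at 5% at the PSM/ion/peptide levels while leaving the protein level open (a sketch based on the wiki page above; the file names are placeholders, so adjust to your own run):

```shell
# Sequential filtering at 5% PSM/ion/peptide FDR, with the protein
# level left effectively unfiltered (paths are placeholders).
philosopher filter --sequential --razor \
    --psm 0.05 --ion 0.05 --pep 0.05 --prot 1 \
    --pepxml interact.pep.xml --protxml combined.prot.xml
```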

My project has endogenous peptides that generally result in a single peptide per protein ID. In this case, would PeptideProphet be a more effective tool, or is it really only good for low-res data (not Bruker diaPASEF data)?

I am not sure I understand your question correctly, because "a single peptide per protein ID", "PeptideProphet being a more effective tool", and "only good for low-res data" seem to have no causal relationship.

Best,

Fengchao

Thanks again for the guidance today.

In the diaTracer paper, it is suggested that, in addition to the nonspecific-HLA-diaPASEF workflow creating a spectral library from the diaPASEF data, you can also point to an existing spectral library (from prior DDA runs) to be used together for annotating the DIA-NN quant output. Where in FragPipe would I point to the existing library to append with the one being generated from diaPASEF runs? Is it that optional reference box for a library on the DIA-NN tab or somewhere else?

Best regards,
Andrew

fcyu commented

You are welcome!

Where in FragPipe would I point to the existing library to append with the one being generated from diaPASEF runs? Is it that optional reference box for a library on the DIA-NN tab or somewhere else?

It is the Spectral library (optional) panel in the Quant (DIA) tab.

You could also load both your diaPASEF and ddaPASEF .d folders, specifying the diaPASEF runs as the DIA data type and the ddaPASEF runs as the DDA data type. FragPipe will then search both data types and build a "hybrid" library, which is used to perform the quantification during the DIA-NN step. This normally results in more quantified IDs than a DIA-only or DDA-only library.

Best,

Fengchao

Hello Fengchao,

Having run into earlier "out of memory" issues using diaTracer with MSFragger for no-enzyme searches w/ PTMs (acknowledging the caveats of a large search space), I've moved things to a new box with more RAM (384GB for now). I set up a workflow around 40h ago and it is still going (no out of memory errors as of yet), but it is still intensively computing (~80% of 55 threads) at the step where it is determining all possible modified peptides (after it pretty quickly calculates the table of unmodified peptides). Given the size of the search space, it isn't too surprising that it is taking a while. It has used up to 90% of available RAM (~350GB back down to 50% and then up again to 85%, etc) but hasn't moved past this step in ~37h. I'm happy to continue to wait for it if it is still working, though without any progress indication during this step, it is hard to know how long this will take. As it hasn't thrown any errors and is actively computing, is this a good indication that all's progressing and just needs more time, or is there some way to know that it is stuck (though actively computing)?

Best,
Andrew

fcyu commented

I set up a workflow around 40h ago and it is still going (no out of memory errors as of yet), but it is still intensively computing (~80% of 55 threads) at the step where it is determining all possible modified peptides (after it pretty quickly calculates the table of unmodified peptides).

This step generates all modified peptides. A good sign is that the CPU usage is still high, which means that MSFragger is still generating the peptides, not idle.

Given the size of the search space, it isn't too surprising that it is taking a while. It has used up to 90% of available RAM (~350GB back down to 50% and then up again to 85%, etc) but hasn't moved past this step in ~37h.

It seems that there is not much free memory left and Java is trying to free memory frequently, which is time-consuming. This may be why it is taking such a long time.

As it hasn't thrown any errors and is actively computing, is this a good indication that all's progressing and just needs more time, or is there some way to know that it is stuck (though actively computing)?

I think you could still wait since the CPU usage is still high. After that, MSFragger will write a huge peptide index to the hard drive, which also takes time.

You could check the CPU and disk I/O to see if MSFragger is stuck or still running. On the other hand, you could reduce your search space by simplifying some variable modifications and narrowing the peptide length range. I can take a look at your workflow if you like.

Best,

Fengchao

Great that CPU is a good indication that it is still working - it has run continuously between 65% and 100% utilization (been a good stress test of the new box). I thought that at 90% RAM it surely would have switched to paging, but it isn't really writing anything yet, not to the pagefile or elsewhere. I'm assuming it would show high disk I/O if it were doing "the-memory-shuffle", or maybe it is so infrequent that I just haven't caught it yet?

I can simplify the MSFragger params (easily break it into two or three ranges of AA lengths), but I mean to test what we can do with the workflow when not resource-limited. I admit I was sure 384GB would be more than plenty, so I was pretty shocked when it quickly used up 90%. Hopefully 768GB will do better (more RDIMMs on order), but I will split things up if needed.

Best,
Andrew

fcyu commented

Yes, more memory is generally better, but there are caveats. If the number of modified peptides is larger than the maximum integer in Java (~2 billion), the program will crash no matter how much memory is available, because it is a hard limit in Java. Some Java versions also choke when there is a lot of memory (we saw a weird error before, but are not sure if it is still true).
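To put numbers on that hard limit (an editor's sketch; the peptide counts are example figures, roughly on the scale discussed later in this thread):

```python
# Java array indices are 32-bit signed ints, so a single peptide
# index array cannot hold more than Integer.MAX_VALUE entries,
# no matter how much RAM the JVM is given.
JAVA_MAX_INT = 2**31 - 1          # 2,147,483,647

unmodified = 657_000_000          # example count of unmodified peptides
mod_ratio = 9.5                   # assumed modified:unmodified ratio
modified = int(unmodified * mod_ratio)

print(f"{modified:,} modified peptides")
print("exceeds Java's array-size limit:", modified > JAVA_MAX_INT)  # True
```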

One more thing: if you set many variable modifications on the same amino acid, MSFragger will be slow in generating the modified peptides.

Best,

Fengchao

Great feedback, thank you! Yeah, it has calculated 657 million unmodified peptides, and my PTM params give a ~9.5:1 ratio of modified to unmodified peptides, i.e., ~6 billion modified peptides expected. But, surprisingly, it hasn't crashed yet... still computing something.

If I'm understanding correctly, you're talking about the 2.1 billion maximum of a 32-bit integer? FragPipe v22 uses 64-bit Java, and the maximum 64-bit integer is 9.2e18. Is the code still using 32-bit integer definitions, or has this been updated to 64-bit integers so that it shouldn't crash (because of that, at least)?

Or should I kill the search...

Best,
Andrew

fcyu commented

The bit sizes of the primitive data types are always the same in Java. The 32-bit vs 64-bit versions only affect the maximum memory size.

Best,

Fengchao

fcyu commented

I don't suppose there's any easy way to switch to long data type instead (64-bit integer type)?

We also hope Java can use 64-bit for the array index, but, well.....

Best,

Fengchao

fcyu commented

wouldn't switching the array index to the long data type (one of Java's primitive data types) resolve the issue?

Theoretically, yes, but Java just uses a 32-bit int for array indices, which we can't change. I am pretty sure Google or ChatGPT will give you more technical details.

Best,

Fengchao

Hi Fengchao,

I killed the prior workflow, and it is now finishing up a broad no-PTM, no-enzyme search. I'd like to go back and do the PTM analysis, and I will break it up into small ranges of AA length to do my best to stay under 2 billion modified sequences. Since the GUI has no queue, is there an easy way to see the CLI commands that the GUI uses to run this workflow (to copy/edit in a batch script)?

Best,
Andrew

fcyu commented

Yes, FragPipe actually has a command-line interface: https://fragpipe.nesvilab.org/docs/tutorial_headless.html
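As a sketch, a headless run looks roughly like this (based on the tutorial linked above; the file names and resource values are placeholders - I believe the workflow file can be saved from your current GUI setup on the Workflow tab):

```shell
# Headless FragPipe run (paths and resource numbers are placeholders).
fragpipe --headless \
    --workflow my-noenzyme.workflow \
    --manifest my-files.fp-manifest \
    --workdir /path/to/output \
    --ram 300 --threads 55
```

Since it is a single command, queuing several length-range searches is just a matter of listing the invocations sequentially in a batch script.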

Best,

Fengchao

Hi Fengchao,
In my FragPipe workflow (diaPASEF/diaTracer to no-enzyme peptidomics MSFragger search and DIA-NN quant), I referenced a separate library (from DDA data, generated in Bruker PaSER) in the "Spectral library (optional)" field on the Quant (DIA) tab (as suggested above). DIA-NN crashed when trying to read that optional library file (error code -1073740791).

I've compared my DDA-generated library to the one that FragPipe made ("library.tsv"), and the only DIA-NN-required column that differed was RT: both the header name and the way RT was calculated differ.

I'm assuming it was the different names for the RT column headers that likely caused it to crash? I can change that, but would having RT values calculated differently prevent DIA-NN from properly using both libraries together?

Many thanks again for all your help.
Best,
Andrew

fcyu commented

I'm assuming it was the different names for the RT column headers that likely caused it to crash?

I am not quite sure about that. I guess you could find out by trial and error.

I can change that, but would having RT values calculated differently prevent DIA-NN from properly using both libraries together?

Having a different RT scale should be fine, because DIA-NN will redo the alignment.
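If you want to harmonize the header names before retrying, a small sketch (the column names "RT" and "Tr_recalibrated" are examples only; check the actual header names in your two libraries):

```python
import csv

def rename_column(in_path, out_path, old_name, new_name):
    """Copy a TSV, renaming one header column."""
    with open(in_path, newline="") as fin, \
         open(out_path, "w", newline="") as fout:
        reader = csv.reader(fin, delimiter="\t")
        writer = csv.writer(fout, delimiter="\t")
        header = next(reader)
        writer.writerow([new_name if h == old_name else h for h in header])
        writer.writerows(reader)

# Hypothetical usage:
# rename_column("paser_library.tsv", "paser_library_fixed.tsv",
#               "RT", "Tr_recalibrated")
```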

Best,

Fengchao