bigbio/quantms

SAGE failing on a big database

Opened this issue · 13 comments

### Description of the bug

```
tail -n 500 -f /hps/nobackup/juan/pride/reanalysis/proteogenomics/PXD010154/work/b6/6e37bc1f0f149a2785a8bd4d3968ab/out_17_sage.log
Found Sage version string: Version 0.14.3
Creating temp file name...
Creating Sage config file.../tmp/20231123_073725_hl-codon-bm-10.ebi.ac.uk_43_1.json
Sage command line: /usr/local/bin/sage /tmp/20231123_073725_hl-codon-bm-10.ebi.ac.uk_43_1.json -f GRCh38r110_GCA97s_noncanonical_proteins_19Jul23-decoy.fa -o . --write-pin 01524_B03_P015424_S00_N18_R1.mzML 01277_E01_P013129_S00_N05_R1.mzML 01283_A01_P013187_S00_N01_R1.mzML 01320_E04_P013559_S00_N29_R1.mzML 01296_D03_P013201_S00_N19_R1.mzML 01093_C02_P010748_S00_N11_R1.mzML 01279_A01_P013163_B00_N01_R1.mzML 01500_H04_P015160_S00_N32_R2.mzML 01698_A04_P018021_S00_N25_R1.mzML 01077_A03_P010695_S00_N17_R1.mzML 01501_A02_P015161_S00_N09_R1.mzML 01323_C05_P013562_S00_N35_R1.mzML 01276_E04_P013128_S00_N28_R1.mzML 01293_F03_P013196_S00_N22_R1.mzML
Standard output: Running: /usr/local/bin/sage /tmp/20231123_073725_hl-codon-bm-10.ebi.ac.uk_43_1.json -f GRCh38r110_GCA97s_noncanonical_proteins_19Jul23-decoy.fa -o . --write-pin 01524_B03_P015424_S00_N18_R1.mzML 01277_E01_P013129_S00_N05_R1.mzML 01283_A01_P013187_S00_N01_R1.mzML 01320_E04_P013559_S00_N29_R1.mzML 01296_D03_P013201_S00_N19_R1.mzML 01093_C02_P010748_S00_N11_R1.mzML 01279_A01_P013163_B00_N01_R1.mzML 01500_H04_P015160_S00_N32_R2.mzML 01698_A04_P018021_S00_N25_R1.mzML 01077_A03_P010695_S00_N17_R1.mzML 01501_A02_P015161_S00_N09_R1.mzML 01323_C05_P013562_S00_N35_R1.mzML 01276_E04_P013128_S00_N28_R1.mzML 01293_F03_P013196_S00_N22_R1.mzML

Standard error: [2023-11-23T07:37:25Z WARN  sage_core::modification] Variable modifications must be specified as a list of modifications: [0.984016]. This will become a HARD ERROR by v0.15
[2023-11-23T07:37:25Z WARN  sage_core::modification] Variable modifications must be specified as a list of modifications: [0.984016]. This will become a HARD ERROR by v0.15
[2023-11-23T07:37:25Z WARN  sage_core::modification] Variable modifications must be specified as a list of modifications: [15.994915]. This will become a HARD ERROR by v0.15
[2023-11-23T07:37:25Z WARN  sage_core::modification] Variable modifications must be specified as a list of modifications: [42.010565]. This will become a HARD ERROR by v0.15
[2023-11-23T07:37:33Z TRACE sage_core::database] digesting fasta
[2023-11-23T07:37:55Z TRACE sage_core::database] modifying peptides
[2023-11-23T07:42:35Z TRACE sage_core::database] sorting and deduplicating peptides

thread '<unknown>' has overflowed its stack
fatal runtime error: stack overflow
Process '/usr/local/bin/sage' crashed hard (segfault-like). Please check the log.

SageAdapter took 06:50 m (wall), 0.12 s (CPU), 0.08 s (system), 0.04 s (user); Peak Memory Usage: 42 MB.
```


### Command used and terminal output

_No response_

### Relevant files

_No response_

### System information

_No response_

Ping here, @lazear: this is the same dataset we discussed before, but with a larger database, and all my jobs are failing. I'm splitting it with quantms into 100 jobs with 10 files per job, and even with 700 GB of memory and 48 CPUs it is failing.

If the database is too big, using more jobs will be worse since Sage has to build the index separately in each job. Sage is designed for extremely efficient single-node, single-job tasks, so splitting into smaller jobs will almost always be worse than bigger CPU jobs with more files.

The database needs to be reduced in size: run jobs over chunks of the database, and then combine the results at the end.

lazear/sage#97
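For illustration only, here is a minimal sketch of the database-splitting step on the pipeline side, assuming a plain target-only FASTA; the file names and the `split_fasta` helper are hypothetical, not part of quantms or Sage:

```python
# Hypothetical helper: split a (target-only) FASTA into N roughly equal chunks,
# so each search job only has to build a fragment index for part of the database.
from itertools import cycle

def read_fasta(path):
    """Yield (header, sequence) tuples from a FASTA file."""
    header, seq = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line, []
            elif line:
                seq.append(line)
    if header is not None:
        yield header, "".join(seq)

def split_fasta(path, n_chunks, prefix="chunk"):
    """Distribute FASTA entries round-robin into n_chunks files."""
    outs = [open(f"{prefix}_{i:03d}.fasta", "w") for i in range(n_chunks)]
    try:
        for out, (header, seq) in zip(cycle(outs), read_fasta(path)):
            out.write(f"{header}\n{seq}\n")
    finally:
        for out in outs:
            out.close()

# Example (hypothetical path): 100 chunks, one search job per chunk,
# each job searching all mzML files against its chunk.
# split_fasta("GRCh38r110_GCA97s_noncanonical_proteins_19Jul23.fa", 100)
```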

@lazear thanks for the quick reply. It will take me some time to implement a chunking solution in quantms. In the meantime, do you have an estimate (even if it is far from exact) of how much memory a node needs for a 2 GB FASTA database?

I don't have an estimate, but it's a very bad idea. Fragment indexing is largely memory-bandwidth bound: larger databases cause dramatic slowdowns, and the algorithm stops being able to saturate CPUs at some point (around 10x larger than a human canonical UniProt-sized database with trypsin + 2 missed cleavages). Database splitting is absolutely necessary for good performance at a database of that size.

@jpfeuffer @daichengxin @timosachsenberg How can we implement logic in quantms to partition the database for searches against extensive databases? I understand from @lazear here that SAGE will not have a way around large databases. Instead, @lazear is suggesting that we partition the database and search all the raw files against each chunk; in this case quantms would not be partitioning/distributing raw files but data (FASTA files). I think this is doable, but we need the following:

  • Database creation: make sure the target/decoy ratio is the same in every chunk. I think for MSGF and COMET we do not do any filtering on the targets and decoys (it is just the identification part), but SAGE probably uses them internally, so we need to make sure the ratio of decoys to targets is 50/50 in every chunk. One possibility is to not allow the user to provide decoy databases and have the database-creation tool make sure it creates the database chunks correctly (see the sketch after this list). Ideas @timosachsenberg @lazear @jpfeuffer @daichengxin?

NOTE: It may be interesting to use the pypgatk tool, which is the tool I use for proteogenomics database creation; it has an interesting way to create decoys without redundancy using the DecoyPYrat approach. We could easily add it as a chunking option.

  • After the SAGE/COMET/MSGF search, it should be easy to merge all the database chunks and continue all the steps of quantms using the big database. I have tested all the other steps with the given 2 GB database and they work perfectly well.

What do you think?
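As a rough illustration of the first point (the sketch referenced in the list above), assuming target-only chunks and plain sequence reversal as the decoy method (not necessarily what pypgatk/DecoyPYrat does), with a hypothetical `DECOY_` prefix:

```python
# Hypothetical sketch: write one reversed decoy per target into each chunk,
# so every chunk keeps an exact 50/50 target/decoy ratio.
# read_fasta() is the simple parser from the splitting sketch above.

DECOY_PREFIX = "DECOY_"  # assumed prefix; the real pipeline config may differ

def add_decoys_to_chunk(chunk_path, out_path):
    with open(out_path, "w") as out:
        for header, seq in read_fasta(chunk_path):
            out.write(f"{header}\n{seq}\n")                          # target entry
            accession = header[1:].split()[0]
            out.write(f">{DECOY_PREFIX}{accession}\n{seq[::-1]}\n")  # reversed decoy

# for i in range(100):
#     add_decoys_to_chunk(f"chunk_{i:03d}.fasta", f"chunk_{i:03d}_td.fasta")
```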

It is hard to predict from the outside which/how many sequences to put into the individual chunks. I think in Sage one could (in principle) stop generating the fragment index, perform the search, and then later merge results from individual chunks. This might lead to some peptides being searched multiple times, but well... I think MSFragger does something similar.

Ah, I think I misunderstood the issue. I thought it was just about fitting the fragment index into memory, but is this also about distributing it across several nodes?

What's needed is just a way to distribute different FASTA files across nodes (analogous to distributing different raw files). You could divide your FASTA into 100 parts and spin up 100 jobs to process each chunk separately (thus dramatically reducing fragment index memory use), and then combine the results. Since you guys already have a pipeline in place for combining results from multiple searches, rescoring, and doing protein inference, it should work pretty well - you just need to take the best hit for each spectrum, since there will be 100 best hits (1 from each search).
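A minimal sketch of that "best hit per spectrum" merge, assuming the per-chunk results are tab-separated PSM tables and using hypothetical `scan`/`score` column names and file paths (the actual quantms flow would presumably do this on its own PSM format):

```python
# Hypothetical merge: keep only the highest-scoring PSM per spectrum
# across the per-chunk search results (column names and paths are placeholders).
import csv
import glob

def merge_best_psms(pattern, out_path, spectrum_col="scan", score_col="score"):
    best = {}          # spectrum id -> best-scoring PSM row seen so far
    fieldnames = None  # assumes all chunk result files share the same columns
    for path in glob.glob(pattern):
        with open(path, newline="") as handle:
            reader = csv.DictReader(handle, delimiter="\t")
            fieldnames = reader.fieldnames
            for row in reader:
                key = row[spectrum_col]
                if key not in best or float(row[score_col]) > float(best[key][score_col]):
                    best[key] = row
    with open(out_path, "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=fieldnames, delimiter="\t")
        writer.writeheader()
        writer.writerows(best.values())

# merge_best_psms("chunk_*/psms.tsv", "merged_best_psms.tsv")
```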

I will also try to work on database splitting, since once semi-enzymatic is officially released I imagine I will get many more requests for it.

Actually @lazear, we were internally discussing the issue and trying to find a solution, and @timosachsenberg pointed out that another case is semi-enzymatic analysis, which also increases the size of the search space dramatically. We actually think (and were planning to discuss this with you) that it is better to have a solution within SAGE that handles both cases: big databases and semi-enzymatic searches.

The problem with splitting databases is the target-decoy approach: we will need to control the distribution of targets and decoys and also make sure the other search engines supported by the tool handle the chunks well. We are happy to discuss possible solutions here with you because, as you said, we are already parallelizing the searches by MS run.

Why would the distribution of targets and decoys matter? If you are performing external competition on combined search results (e.g. taking the highest-scoring PSM for that spectrum), it shouldn't matter if you did one search with 100% targets and another with 100% decoys - the end result should be exactly the same. Sage, Comet, etc. will also happily generate decoys for you.

I think if the database is split into several parts, we need to make sure that decoys of the same peptides in different chunks are generated deterministically. Otherwise, we might end up with more decoys per peptide.
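To make that concrete, a small illustrative sketch of two deterministic options: plain reversal (deterministic by construction), or a shuffle seeded from the sequence itself so the same peptide always yields the same decoy no matter which chunk it lands in (this is not how Sage or quantms currently generate decoys):

```python
# Illustrative only: two deterministic decoy generators.
import hashlib
import random

def reversed_decoy(seq: str) -> str:
    # Reversal is deterministic by construction.
    return seq[::-1]

def shuffled_decoy(seq: str) -> str:
    # Seed the shuffle from a hash of the sequence, so the same input
    # produces the same decoy in every chunk and on every run.
    seed = int.from_bytes(hashlib.sha256(seq.encode()).digest()[:8], "big")
    residues = list(seq)
    random.Random(seed).shuffle(residues)
    return "".join(residues)

assert shuffled_decoy("PEPTIDEK") == shuffled_decoy("PEPTIDEK")
```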

OK, so after doing some hacking I can offer at least a partial solution here.

You'll need enough memory to fit all of the Peptide objects in memory for Sage.

I have modified the codebase so that more than 2^32 peptides can be used, and to defer fragment ion generation - this means that mzMLs can be sorted by precursor mass and processed in chunks - a new fragment ion index will be generated for each chunk that contains only peptides within the monoisotopic mass range of those MS2 precursors. This essentially defers generating the actual theoretical fragments until they are needed, and should eliminate a substantial portion of memory use, but not all of it - the Peptide objects are not terribly efficient (storing the sequence and list of protein groups). For a standard tryptic digest, this will probably be OK, even with a 2GB database. Semi-tryptic at that size.... good luck.
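Conceptually (Python pseudocode for the idea described above, not the actual Rust implementation), the deferred approach amounts to sorting spectra by precursor mass and, for each block, selecting only the peptides whose monoisotopic mass can match that block before generating their fragments:

```python
# Conceptual sketch only: group precursor-mass-sorted spectra into blocks and,
# per block, pick the candidate peptides whose mass falls in the block's window.
# Fragment generation and the actual search would then run block by block.
from bisect import bisect_left, bisect_right

def mass_blocks(peptides, spectra, block_size=50_000, tol_da=5.0):
    """Yield (spectra_block, candidate_peptides) pairs.

    peptides: iterable of (monoisotopic_mass, peptide_id)
    spectra:  iterable of (precursor_mass, spectrum_id)
    """
    peptides = sorted(peptides)
    masses = [m for m, _ in peptides]
    spectra = sorted(spectra)
    for start in range(0, len(spectra), block_size):
        block = spectra[start:start + block_size]
        lo = bisect_left(masses, block[0][0] - tol_da)
        hi = bisect_right(masses, block[-1][0] + tol_da)
        yield block, peptides[lo:hi]
```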

It might be possible to modify Sage further to parse FASTA files in chunks on-the-fly, but that will require more work than I am willing to do right now (at the moment the program basically requires that the full set of Peptide objects are live from start to finish, so new ones cannot be created once a search has started, and there are a variety of places that assume Peptides are unique)