makovalab-psu/NoiseCancellingRepeatFinder

Reduce the memory requirement

Opened this issue · 21 comments

baozg commented

Hi,

I want to use NCRF to find a 9 kb repeat in ONT reads, but for ~2000 reads NCRF can use more than 20 GB of memory. Is there any way to reduce the requirement?

20 GB does seem surprising.

The number of reads shouldn't affect this (assuming there's no memory leak), since each is processed separately (and serially). Memory use should be related to the longest read and the number of repeats found.

When you say "9kb repeat" do you mean that's the expected length of the repeat cassette, or do you mean that's the length of the repeat unit?

baozg commented

9 kb is the size of the repeat unit. I only keep ONT reads longer than 50 kb for NCRF.

(my responses have been delayed because my browser is having trouble loading github this morning)

I have to say that's a much longer repeat unit than I envisioned during design. I didn't think about anything longer than about 200 bp. If I remember correctly, the project this was developed for didn't have anything longer than 32 bp.

It's been five years since then, so it would take me some time looking through the code to re-familiarize myself with how I implemented things. The DP matrix used by the alignment core is (probably) 90 times larger for a 9 kbp unit than for a 100 bp unit. Maybe I'm allocating (read_length) × (unit_length) matrix entries for that (just guessing at this point).
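To make that concrete, here is a back-of-the-envelope estimate in Python. The dense read_length × unit_length matrix shape is only the guess above, and bytes_per_cell is a placeholder (the error message later in this thread suggests 15 bytes per cell):

    # Rough DP-matrix memory estimate. Assumes a dense
    # read_length x unit_length matrix (only a guess, per the discussion)
    # and a placeholder cell size (the error message below suggests 15).
    def dp_matrix_bytes(read_length, unit_length, bytes_per_cell=15):
        return read_length * unit_length * bytes_per_cell

    print(dp_matrix_bytes(100_000, 9_000))  # 13,500,000,000 (~13.5 GB)
    print(dp_matrix_bytes(100_000, 100))    # 150,000,000 (~150 MB)
    # The ratio is 9000/100 = 90, matching the "90 times larger" estimate.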

It's also possible the memory use problem is in the internal post-processing stages. While the core is implemented in C, a lot of the post-alignment processing is implemented in Python, and I doubt I gave much consideration to memory use.

By the way, could you post the command pipeline you are using?

baozg commented

Thanks for your quick response. Here is the command I use:

cat UL_50kb/NCRF/split_reads/split_reads_{}.fasta | NCRF unit:`seqkit seq -w 0 unit.fa |grep -v ">"` --minlength=9000 > UL_50kb/NCRF/ncrf_report/split_reads_{}.ncrf.report

Some parts produced results, but many of the jobs were killed because of high memory use.

4e50a99e-2571-4622-8ca4-57921c28c137 107088 107039bp 23-107062   CCACTACTTTTAACGTTATTTTACTTATTCCGTGAATCGGAAGC-GAG-AATGCTGCCCTCTTTTTGGACCAAAGGCTCGCTTCGCGGAGCCGAT-C-GGCGGAAGACATTGTCGAGGTGGGGAGT
unit+                                       107902bp score=45224 CCACTACTTTTAACGTTATTTTACTTATTCCGTGAATCGGAAGCGGgGCAcTGC--CCCTCTTTTTGGACCcAAGGCTCGCTTCGCGG-GCCGATCCGGGCGGAAGACATTGTC-AGGTGGGGAGT

d1ca28e9-086a-463a-9ea7-571a813cf4b8 63032 62751bp 274-63025   CGCACCACGA-GCG-TGTCTGCGAGTCGGGTTGTTTGGGAATGCGGCCCGGATCGGGCGGTGAATTCCGTCCAAGGCTAAATGC-GGC-AGAGACCG-T-GTGAGCAAGT-GCGCGAGGGAAAGATG-
unit+                                      63355bp score=22880 CGCACCACGAGGCGCTGTCTaCGAGTCGGGTTGTTTGGGAATGCaGCCCaaATCGGGCGGTGAATTCCGTCCAAGGCTAAATaCTGGCGAGAGACCGATAGcGAaCAAGTAcCGCGAGGGAAAGATGA

c9c361ed-7fa7-4ebc-9da0-6015bb9014d2 101901 101761bp 139-101900  GGGACGGTGGGAATCTCGTTAATCCATTCATGCGCGTCACTAATTAGATGACGAGGCATTTGGCTACCTTAAGAGAGTCATAGTTACT-CCGCCGTTTA-CCGCGCTTGGTTG-A-GTC-T-ACTG
unit-                                       102397bp score=70441 GGGACaGTGGGAATCTCGTTAATCCATTCATGCGCGTCACTAATTAGATGACGAGGCATTTGGCTACCTTAAGAGAGTCATAGTTACTCCCGCCGTTTACCCGCGCTTGGTTGAATtTCTTCACTt

6839ceb3-a5c5-4c0c-9824-a9efb7b06170 114267 114226bp 29-114255   ACGACGTAGCACGGACGAGCCATGCATGCCATCAAGCGCCCGCGCGGCCCGCGCCTACTAAGCCCACGGGACGCCCTCGGCGT-CG-C-GCCGTTGCCCACTCGACCCGTCGACGTCGGCCAGAGC
unit-                                       115222bp score=36311 ACGACGTAGCACGGACGAGCCATGCATGCCATCAAGCGCCCaCGCaG--CaCGCCTACTAAGCCCACaGGACGCCCTCGGCGTCCGCCTGCCGTTGCCCACTCGACCCGTCGACGTC-GCC-GA-C

2b049cef-311d-47b2-bde7-5dc2d42cfcc8 52169 51992bp 137-52129   GTGTTTAAAGCGAGTTCGCGGGTCGTTCTGCTGTGCAGGTTTCGACAATGATCCTTCCGCAGGTTCACCTACGGGAAACCTTGATACGACTTCTCCTTCCTCTAAATGATAAGGTTCAATGGACTTCT
unit-                                      52351bp score=31644 GTGTTTAAAaCGAGTTCGCGGGTCGTTCTGCTGTGCAGGTTTCGACAATGATCCTTCCGCAGGTTCACCTAC-GGAAACCTTGtTACGACTTCTCCTTCCTCTAAATGATAAGGTTCAATGGACTTCT

fe08f14a-631c-439f-9e64-3c472a63f441 55930 54834bp 853-55687   TGGTCGGACGTCGGAACTT-CTGT-CTTGCATACCTACTGCCTGGGCATTGTGCACGTGCAAAACGGGCAACGCTTGCG-CCCTCGCATCCCGTGCGCGGGGTG---GCAGACAGACGCTCTCGCGTC
unit+                                      55179bp score=26221 TGGTCGGACGTCGGAACTTCCTGTGC-TGCATACCTACTGCCTaGGCATTGTGCACGTGC-AAACGGtC-gC-CTTtCGCCCCTCGCATCCCaTGCGCGGGGTGAACcCA-AaAGACGCTCTCGCGTC

15c93f78-7770-4322-9eb2-1d22c5d8aae6 59443 59202bp 71-59273    TAAGCGCGCGACCTACA-CCGGCCGTCGGGGCAAGTGCCGAGG--CCGATGAGTAGGAGGGCGCGGCGGTCGGTTTGCAAAACCTTGGGCGCGAGCACGGGGCGGAAGCGGCGCGTGTCGGTGC-GAA
unit+                                      59746bp score=24065 TAAGCGCGCGACCTACACCCGGCCGTCGGGGCAAGTGCC-AGGCCCCGATGAGTAGGAGGGCGCGGCGGTC-G-cTGCAAAACCTTGGGCGCGAGC-CtGGGCGG-AGCGGC-C--GTCGGTGCAG-A

706438ea-2226-4cfc-b0cf-b1e775b50fae 58605 58540bp 54-58594    AGAATCATAATGAAAATGTACAACCCCAATTATGCGCGGTCTAAGGTACAAACACACATCGGCCTTCATAATTGACTTT-A-ATGTATAAAAAAAT-GTTTTCAACAATTTTTTTTTGATTTTTTATT
unit-                                      58738bp score=29271 AGAATCATAATGAAAATGTACAA-CCCAAaTATGCGCGGTCTAAGGTACAAACACACATCaGCCTTCATAATTGACTTTAATATGTATAAAAAAATAtTTTTCAACAA-TTTTTTTTGA-TTTTTATT

Some parts produced results, but many of the jobs were killed because of high memory use.

What error message do you see? I'm trying to understand whether the system reported a problem, or if NcRF failed to allocate memory, reported it, and quit.

baozg commented

Most of them were killed by the cluster, and some of them show a DP cells error:

/usr/bin/bash: line 1: 140214 Broken pipe             cat UL_50kb/NCRF/split_reads/split_reads_001.fasta
     140215 Killed                  | NCRF unit:`seqkit seq -w 0 unit.fa |grep -v ">"` --minlength=9000 > UL_50kb/NCRF/ncrf_report/split_reads_001.ncrf.report
failed to re-allocate 37,153,540,815 bytes for 2,476,902,721 DP cells

Thanks, that is helpful. The "failed to re-allocate" message is generated by NcRF. And it confirms my earlier guess that the problem is the amount of memory needed for the DP matrix. That will give me a place to start.

Thinking out loud now ... 37,153,540,815 / 2,476,902,721 is 15, so whatever I am using to represent a cell takes 15 bytes. Possibly I can reduce that.

There will be ≈9K columns in the DP matrix, one per unit position. 2,476,902,721 / 9,000 ≈ 275K, so it's trying to allocate enough for about 275K rows. That would accommodate a repeat of length up to 275 kbp. Typically I start with a small buffer and reallocate, growing it by some percentage, whenever it fills. In some implementations of the malloc/realloc library, that paradigm can leave unrecoverable chunks in the heap, so it's possible this causes more memory to be allocated than is needed (though I suspect such implementations are rare these days and were mostly found on more primitive OSs in the 1990s).
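The same arithmetic, spelled out (assuming the column count equals the 9,000 bp unit length, per the guess above):

    # Back-calculating from the "failed to re-allocate" message.
    bytes_requested = 37_153_540_815
    cells_requested = 2_476_902_721
    print(bytes_requested // cells_requested)  # 15 bytes per DP cell
    print(cells_requested // 9_000)            # 275,211 rows, i.e. ~275K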

One solution to this, which I've used in other programs, is to provide a command-line option to allow the user to specify how much memory to allocate, up front (as opposed to starting with a small amount). And ... checking ... I did implement these for NcRF. If you do --help=allocation you'll see a list of such options.

Try using something like --allocate:aligner=15G and --allocate:sequence=L, where L is bigger than your longest read. That might help. Might not completely solve the problem though.

baozg commented

The length of the longest read is 731,600 bp. I will try this and see what NCRF outputs. Another question: does --allocate:aligner=15G mean that the maximum memory NCRF uses will be 15 GB? I only have a 128-core, 256 GB node, so all I can do is reduce the number of concurrent processes (before this setting I was running 12 NCRF processes simultaneously).

I'd have to dig into the code to be sure (I'm working on other stuff at the moment), but I think --allocate:aligner=15G probably means the amount of memory allocated for the DP matrix; specifically, it sets the initial allocation for that object. If the object fills up, NcRF is still going to try to enlarge it. But if your problem is caused by heap fragmentation, this should help.

baozg commented

Thanks a lot! I will try this and let you know how it goes.

You could also add --debug=allocation (not documented). This will report details about all the DP allocation events. That would help me better understand how the DP matrix is being allocated, and when.

baozg commented

I'll add it to the test! I could also share some problematic reads and the unit sequence with you if you have time to debug this (although maybe our aim is too far away from your original design).

baozg commented

The same error occurred. Also, 15G shows up as --allocate:aligner=16106127360 (and the sequence allocation as --allocate:sequence=1048576), so the values appear to be in bytes.

NcRF parsing should treat --allocate:aligner=16106127360 and --allocate:aligner=15G exactly the same. It interprets G as a size suffix, and for memory amounts those suffix letters are conventionally multiples of 1024 rather than 1000, so 15G means 15 × 1024³ = 16,106,127,360 bytes.
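For illustration, a minimal sketch of that suffix convention (not NcRF's actual parsing code) reproduces the numbers above:

    # Minimal sketch of 1024-based size-suffix parsing; illustrative
    # only, not NcRF's actual parser.
    def parse_memory_spec(spec):
        suffixes = {"K": 1024, "M": 1024**2, "G": 1024**3}
        s = spec.strip().upper()
        if s[-1] in suffixes:
            return int(s[:-1]) * suffixes[s[-1]]
        return int(s)

    print(parse_memory_spec("15G"))  # 16106127360
    print(parse_memory_spec("1M"))   # 1048576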

I can look at the reads that are causing problems (and I'd be glad to have them), but understand I may not have much time to devote to this.

Thinking of possible workarounds ...

What parts of the output are you using downstream? If it is only the intervals occupied by your repeats, you could split the reads into smaller overlapping pieces, then merge the reported intervals (see the sketch below). Piece size would be the length of the longest read that doesn't fail. Overlap would be bigger than your repeat unit; I'd say twice the repeat unit length.
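A rough sketch of that split-and-merge idea (hypothetical helper code, not part of NcRF; piece_size and overlap follow the rules of thumb above):

    # Hypothetical split-and-merge helpers (not part of NcRF).
    # Assumes piece_size > overlap.
    def split_read(seq, piece_size, overlap):
        """Yield (offset, piece) pairs covering seq with the given overlap."""
        step = piece_size - overlap
        for start in range(0, max(len(seq) - overlap, 1), step):
            yield start, seq[start:start + piece_size]

    def merge_intervals(intervals):
        """Union of (start, end) read-coordinate intervals from the pieces."""
        merged = []
        for start, end in sorted(intervals):
            if merged and start <= merged[-1][1]:
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        return merged

Each interval NcRF reports on a piece would be shifted by that piece's offset before merging.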

baozg commented

I am trying to use centroFlye (https://github.com/seryrzu/centroFlye), actually, which uses NCRF to find the human centromeric repeat (~180 bp). Unfortunately, it has been discontinued, but I guess centroFlye gets its interval information from the NCRF output.

I see.

I did a little digging through their code. I do find some evidence that they use more than just the intervals from the NcRF output. For example, their scripts/ncrf_parser.py has a function to extract slices of an NcRF-reported alignment. It would take more digging to figure out whether that's just something they have for debugging, or whether it is integral to their functionality.

Since centroFlye is an assembler, you could just try splitting the problem reads, but without any overlap. (Overlap would potentially bias decisions the assembler makes.) This would probably reduce sensitivity slightly.

baozg commented

Thanks for your suggestion. We are trying to assemble repeat-rich regions, and the all-to-all overlap step struggles with those reads. If we break the reads, then we lose that information.

OK, understood.

The only viable workaround I can think of is to create a wrapper around NcRF that (a) processes a "short enough" read by running NcRF, and (b) splits long reads into overlapping pieces, runs NcRF on each, and merges the resulting output so that it looks like it was created by a single run of NcRF. Then make what should be a minor mod in centroFlye to invoke the wrapper instead of NcRF.

Not sure how difficult the merge operation would be. There's code in the NcRF post-processing scripts that examines the alignment (as output by NcRF) on a motif-by-motif basis, which might be useful.
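A hypothetical skeleton of such a wrapper (the names, length threshold, and dispatch logic are illustrative; only the NCRF invocation mirrors the command used earlier in this thread):

    # Hypothetical wrapper skeleton; threshold and names are
    # illustrative, not part of NcRF or centroFlye.
    import subprocess

    MAX_DIRECT_LEN = 300_000  # longest read length known not to fail; tune this

    def run_ncrf(fasta_text, unit_seq, min_length=9000):
        """Run NcRF on a FASTA string (via stdin) and return the report text."""
        result = subprocess.run(
            ["NCRF", f"unit:{unit_seq}", f"--minlength={min_length}"],
            input=fasta_text, capture_output=True, text=True, check=True)
        return result.stdout

    def process_read(name, seq, unit_seq):
        if len(seq) <= MAX_DIRECT_LEN:
            return run_ncrf(f">{name}\n{seq}\n", unit_seq)
        # Otherwise: split into overlapping pieces (see split_read above),
        # run NcRF on each, shift coordinates, and stitch the reports so
        # they look like one run -- the merge step discussed above.
        raise NotImplementedError("piecewise merge left as the hard part")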

baozg commented

Thanks for your suggestion. I would love to try it.