nanoporetech/pomoxis

Running mini_assemble with large, high coverage fastq file

Closed this issue · 5 comments

Hi,
I had what I think is a similar issue. I ran mini_assemble on a very large, high-coverage fastq file. The script ran to completion, but the assembly output was an empty fasta file.
Using top, I also noticed that minimap2 was running on only one thread even though I passed -t 12 to mini_assemble. The minimap2 command was being issued with -t12 rather than -t 12, which is the form described in the minimap2 manual.
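On the -t12 form: getopt-style parsers generally accept an option argument either attached (-t12) or as a separate word (-t 12). The sketch below uses the shell's getopts builtin to illustrate that convention. parse_threads is a hypothetical helper, and whether minimap2's own parser behaves identically is an assumption here. That said, the later minimap2 calls in the log report more CPU time than wall-clock time (for example 21.517 sec CPU against 5.111 sec real), which suggests that -t12 did enable multiple threads.

```shell
# Illustration only: getopts treats "-t12" and "-t 12" the same way.
# parse_threads is a hypothetical helper, not part of minimap2 or pomoxis.
parse_threads() {
    local OPTIND=1 opt threads=""
    while getopts "t:" opt "$@"; do
        case "$opt" in
            t) threads="$OPTARG" ;;   # value after -t, attached or separate
            *) return 1 ;;            # unknown option
        esac
    done
    echo "$threads"
}

parse_threads -t12    # prints 12
parse_threads -t 12   # prints 12
```

If both forms parse alike, a single busy thread seen in top is more likely to reflect which stage minimap2 was in than a mis-parsed option.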

In addition, I noticed in the screen output that minimap2 was not returning any output, so I ran it on its own and saw that it was killed for lack of memory. I re-ran it with a smaller batch size using the -K parameter and it worked.
My question is: how can I do this with mini_assemble? Is it possible to control the minimap2 batch size?
The command and screen output of my original mini_assemble run follow at the end of this post.
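For reference, a standalone overlap run with a reduced batch size might look like the sketch below. The ava-ont preset, the 100M value, the file names, and the overlap_reads wrapper are all assumptions made for illustration; the thread does not show the exact command that was used. The -K option sets how many bases minimap2 loads per mini-batch, so lowering it should reduce peak memory.

```shell
# Hypothetical wrapper around an all-vs-all overlap with minimap2.
# Assumed: the ava-ont preset and a 100M mini-batch; adjust both to your data.
overlap_reads() {
    local reads="$1"          # input reads (FASTA/FASTQ, optionally gzipped)
    local out="$2"            # gzip-compressed PAF file to write
    local threads="${3:-12}"  # worker threads (-t)
    local batch="${4:-100M}"  # bases loaded per mini-batch (-K)
    minimap2 -x ava-ont -t "$threads" -K "$batch" "$reads" "$reads" | gzip -1 > "$out"
}

# Example call (not executed here):
#   overlap_reads reads.fa.gz overlaps.paf.gz 12 100M
```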

Thanks,
Avital

(pomoxis) (base) biomesh@biomesh:~/fastq$ ../../../../../../usr/bin/time -v mini_assemble -i test3_filt_q80minLen500.fq -o assembledminLen500 -p test3_filt_q80minLen500_assm -t 12
Copying FASTX input to workspace: test3_filt_q80minLen500.fq > assembledminLen500/test3_filt_q80minLen500_assm.fa.gz
Skipped adapter trimming.
Skipped pre-assembly correction.
Overlapping reads...
[M::mm_idx_gen::12.434*1.65] collected minimizers
[M::mm_idx_gen::14.348*2.73] sorted minimizers
[M::main::14.348*2.73] loaded/built the index for 211416 target sequence(s)
[M::mm_mapopt_update::14.729*2.69] mid_occ = 5927
[M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 211416
[M::mm_idx_stat::14.961*2.66] distinct minimizers: 25289437 (69.09% are singletons); average occurrences: 8.810; average spacing: 2.921
Assembling graph...
[M::main] ===> Step 1: reading read mappings <===
[M::ma_hit_read::0.000*21.43] read 0 hits; stored 0 hits and 0 sequences (0 bp)
[M::main] ===> Step 2: 1-pass (crude) read selection <===
[M::ma_hit_sub::0.000*16.87] 0 query sequences remain after sub
[M::ma_hit_cut::0.000*13.81] 0 hits remain after cut
[M::ma_hit_flt::0.000*13.55] 0 hits remain after filtering; crude coverage after filtering: -nan
[M::main] ===> Step 3: 2-pass (fine) read selection <===
[M::ma_hit_sub::0.000*12.93] 0 query sequences remain after sub
[M::ma_hit_cut::0.000*12.57] 0 hits remain after cut
[M::ma_hit_contained::0.000*12.26] 0 sequences and 0 hits remain after containment removal
[M::main] ===> Step 4: graph cleaning <===
[M::ma_sg_gen] read 0 arcs
[M::main] ===> Step 4.1: transitive reduction <===
[M::asg_arc_del_trans] transitively reduced 0 arcs
[M::main] ===> Step 4.2: initial tip cutting and bubble popping <===
[M::asg_cut_tip] cut 0 tips
[M::asg_arc_del_multi] removed 0 multi-arcs
[M::asg_arc_del_asymm] removed 0 asymmetric arcs
[M::asg_pop_bubble] popped 0 bubbles and trimmed 0 tips
[M::main] ===> Step 4.3: cutting short overlaps (3 rounds in total) <===
[M::asg_arc_del_short] removed 0 short overlaps
[M::asg_arc_del_short] removed 0 short overlaps
[M::asg_arc_del_short] removed 0 short overlaps
[M::main] ===> Step 4.4: removing short internal sequences and bi-loops <===
[M::asg_cut_internal] cut 0 internal sequences
[M::asg_cut_biloop] cut 0 small bi-loops
[M::asg_cut_tip] cut 0 tips
[M::asg_pop_bubble] popped 0 bubbles and trimmed 0 tips
[M::main] ===> Step 4.5: aggressively cutting short overlaps <===
[M::asg_arc_del_short] removed 0 short overlaps
[M::main] ===> Step 5: generating unitigs <===
[M::main] Version: 0.3-r179
[M::main] CMD: miniasm -s 100 -e 3 -f test3_filt_q80minLen500_assm.fa.gz test3_filt_q80minLen500_assm.paf.gz
[M::main] Real time: 3.283 sec; CPU: 3.280 sec
Running racon read shuffle 1...
Running round 1 consensus...
[M::mm_idx_gen::0.002*0.63] collected minimizers
[M::mm_idx_gen::0.002*2.10] sorted minimizers
[M::main::0.002*2.10] loaded/built the index for 0 target sequence(s)
[M::mm_mapopt_update::0.002*2.09] mid_occ = 1
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 0
[M::mm_idx_stat::0.003*2.07] distinct minimizers: 0 (-nan% are singletons); average occurrences: -nan; average spacing: -nan
[M::worker_pipeline::4.279*3.98] mapped 162861 sequences
[M::worker_pipeline::5.111*4.21] mapped 48555 sequences
[M::main] Version: 2.14-r883
[M::main] CMD: minimap2 -t12 test3_filt_q80minLen500_assm.gfa.fa.gz test3_filt_q80minLen500_assm.fa.gz
[M::main] Real time: 5.111 sec; CPU: 21.517 sec; Peak RSS: 0.629 GB
[racon::Polisher::initialize] error: empty target sequences set!
Running round 2 consensus...
[M::mm_idx_gen::0.000*3.03] collected minimizers
[M::mm_idx_gen::0.001*4.88] sorted minimizers
[M::main::0.001*4.86] loaded/built the index for 0 target sequence(s)
[M::mm_mapopt_update::0.001*4.75] mid_occ = 1
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 0
[M::mm_idx_stat::0.001*4.65] distinct minimizers: 0 (-nan% are singletons); average occurrences: -nan; average spacing: -nan
[M::worker_pipeline::4.208*4.04] mapped 162861 sequences
[M::worker_pipeline::5.006*4.28] mapped 48555 sequences
[M::main] Version: 2.14-r883
[M::main] CMD: minimap2 -t12 racon_1_1.fa.gz test3_filt_q80minLen500_assm.fa.gz
[M::main] Real time: 5.006 sec; CPU: 21.434 sec; Peak RSS: 0.629 GB
[racon::Polisher::initialize] error: empty target sequences set!
Running round 3 consensus...
[M::mm_idx_gen::0.000*2.93] collected minimizers
[M::mm_idx_gen::0.001*4.10] sorted minimizers
[M::main::0.001*4.09] loaded/built the index for 0 target sequence(s)
[M::mm_mapopt_update::0.001*3.99] mid_occ = 1
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 0
[M::mm_idx_stat::0.001*3.90] distinct minimizers: 0 (-nan% are singletons); average occurrences: -nan; average spacing: -nan
[M::worker_pipeline::4.255*4.05] mapped 162861 sequences
[M::worker_pipeline::4.888*4.41] mapped 48555 sequences
[M::main] Version: 2.14-r883
[M::main] CMD: minimap2 -t12 racon_1_2.fa.gz test3_filt_q80minLen500_assm.fa.gz
[M::main] Real time: 4.889 sec; CPU: 21.536 sec; Peak RSS: 0.628 GB
[racon::Polisher::initialize] error: empty target sequences set!
Running round 4 consensus...
[M::mm_idx_gen::0.000*3.57] collected minimizers
[M::mm_idx_gen::0.001*6.03] sorted minimizers
[M::main::0.001*6.00] loaded/built the index for 0 target sequence(s)
[M::mm_mapopt_update::0.001*5.82] mid_occ = 1
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 0
[M::mm_idx_stat::0.001*5.61] distinct minimizers: 0 (-nan% are singletons); average occurrences: -nan; average spacing: -nan
[M::worker_pipeline::4.295*3.96] mapped 162861 sequences
[M::worker_pipeline::5.155*4.18] mapped 48555 sequences
[M::main] Version: 2.14-r883
[M::main] CMD: minimap2 -t12 racon_1_3.fa.gz test3_filt_q80minLen500_assm.fa.gz
[M::main] Real time: 5.156 sec; CPU: 21.538 sec; Peak RSS: 0.628 GB
[racon::Polisher::initialize] error: empty target sequences set!
Waiting for cleanup.
rm: cannot remove 'shuffled*': No such file or directory
rm: cannot remove 'paf': No such file or directory
Final assembly written to assembledminLen500/test3_filt_q80minLen500_assm_final.fa. Have a nice day.
Command being timed: "mini_assemble -i test3_filt_q80minLen500.fq -o assembledminLen500 -p test3_filt_q80minLen500_assm -t 12"
User time (seconds): 20353.59
System time (seconds): 53.28
Percent of CPU this job got: 1082%
Elapsed (wall clock) time (h:mm:ss or m:ss): 31:24.97
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 63910484
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 14171
Minor (reclaiming a frame) page faults: 25610878
Voluntary context switches: 220518
Involuntary context switches: 2653733
Swaps: 0
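Two quick checks help when reading a log like this one. GNU time reports the maximum resident set size in kbytes, so the figure above converts to roughly 61 GiB. And because miniasm reported "read 0 hits", it is worth counting the records in the overlap file before moving on to assembly. The helper names below are hypothetical, and the example path assumes the PAF file sits in the output directory under the name shown in the miniasm CMD line.

```shell
# Convert a kbytes value (as printed by GNU time -v) to GiB with two decimals.
kb_to_gib() {
    awk -v kb="$1" 'BEGIN { printf "%.2f\n", kb / (1024 * 1024) }'
}

# Count lines (one overlap per line in PAF) in a gzip-compressed file.
count_records() {
    gzip -cd "$1" | wc -l | tr -d ' '
}

kb_to_gib 63910484   # prints 60.95

# Example (path is an assumption based on the log above, not executed here):
#   count_records assembledminLen500/test3_filt_q80minLen500_assm.paf.gz
```

A count of 0 would point at the overlap step rather than at miniasm or racon.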

cjw85 commented

Hi @avitalsteiman

I have pushed a change that adds a -K option to the mini_assemble program. This option is passed to all calls of minimap2.
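With that change installed, the original run could be repeated with a smaller batch size, roughly as sketched here. The 100M value and the rerun_assembly wrapper are assumptions for illustration; the other arguments are copied from the command earlier in this thread.

```shell
# Hypothetical wrapper: repeat the original run, forwarding -K to minimap2.
# The default of 100M is an assumed value, not a maintainer recommendation.
rerun_assembly() {
    local batch="${1:-100M}"
    mini_assemble -i test3_filt_q80minLen500.fq -o assembledminLen500 \
        -p test3_filt_q80minLen500_assm -t 12 -K "$batch"
}

# Example call (not executed here):
#   rerun_assembly 100M
```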

Amazing! Thank you!

cjw85 commented

Hi,

As well as the commands you have run, you will also need to run

python setup.py install

from the pomoxis directory to have the updated program available for use.