Minigraph on gigabase-sized chromosomes
agolicz opened this issue · 9 comments
Hi,
I've been trying minigraph on a plant genome (13 Gb in size, 6*~2Gb chromosomes).
I have a 1 Tb VM, graph construction has now been running for over 24hrs, but I think progress has stalled.
minigraph -cxggs -t40 Ref.v1.chr1L.fasta ragtag.scaffold.fmt.chr1L.fasta > chr1L.minigraph.gfa
[M::main::7.056*1.00] loaded the graph from "Ref.v1.chr1L.fasta"
[M::mg_index::109.032*1.46] indexed the graph
[M::mg_opt_update::111.394*1.45] occ_max1=100; lc_max_occ=4
[M::ggen_map::119.676*1.42] loaded file "ragtag.scaffold.fmt.chr1L.fasta"
Is minigraph expected to work on chromosome ~1.8 Gb in length?
You are using the version from github. This version hasn't been optimized for repetitive genomes. For now, you may try
minigraph -xggs -t40 -U10,50 Ref.v1.chr1L.fasta ragtag.scaffold.fmt.chr1L.fasta > out.gfa
If still too slow, try 0.16 from the release page.
Is your genome public? If it is, I would like to tune minigraph for large plant genomes as well. Thanks.
Many thanks! I'll try 0.16. One of the genomes is publicly available under:
https://projects.au.dk/fabagenome/genomics-data
I have sent a download link to the other one (which will be available soon) via email.
Please let me know if there are any issues accessing the data.
I have got the email and downloaded your assembly. Thank you!
I will try to tune minigraph parameters, but this will take time for large genomes. I will keep you posted if I make any progress. Of course I will only use your data for debugging purposes.
It just occurred to me that minigraph is slow here mainly because your scaffold is 1.8Gb in length. Minigraph parallelize alignment by sequence. When you have a single sequence in the input, minigraph is effectively running on one thread.
I have just run minigraph (commit 5a0492e on the current github HEAD) on chr1L but with scaffolds broken into contigs.
minigraph -cxggs -t40 Vfaba.ref.chr1L.fa Vfaba.scaf.chr1L.fa > 1.gfa 2> 1.gfa.log
It took minigraph 20 min to generate the graph. Nonetheless, for species of high diversity, it is recommended to add option -j.1
. With this option, minigraph is fairly slow. It will probably take ~5 hours. I will try to optimize this part.
Great, thanks! That explains the only 100% CPU usage, which surprised me. Is there any disadvantage to running it on contigs rather than pseudomolecules? I suppose in that case it would make the most sense to split and build the graph by chromosome.
I have just released v0.18 which should have improved performance for diverged samples like yours. I could get a graph in half an hour with the following command line (EDIT: I split your ragtag scaffold into contigs):
minigraph -cxggs -j.1 -t40 Vfaba.ref.chr1L.fa Vfaba.contigs.chr1L.fa
Is there any disadvantage to running it on contigs rather than pseudomolecules?
It is preferred to have the reference in scaffolds and other samples in contigs.
I am closing this issue. If you have follow up questions, please feel free to ask here, reopen this issue or start a new issue.
Many thanks. I tried the v0.18 on the full assembly and it appears to have worked well.
The graph size is about 116% of the reference assembly, which is within the expected range.
For QC purposes I have one additional question (we've had problems using different mappers on these assemblies presumably due to repetitive nature). Would it be possible to extract/output information on the % of sequence (even somewhat roughly) that could be mapped to the graph? The log file just says 'mapped 14378 sequence(s) to the graph' - this is the number of contigs in the second input file.
And a related question. What happens to the sequences that cannot be aligned?
I did some tests on how much of the alternative sequence makes it to the graph based on sample1-specific genes. Annotation was transferred from the sample1 onto ref assembly to see how many genes are accession-specific and then onto linearized graph to see how many of the sample1-specific genes are in the graph. Only about 46% of sample1-specific genes makes it into the graph with -j1. Any idea what could be done to improve that?
minigraph-0.18/minigraph -cxggs -j1 -t40 ref.fasta sample1.fasta > faba.minigraph.gfa 2>error
gfa2fa -s faba.minigraph.gfa > faba.minigraph.fa
#lifting sample1.gff3 onto sample1.fa - control that liftoff works- all features should be lifted
liftoff -g sample1.gff3 -o sample1.check.gff3 -p 50 -u unmapped_features.sample1.txt -f features.txt sample1.fasta sample1.fasta
wc -l unmapped_features.sample1.txt
1 unmapped_features.sample1.txt
#lifting sample1.gff3 onto ref.fa - sample1 specific genes should not be lifted and will be in unmapped_features.ref.txt file
liftoff -g sample1.gff3 -o ref.gff3 -p 50 -u unmapped_features.ref.txt -f features.txt ref.fasta sample1.fasta
wc -l unmapped_features.ref.txt
942 unmapped_features.ref.txt #942 sample1 specific not in ref
#lifting sample1.gff3 onto faba.minigraph.fa to see how many sample1 specific genes are in the graph
liftoff -g sample1.gff3 -o graph.gff3 -p 50 -u unmapped_features.graph.txt -f features.txt faba.minigraph.fa sample1.fasta
wc -l unmapped_features.graph.txt
513 unmapped_features.graph.txt #513 sample1 specific not in the graph