Minigraph on gigabase-sized chromosomes

Question

Minigraph on gigabase-sized chromosomes

agolicz opened this issue 3 years ago · 9 comments

Hi,
I've been trying minigraph on a plant genome (13 Gb in size, 6*~2Gb chromosomes).
I have a 1 Tb VM, graph construction has now been running for over 24hrs, but I think progress has stalled.

minigraph -cxggs -t40 Ref.v1.chr1L.fasta ragtag.scaffold.fmt.chr1L.fasta > chr1L.minigraph.gfa
[M::main::7.056*1.00] loaded the graph from "Ref.v1.chr1L.fasta"
[M::mg_index::109.032*1.46] indexed the graph
[M::mg_opt_update::111.394*1.45] occ_max1=100; lc_max_occ=4
[M::ggen_map::119.676*1.42] loaded file "ragtag.scaffold.fmt.chr1L.fasta"

Is minigraph expected to work on chromosome ~1.8 Gb in length?

Answer 1 · 2022-04-25T12:39:23.000Z

You are using the version from github. This version hasn't been optimized for repetitive genomes. For now, you may try

minigraph -xggs -t40 -U10,50 Ref.v1.chr1L.fasta ragtag.scaffold.fmt.chr1L.fasta > out.gfa

If still too slow, try 0.16 from the release page.

Is your genome public? If it is, I would like to tune minigraph for large plant genomes as well. Thanks.

Answer 2 · 2022-04-25T13:07:15.000Z

Many thanks! I'll try 0.16. One of the genomes is publicly available under:
https://projects.au.dk/fabagenome/genomics-data

I have sent a download link to the other one (which will be available soon) via email.
Please let me know if there are any issues accessing the data.

Answer 3 · 2022-04-25T13:17:54.000Z

I have got the email and downloaded your assembly. Thank you!

I will try to tune minigraph parameters, but this will take time for large genomes. I will keep you posted if I make any progress. Of course I will only use your data for debugging purposes.

Answer 4 · 2022-04-25T19:25:50.000Z

It just occurred to me that minigraph is slow here mainly because your scaffold is 1.8Gb in length. Minigraph parallelize alignment by sequence. When you have a single sequence in the input, minigraph is effectively running on one thread.

I have just run minigraph (commit 5a0492e on the current github HEAD) on chr1L but with scaffolds broken into contigs.

minigraph -cxggs -t40 Vfaba.ref.chr1L.fa Vfaba.scaf.chr1L.fa > 1.gfa 2> 1.gfa.log

It took minigraph 20 min to generate the graph. Nonetheless, for species of high diversity, it is recommended to add option -j.1. With this option, minigraph is fairly slow. It will probably take ~5 hours. I will try to optimize this part.

Answer 5 · 2022-04-25T20:10:14.000Z

Great, thanks! That explains the only 100% CPU usage, which surprised me. Is there any disadvantage to running it on contigs rather than pseudomolecules? I suppose in that case it would make the most sense to split and build the graph by chromosome.

Answer 6 · 2022-05-09T15:18:05.000Z

I have just released v0.18 which should have improved performance for diverged samples like yours. I could get a graph in half an hour with the following command line (EDIT: I split your ragtag scaffold into contigs):

minigraph -cxggs -j.1 -t40 Vfaba.ref.chr1L.fa Vfaba.contigs.chr1L.fa

Is there any disadvantage to running it on contigs rather than pseudomolecules?

It is preferred to have the reference in scaffolds and other samples in contigs.

I am closing this issue. If you have follow up questions, please feel free to ask here, reopen this issue or start a new issue.

Answer 7 · 2022-05-11T18:29:31.000Z

Many thanks. I tried the v0.18 on the full assembly and it appears to have worked well.
The graph size is about 116% of the reference assembly, which is within the expected range.
For QC purposes I have one additional question (we've had problems using different mappers on these assemblies presumably due to repetitive nature). Would it be possible to extract/output information on the % of sequence (even somewhat roughly) that could be mapped to the graph? The log file just says 'mapped 14378 sequence(s) to the graph' - this is the number of contigs in the second input file.

Answer 8 · 2022-05-12T08:42:54.000Z

And a related question. What happens to the sequences that cannot be aligned?

Answer 9 · 2022-05-14T00:43:50.000Z

I did some tests on how much of the alternative sequence makes it to the graph based on sample1-specific genes. Annotation was transferred from the sample1 onto ref assembly to see how many genes are accession-specific and then onto linearized graph to see how many of the sample1-specific genes are in the graph. Only about 46% of sample1-specific genes makes it into the graph with -j1. Any idea what could be done to improve that?

minigraph-0.18/minigraph -cxggs -j1 -t40 ref.fasta sample1.fasta > faba.minigraph.gfa 2>error
gfa2fa -s faba.minigraph.gfa > faba.minigraph.fa

#lifting sample1.gff3 onto sample1.fa - control that liftoff works- all features should be lifted
liftoff -g sample1.gff3 -o sample1.check.gff3 -p 50 -u unmapped_features.sample1.txt -f features.txt sample1.fasta sample1.fasta
wc -l unmapped_features.sample1.txt
1 unmapped_features.sample1.txt

#lifting sample1.gff3 onto ref.fa - sample1 specific genes should not be lifted and will be in unmapped_features.ref.txt file
liftoff -g sample1.gff3 -o ref.gff3 -p 50 -u unmapped_features.ref.txt -f features.txt ref.fasta sample1.fasta
wc -l unmapped_features.ref.txt
942 unmapped_features.ref.txt #942 sample1 specific not in ref

#lifting sample1.gff3 onto faba.minigraph.fa to see how many sample1 specific genes are in the graph
liftoff -g sample1.gff3 -o graph.gff3 -p 50 -u unmapped_features.graph.txt -f features.txt faba.minigraph.fa sample1.fasta
wc -l unmapped_features.graph.txt
513 unmapped_features.graph.txt #513 sample1 specific not in the graph