Parameters for whole human genome.

Question

cgroza opened this issue 4 years ago · 10 comments

Hi,
I have induced a graph genome from three human genomes with minimap2 (-x asm20) and seqwish.
I have removed any alignments smaller than 100kb.
The three genomes are hg19 and the two haplotypes of the NA12878 haplotype-resolved assembly published by Shilpa Garg.
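
For readers who want to reproduce the length filter mentioned above, here is a minimal Python sketch. It assumes the 100 kb cutoff is applied to the alignment block length (column 11 of a PAF record); the thread does not say which column was actually used, and the function name `filter_paf_by_block_length` and the demo records are made up for illustration.

```python
from typing import Iterable, Iterator


def filter_paf_by_block_length(lines: Iterable[str], min_len: int = 100_000) -> Iterator[str]:
    """Yield PAF records whose alignment block length is at least min_len.

    Assumption: the cutoff applies to PAF column 11 (alignment block
    length); change the index if another column is meant.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed records
        if int(fields[10]) >= min_len:
            yield line


# Two made-up records: one 150 kb alignment and one 50 kb alignment.
demo = [
    "q1\t200000\t0\t150000\t+\tt1\t300000\t1000\t151000\t149000\t150000\t60",
    "q2\t80000\t0\t50000\t-\tt1\t300000\t5000\t55000\t49000\t50000\t60",
]
kept = list(filter_paf_by_block_length(demo))
print(len(kept), "of", len(demo), "records kept")
```

On a real file, the same generator can be fed an open file handle and the kept lines written back out unchanged.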

Here are the GFA statistics:

gfatools stat minimap_graph.gfa
Number of segments: 12792039
Number of links: 18218447
Number of arcs: 36436894
Max rank: 0
Total segment length: 3537361166
Average segment length: 276.528
Sum of rank-0 segment lengths: 0
Max degree: 14
Average degree: 1.424

It seems that this graph is not particularly complex (max degree 14).
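
The figures above can be cross-checked without gfatools. The following is a rough Python sketch, not gfatools' own implementation: it assumes that "degree" means the number of links touching one end of a segment, which is at least consistent with the reported average (36436894 arcs / (2 × 12792039 segments) ≈ 1.424). The function names and the tiny demo graph are illustrative.

```python
from collections import Counter
from typing import Dict, Iterable, List


def segment_length(fields: List[str]) -> int:
    """Length of an S record: the sequence itself, or the LN:i: tag if the sequence is '*'."""
    if fields[2] != "*":
        return len(fields[2])
    for tag in fields[3:]:
        if tag.startswith("LN:i:"):
            return int(tag[5:])
    return 0


def gfa_summary(lines: Iterable[str]) -> Dict[str, float]:
    """Count segments, links, total length and per-segment-end degree in a GFA."""
    n_segments = 0
    n_links = 0
    total_length = 0
    side_degree = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            n_segments += 1
            total_length += segment_length(fields)
        elif fields[0] == "L" and len(fields) >= 5:
            n_links += 1
            # A link leaves the end of a forward segment (or the start of a reversed one)
            side_degree[(fields[1], "end" if fields[2] == "+" else "start")] += 1
            # and enters the start of a forward segment (or the end of a reversed one).
            side_degree[(fields[3], "start" if fields[4] == "+" else "end")] += 1
    return {
        "segments": n_segments,
        "links": n_links,
        "total_length": total_length,
        "max_degree": max(side_degree.values(), default=0),
        # Sum of side degrees (2 per link) divided by the number of sides (2 per segment).
        "avg_degree": (2 * n_links) / (2 * n_segments) if n_segments else 0.0,
    }


# Tiny made-up graph: three segments, three links.
demo_gfa = [
    "H\tVN:Z:1.0",
    "S\t1\tACGT",
    "S\t2\tAC",
    "S\t3\tG",
    "L\t1\t+\t2\t+\t0M",
    "L\t1\t+\t3\t+\t0M",
    "L\t2\t+\t3\t+\t0M",
]
summary = gfa_summary(demo_gfa)
print(summary)
```

Replacing the final aggregation with `side_degree.most_common(10)` would list the individual high-degree segment ends, which helps locate where the complexity sits.
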
I built this graph with the following seqwish parameters:

seqwish -P -b minimap_seqs -g minimap_graph.gfa -s seqs.fa.gz -p seqs_filtered.paf -k 60 -t 40

I tried to smooth the seqwish output with smoothxg, but it always runs out of memory (180 GB RAM).

smoothxg -t 20 -g minimap_graph.gfa -w 50000 -M -J 0.7 -K -I 0 -R 0 -j 5000 -e 5000 -l 10000 -m minimap_graph_smooth.maf -C minimap_graph_smooth.consensus,10,100,1000,10000 -o minimap_graph_smooth.gfa

Is there a particular set of parameters that enables smoothxg to smooth human genomes with less memory?

Answer 1 · 2021-03-01T21:15:32.000Z

Hi @cgroza,

does the memory problem occur during the partial order alignment (POA) phase?

By specifying -I 0 as the identity threshold, you are disabling the sequence clustering that is executed before the POAs. This clustering is important to avoid blocks that are too heterogeneous, which would lead to high memory requirements. With a higher identity threshold, the sequence blocks to smooth become 'easier' for the POA phase and require less memory. For humans, we are using values ranging from 0.7 to 0.9.

Answer 2 · 2021-03-01T21:26:49.000Z

Precisely, it occurs halfway in the POA step. Thank you for the tip, will try and report back.

Answer 3 · 2021-03-02T14:36:13.000Z

That worked great, but the resulting smoothed graph appears to be even more complex.

$ gfatools stat minimap_graph_smooth.gfa
Number of segments: 51245049
Number of links: 74093236
Number of arcs: 148186575
Max rank: 0
Total segment length: 5289386764
Average segment length: 103.218
Sum of rank-0 segment lengths: 0
Max degree: 9351
Average degree: 1.446

Perhaps I also need to tune the -l and -c parameters for the human genome?
I am using the default right now.
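
To put the two `gfatools stat` reports side by side, here is a small Python sketch that parses both outputs as posted in this thread and prints the fold change for a few fields. The helper name `parse_gfatools_stat` is made up, and the parsing only assumes the `key: value` layout shown above.

```python
from typing import Dict


def parse_gfatools_stat(text: str) -> Dict[str, float]:
    """Parse 'key: value' lines into a dict, ignoring lines that do not fit."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        try:
            stats[key.strip()] = float(value)
        except ValueError:
            continue  # non-numeric value, skip
    return stats


# Values copied from the two reports in this thread.
before = parse_gfatools_stat("""\
Number of segments: 12792039
Number of links: 18218447
Total segment length: 3537361166
Max degree: 14
""")
after = parse_gfatools_stat("""\
Number of segments: 51245049
Number of links: 74093236
Total segment length: 5289386764
Max degree: 9351
""")

fold_change = {key: after[key] / before[key] for key in before}
for key, ratio in fold_change.items():
    print(f"{key}: {ratio:.2f}x")
```

With these inputs the total segment length grows by roughly 1.5 times and the segment count by roughly 4 times, which is the increase the question is about.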

Answer 4 · 2021-03-02T15:48:19.000Z

These kinds of high-degree nodes can come from paralogous repetitive mappings. Due to the chain-and-align model in minimap2, it may be difficult to avoid these without higher-level filtering. To eliminate them, I suggest you use wfmash and specify a long mashmap segment length and minimum mapping block length. Presently, I am using -p 98 -s 100000 -l 500000 -n 5 for human-to-human alignments (with -n giving the number of alternate mappings).

Answer 5 · 2021-03-02T15:50:52.000Z

I'm possibly mixing up two issues. It is not unusual for the graph complexity to increase in terms of the number of nodes, but the size should come down. In smoothxg, the block split parameter can also help to mitigate this when the high-degree nodes come from repeats that are in cis, like VNTRs. We've been testing smoothxg -I 0.9 -R 0.7, for instance.

Answer 6 · 2021-03-02T15:56:44.000Z

Thank you for the input @ekg !
The pggb pipeline is a great tool. Hopefully I can pick out the right parameters.

Answer 7 · 2021-03-04T18:58:28.000Z

So I tried again with the suggested parameters:

alignment:
  mapping-tool:       wfmash
  no-splits:          false
  segment-length:     100000
  block-length:       500000
  no-merge-segments:  false
  map-pct-id:         98
  align-pct-id:       0.97
  n-secondary:        5
  mash-kmer:          16
  wfmash:             true
  exclude-delim:      false
seqwish:
  min-match-len:      60
  transclose-batch:   1000000
smoothxg:
  block-weight-max:   50000
  path-jump-max:      5000
  edge-jump-max:      5000
  poa-length-max:     10000
  consensus-spec:     10,100,1000,10000
  block-id-min:       0.9
  ratio-contain:      0.7

Unfortunately, the graph still increases in size from about 3.5 Gbp (seqwish) to about 5 Gbp (smoothxg). However, the maximum node degree does not increase much (29).
Is there another smoothxg parameter I haven't adjusted?

Answer 8 · 2021-03-04T21:07:35.000Z

Interesting, we should confirm that this isn't causing problems: `align-pct-id: 0.97`. I had thought this should be 0, but I also think it isn't affecting wfmash. So it might be that block-id-min: 0.9, ratio-contain: 0.7 is causing the graph to unfold somehow. Try setting these much lower, to 0.5/0.3 or similar. Does the graph get smaller?

Answer 9 · 2021-03-05T17:23:04.000Z

Hi again,
So I varied these parameters a bit and they don't seem to do much.
The graph still unfolds and is bigger than the one induced by seqwish, both in maximum degree and number of base pairs.
Does it matter that the assemblies have contigs that are below the block size (and are skipped by alignment) but are still included in the graph by pggb (as segments with no connections)?
Could these be interfering with the smoothing process?

Answer 10 · 2021-03-06T13:23:44.000Z

I've just fixed a problem with wfmash that was leaving a lot of unaligned contigs below the block length filter. But I'm not sure if that's your problem here. The isolated segments should not be interfering, but they may make the graph larger than expected. We are doing some iterations on single chromosomes to continue refining the parameters. Stay tuned. We can try to confirm your results. We weren't looking at the change in size between seqwish and smoothxg. Usually, it has seemed to come down in node count, size, and maximum node degree.
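
One way to check how much of the graph size is due to isolated segments is to count the segments that no link touches. The sketch below is a plain-Python illustration under the assumption that an unaligned contig shows up as an S record with no L record referring to it; the function name and the demo graph are hypothetical.

```python
from typing import Dict, Iterable


def isolated_segments(lines: Iterable[str]) -> Dict[str, int]:
    """Return {segment name: length} for segments that appear in no L record."""
    lengths = {}
    linked = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            lengths[fields[1]] = len(fields[2])
        elif fields[0] == "L" and len(fields) >= 5:
            linked.add(fields[1])
            linked.add(fields[3])
    return {name: n for name, n in lengths.items() if name not in linked}


# Made-up graph: segments 1 and 2 are linked, segment 3 stands alone.
demo_gfa = [
    "S\t1\tACGT",
    "S\t2\tGG",
    "S\t3\tTTTTT",
    "L\t1\t+\t2\t+\t0M",
]
isolated = isolated_segments(demo_gfa)
print(len(isolated), "isolated segments,", sum(isolated.values()), "bp in total")
```

Subtracting the isolated base pairs from the total segment length before and after smoothing shows whether the size increase comes from the aligned part of the graph.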
