pangenome/smoothxg

Parameters for whole human genome.

cgroza opened this issue · 10 comments

Hi,
I have induced a graph genome from three human genomes with minimap2 (-x asm20) and seqwish.
I have removed any alignments smaller than 100kb.
The three genomes are hg19, and the two haplotypes of the NA12878 haplotype resolved assembly published by Shilpa Garg.

Here are the GFA statistics:

gfatools stat minimap_graph.gfa
Number of segments: 12792039
Number of links: 18218447
Number of arcs: 36436894
Max rank: 0
Total segment length: 3537361166
Average segment length: 276.528
Sum of rank-0 segment lengths: 0
Max degree: 14
Average degree: 1.424

It seems that this graph is not particularly complex (max degree 14).
I build this graph with the seqwish parameters:

seqwish -P -b minimap_seqs -g minimap_graph.gfa -s seqs.fa.gz -p seqs_filtered.paf -k 60 -t 40

I try to smooth the seqwish output with, but smoothxg always runs out of memory (180GB RAM).

smoothxg -t 20 -g minimap_graph.gfa -w 50000 -M -J 0.7 -K -I 0 -R 0 -j 5000 -e 5000 -l 10000 -m minimap_graph_smooth.maf -C minimap_graph_smooth.consensus,10,100,1000,10000 -o minimap_graph_smooth.gfa

Is there a particular set of parameters that enables smoothxg to smooth human genomes with less memory?

Hi @cgroza,

does the memory problem occur during the partial order alignment (POA) phase?

Specifying as identity threshold -I 0, you are disabling the sequence clustering that it is executed before the POAs. This clustering is important to avoid getting blocks that are too heterogeneous and would lead to high memory requests. Increasing the identity threshold, the sequence blocks to smooth will become 'easier' for the POA phase, requiring less memory. For humans, we are using values ranging from 0.7 and 0.9.

That worked great, but the resulting smoothed graph attempts to be even more complex.

$ gfatools stat minimap_graph_smooth.gfa
Number of segments: 51245049
Number of links: 74093236
Number of arcs: 148186575
Max rank: 0
Total segment length: 5289386764
Average segment length: 103.218
Sum of rank-0 segment lengths: 0
Max degree: 9351
Average degree: 1.446

Perhaps I also need to tune the -l and -c parameters for the human genome?
I am using the default right now.

ekg commented
ekg commented

Thank you for the input @ekg !
The pggb pipeline is a great tool. Hopefully I can pick out the right parameters.

So I tried again with the suggested parameters:

alignment:
  mapping-tool:       wfmash
  no-splits:          false
  segment-length:     100000
  block-length:       500000
  no-merge-segments:  false
  map-pct-id:         98
  align-pct-id:       0.97
  n-secondary:        5
  mash-kmer:          16
  wfmash:             true
  exclude-delim:      false
seqwish:
  min-match-len:      60
  transclose-batch:   1000000
smoothxg:
  block-weight-max:   50000
  path-jump-max:      5000
  edge-jump-max:      5000
  poa-length-max:     10000
  consensus-spec:     10,100,1000,10000
  block-id-min:       0.9
  ratio-contain:      0.7

Unfortunately, the graph still increases in size from about 3.5 Gbp (seqwish) to about 5Gbp (smoothxg). However, the maximum node degree does not increase much (29).
Is there another smoothxg parameter I haven't adjusted?

ekg commented

Hi again,
So I varied these parameters a bit and they don't seem to do much.
The graph still unfolds and is bigger than the one induced by sequish, both in maximum degree and number of basepairs.
Does it matter that the assemblies has contigs that are below the block size (and are skipped by alignment) but still included in the graph by pggb (as segments with no connections).
Could these be interfering with the smoothing process?

ekg commented