Trouble understanding the GFAs obtained from wtdbg2
jflot opened this issue · 5 comments
I am having trouble understanding the GFAs I obtain using wtdbg2. Here is a simple example as an illustration, using a small bacterial genome.
wget https://downloads.pacbcloud.com/public/dataset/MicrobialMultiplexing_48plex/48-plex_sequences/lima.bc1018--bc1018.subreadset.fastq.gz
wtdbg2 -x preset2 -i lima.bc1018--bc1018.subreadset.fastq.gz -o bc1018
wtpoa-cns -t 4 -i bc1018.ctg.lay.gz -fo bc1018.ctg.fa
I obtain one contig that is 1,692,928 bp long.
However, when I convert the ctg.dot file into gfa:
wtdbg-dot2gfa.pl bc1018.ctg.dot > bc1018.ctg.gfa
it contains three contigs of total lengths 6,592bp
S F0 * LN:i:5128
S F1 * LN:i:1458
S F2 * LN:i:6
L F0 + F2 - 0S gl:i:7 rc:i:36
L F0 - F1 - 0S gl:i:11 rc:i:37
L F1 + F0 + 0S gl:i:11 rc:i:37
L F1 - F2 + 0S gl:i:14 rc:i:9
L F2 + F0 - 0S gl:i:7 rc:i:36
L F2 - F1 + 0S gl:i:14 rc:i:9
What is the explanation for this discrepancy?
Also, is there a way to obtain a "consensused" GFA (i.e., a GFA in which the segments are the contigs produced by wtdbg2 after the consensus step) using wtpoa-cns or wtdbg-cns (or another tool)? This would be extremely useful for my research.
Thanks a lot in advance for your help,
Jean-François
bc1018.ctg.dot contained all contigs, but wtdbg2 only reported larger contigs, please see wtdbg2 --help
for '--ctg-min-length' and --ctg-min-nodes
.
I tried running wtdbg2 --ctg-min-length 0 --ctg-min-nodes 0 -x preset2 -i lima.bc1018--bc1018.subreadset.fastq.gz -o bc1018 but I obtain exactly the same result.
There are three contigs in the ctg.dot but they are indicated as tiny: respectively 5,128 bp, 1,458 bp and 6 bp (if I understand well what is written in the file)
$ cat bc1018.ctg.dot
digraph {
node [shape=record]
F0 [label="{F0 1362 5128/5128 | { {N1444:- | m64015_190913_215439/106430572 | R_87_4} | {N1047:- | m64015_190913_215439/170591768 | R_45_4}}}"]
F1 [label="{F1 392 1458/1458 | { {N1398:+ | m64015_190913_215439/170591768 | F_78_4} | {N1615:+ | m64015_190913_215439/106430572 | F_71_4}}}"]
F2 [label="{F2 2 6/6 | { {N750:- | m64015_190913_215439/170591768 | F_59_5} | {N1276:+ | m64015_190913_215439/170591768 | R_57_3}}}"]
F0 -> F2 [label="+-:36:7" color=green style=solid]
F0 -> F1 [label="--:37:11" color=gray style=solid]
F1 -> F0 [label="++:37:11" color=blue style=solid]
F1 -> F2 [label="-+:9:14" color=red style=solid]
F2 -> F0 [label="+-:36:7" color=green style=solid]
F2 -> F1 [label="-+:9:14" color=red style=solid]
}
What I do not understand is, how do three contigs so small get transformed into one 1,692,928 bp-long contig after the consensus step?
Here is what is written at the end of the bc1018.event file:
BINARY_LINK F0 - F1 - 37 11
BINARY_LINK F0 + F1 + 21 28
ctg0 F1 - 0
ctg0 F2 + 376832
ctg0 F0 - 380160
OUTPUT_CTG ctg0 -> ctg1 nodes=1756 len=1692928
but I have difficulties understanding what this means.
Ok, I see. the lengths in ctg.dot is counted in BINs (256bp). I have modified wtdbg-dot2gfa.pl, see e9ab08b
Thanks! Now this makes sense.
Is there a way to get the consensus as GFA instead of FASTA? It would be very useful if wtpoa-cns could produce a GFA in which each segment corresponds to one "consensused" contig, as more and more downstream tools are able to take a GFA as input (e.g. SALSA2, https://doi.org/10.1371/journal.pcbi.1007273) but the error rate in the segments should not be too high.
FASTA is more convenient for users. BTW, I am developing the next version of wtdbg, will consider to offer an option to output GFA instead of FASTA. Thanks for your suggestion!