vgteam/vg

`vg convert --vg-algorithm` loses start coordinates of paths in its W lines

zhangyixing3 opened this issue · 3 comments

Dear sir,
I would like to convert a filtered pangenome GBZ index back to the original GFA. To keep the nodes consistent with the full or clip graph, I used the --vg-algorithm option. However, the path information for non-reference paths in 97samples.d10.vg.algorithm.gfa seems to be incorrect. Is this expected?

vg convert -f 97samples.d10.gbz --vg-algorithm > 97samples.d10.vg.algorithm.gfa
vg convert -f 97samples.d10.gbz > 97samples.d10.gfa

results

  1. 97samples.d10.gfa: OK
grep "^W" 97samples.d10.gfa | less -S

image

  2. 97samples.d10.vg.algorithm.gfa: the coordinate information is lost, and all start positions are 0
grep "^W" 97samples.d10.vg.algorithm.gfa | less -S

image
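The difference between the two files can also be summarised without paging through the W lines. The following is only a minimal sketch: the function name summarize_walk_starts is hypothetical, and it assumes tab-separated GFA 1.1 W lines whose fifth field is the start coordinate (SeqStart).

```shell
#!/bin/sh
# Hypothetical helper (not part of vg): count the W lines in a GFA file
# and how many of them have a start coordinate of exactly 0.
# Assumed GFA 1.1 W-line layout (tab-separated):
#   W  SampleId  HapIndex  SeqId  SeqStart  SeqEnd  Walk
summarize_walk_starts() {
    awk -F'\t' '
        $1 == "W" {
            total++
            if ($5 == "0") zero++
        }
        END { printf "walks=%d zero_start=%d\n", total, zero }
    ' "$1"
}

# Usage (file names taken from the commands above):
#   summarize_walk_starts 97samples.d10.gfa
#   summarize_walk_starts 97samples.d10.vg.algorithm.gfa
```

If haplotypes are split into several fragments, a file in which zero_start equals walks is a hint that the start offsets were dropped, although a graph whose walks all genuinely start at 0 would give the same counts.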

It looks like the VG algorithm is meant to preserve the start offset of the path:

vg/src/gfa.cpp

Lines 197 to 204 in e9fbbc3

size_t start_offset = 0;
size_t end_offset = 0;
if (subrange != PathMetadata::NO_SUBRANGE) {
    start_offset = subrange.first;
    if (subrange.second != PathMetadata::NO_END_POSITION) {
        end_offset = subrange.second;
    }
}

Maybe the GBZ is not actually exposing these paths as having start offsets? If you run vg paths --metadata --sample 001_6137 -x 97samples.d10.gbz, do these paths claim to have NO_SUBRANGE or do they properly list their subrange coordinates on the base path?

The PathMetadata implementation for GBWTGraph assumes that subranges can exist only for reference/generic paths and that only haplotype paths can have phase blocks. Haplotype paths are therefore presumably exposed without a subrange, which would explain why the vg algorithm writes a start of 0 for them. The GBWTGraph algorithm avoids the issue because it works with GBWT/GBWTGraph semantics rather than libhandlegraph semantics.

@zhangyixing3 The underlying issue is that GBZ was designed to both store the original GFA and expose an equivalent graph with integer node identifiers and nodes no longer than 1024 bp. If you use the GBWTGraph algorithm to convert GBZ back to GFA, you get the original GFA, where segments can have string names and be arbitrarily long. If you want a GFA with integer node identifiers and short nodes, you can use option --no-translation with the GBWTGraph algorithm.
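For reference, the suggestion above might look like this on the command line. This is a sketch, not a verified recipe: the wrapper name gbz_to_short_node_gfa and the output file name are hypothetical, and it assumes that vg convert uses the GBWTGraph algorithm by default for GBZ input, as the working command in the original report suggests.

```shell
#!/bin/sh
# Hypothetical wrapper: convert a GBZ to GFA with the GBWTGraph algorithm
# (assumed to be the default for GBZ input), using --no-translation so that
# the output keeps integer node identifiers and short nodes.
#   $1 = input GBZ file
#   $2 = output GFA file
gbz_to_short_node_gfa() {
    vg convert -f "$1" --no-translation > "$2"
}

# Usage (output name is a placeholder):
#   gbz_to_short_node_gfa 97samples.d10.gbz 97samples.d10.no_translation.gfa
#   grep "^W" 97samples.d10.no_translation.gfa | less -S
```

Leaving out --vg-algorithm is the point here: the W-line coordinates come from the GBWTGraph algorithm, while --no-translation keeps the node space of the GBZ graph.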

Thank you, the --vg-algorithm parameter indeed results in W lines representing offsets. Using --no-translation, I successfully obtained graphs where node lengths do not exceed 1024, along with their coordinate information.