`vg convert --vg-algorithm` loses start coordinates of paths in its W lines
zhangyixing3 opened this issue · 3 comments
Dear sir,
I would like to convert a filtered pangenome GBZ index back to the original GFA. To keep the nodes consistent with the full or clip graph, I used the `--vg-algorithm` option. However, the path information for non-reference paths in 97samples.d10.vg.algorithm.gfa appears to be incorrect. Is this expected?
vg convert -f 97samples.d10.gbz --vg-algorithm > 97samples.d10.vg.algorithm.gfa
vg convert -f 97samples.d10.gbz > 97samples.d10.gfa
Results:
- 97samples.d10.gfa: OK
grep "^W" 97samples.d10.gfa | less -S
- 97samples.d10.vg.algorithm.gfa: the coordinate information is lost, and all start positions are 0
grep "^W" 97samples.d10.vg.algorithm.gfa | less -S
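For anyone who wants to check this without scrolling through `less`, here is a minimal Python sketch that summarizes the start field of every W line. It assumes the GFA 1.1 walk layout (`W`, sample, haplotype index, sequence name, start, end, walk, tab-separated). The function names `walk_starts` and `all_starts_zero` are made up for illustration and are not part of vg.

```python
from typing import Iterable, List, Tuple


def walk_starts(lines: Iterable[str]) -> List[Tuple[str, str, str, str]]:
    """Return (sample, haplotype, contig, start) for every W line.

    Assumes the GFA 1.1 layout: W, sample, haplotype, contig, start, end, walk.
    The start field is kept as a string because GFA allows '*' there.
    """
    starts = []
    for line in lines:
        if not line.startswith("W\t"):
            continue  # skip H, S, L, P and other record types
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # malformed walk line, ignore it
        sample, haplotype, contig, start = fields[1:5]
        starts.append((sample, haplotype, contig, start))
    return starts


def all_starts_zero(lines: Iterable[str]) -> bool:
    """True if there is at least one W line and every W line starts at 0."""
    starts = walk_starts(lines)
    return bool(starts) and all(entry[3] == "0" for entry in starts)
```

To run it on a real file, pass the open file handle, for example `all_starts_zero(open("97samples.d10.vg.algorithm.gfa"))`. A result of `True` on a graph that contains path fragments would match the symptom described above.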
It looks like the VG algorithm is meant to preserve the start offset of the path (see lines 197 to 204 at commit e9fbbc3).
Maybe the GBZ is not actually exposing these paths as having start offsets? If you run `vg paths --metadata --sample 001_6137 -x 97samples.d10.gbz`, do these paths claim to have NO_SUBRANGE, or do they properly list their subrange coordinates on the base path?
The PathMetadata implementation for GBWTGraph assumes that subranges can only exist for reference/generic paths, and that only haplotype paths can have phase blocks. The GBWTGraph algorithm avoids the issue because it works with GBWT / GBWTGraph semantics rather than libhandlegraph semantics.
@zhangyixing3 The underlying issue is that GBZ was designed to both store the original GFA and expose an equivalent graph with integer node identifiers and nodes no longer than 1024 bp. If you use the GBWTGraph algorithm to convert GBZ back to GFA, you get the original GFA, where segments can have string names and be arbitrarily long. If you want a GFA with integer node identifiers and short nodes, you can use the `--no-translation` option with the GBWTGraph algorithm.
Thank you. The `--vg-algorithm` parameter indeed results in W lines representing offsets. Using `--no-translation`, I successfully obtained graphs where node lengths do not exceed 1024 bp, along with their coordinate information.
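In case it helps others confirm the same result, below is a rough Python sketch that lists any segment longer than the 1024 bp limit mentioned above. It assumes standard GFA S lines (`S`, name, sequence, optional tags) and falls back to an `LN:i:` tag when the sequence is `*`. The helper names `segment_lengths` and `oversized_segments` are hypothetical and not part of vg.

```python
from typing import Iterable, Iterator, List, Tuple


def segment_lengths(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (segment name, length in bp) for every S line.

    Assumes the GFA layout: S, name, sequence, optional tags.
    If the sequence is '*', the length is read from an LN:i: tag when present.
    """
    for line in lines:
        if not line.startswith("S\t"):
            continue  # only segment records carry node sequences
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # malformed segment line, ignore it
        name, sequence = fields[1], fields[2]
        if sequence != "*":
            yield name, len(sequence)
            continue
        for tag in fields[3:]:
            if tag.startswith("LN:i:"):
                yield name, int(tag[len("LN:i:"):])
                break


def oversized_segments(lines: Iterable[str], limit: int = 1024) -> List[Tuple[str, int]]:
    """Return the segments whose length exceeds `limit` bp."""
    return [(name, length) for name, length in segment_lengths(lines) if length > limit]
```

An empty list from `oversized_segments(open("some.gfa"))` would indicate that every segment fits within the limit; the file name here is only a placeholder.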