waveygang/wfmash

Incorrect CIGAR string generation in versions 0.16 through 0.21

Opened this issue · 8 comments

Dear wfmash developers,

I've identified an issue with CIGAR strings in PAF files generated by wfmash versions 0.16 and later. This problem appears to be related to the inversion patching feature introduced in v0.16.

Key points:

  • Affected versions: 0.16, 0.17, 0.18, 0.21 (not tested: 0.19, 0.20)
  • The issue causes problems when processing PAF files with other tools, such as rustybam.

Example error (using rustybam break-paf -m 5000):
toy.zip

thread 'main' panicked at src/paf.rs:71:43:
called Result::unwrap() on an Err value: PafParseCigar { msg: "query bases 4000 from cigar does not equal 59000-55354=3646\nCM055321.1\t82983525\t55354\t59000\t-\tscaffold_1\t182733053\t172879498\t172883499\t3627\t4126\t9\tid:Z:\tcg:Z:3X1=2I1=3X1=1X2=1X2=1X1=1I1=1X1=1X1=1X1=2X2=1X3=2X1=1X1=1X1=1D2=1X3=1I1X1=1X2=1X1=59I3=2X1=1X2=2I2=1X1=1X1=1X2=1X3=3X1=2X1=2X1=1X2=3I1X1=2X3=1X4=1X3=1X1=1X1=1I1X2=4X1=1X2=1X1=1X7D2=1X1=2X4=1X1=1X3=1X3=1X3=4X1=1X4=1X1=1X1=1X1=5X8=1X2=2D1=2X1=1X1=5X1=1X2=1D1X5=1X1=2X1=1X2=1X1=1X3=1X1=78D38=20I110=3D2=2X18=1X67=1X419=1X20=3I367=1X113=1X63=1X82=1X332=1I17=1D84=1X32=1X161=1X25=1X123=1X225=1X157=3D21=1X24=1X282=2I278=1X214=1X46=1X3=5D3=1X2=1X1=2I1X3=1X1=1X2=4D1=2X1=3X1=1X4=1X1=4D1X1=2X3=1X2=2D1=1X1=1X2=1X1=1X2=1X2=2I1=1X2=2X2=1X5=1X4=3D1=1X1=2X1=1X3=1X2=6D3=3X4=3X3=3X1=1X1=2X1=3X1=1X1=1X1=2X1=3X2=2X1=1X1=2X2=1X1=2X3=1X2=1X1=4X1=1X1=2I3=1I1=1X3=2X2=1X1=2D1X2=3X2=3X1=2X1=3X3=1X1=1X3=2X2=3X4=1X2=1D3=3D1X4=2X1=1X1=2X3=1X3=3X1=2X1=4X1=1X4=2I1=1X2=3X1=1X2=2I1X1=1X4=1I2X1=1X1=1X1=2X3=1X2=1X1=1X2=11I1X4=3X3=1X2=1X1=2X2=1X1=1X1=7I1X3=1X\n" }
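
For readers who want to check a PAF file independently of rustybam, here is a minimal Python sketch of the consistency rule the panic message describes: the bases consumed by the `cg:Z:` CIGAR must equal the coordinate spans of the record (in the record above, the query span is 59000-55354=3646). The function names `cigar_lengths` and `check_paf_line` are hypothetical, and the sketch assumes the CIGAR uses only the `M`, `=`, `X`, `I`, `D`, and `N` operators. It is not rustybam's or pafcheck's implementation.

```python
import re

# Assumption: only these operators appear in the PAF cg:Z: tag.
VALID_CIGAR = re.compile(r"(\d+[MIDN=X])+")
CIGAR_OP = re.compile(r"(\d+)([MIDN=X])")


def cigar_lengths(cigar):
    """Return (query_bases, target_bases) consumed by a CIGAR string."""
    if not VALID_CIGAR.fullmatch(cigar):
        raise ValueError(f"unsupported or malformed CIGAR: {cigar[:40]}")
    query = target = 0
    for count, op in CIGAR_OP.findall(cigar):
        n = int(count)
        if op in "M=X":      # match/mismatch consumes both sequences
            query += n
            target += n
        elif op == "I":      # insertion consumes query only
            query += n
        else:                # D or N consumes target only
            target += n
    return query, target


def check_paf_line(line):
    """Return a list of mismatch messages for one PAF record (empty if consistent)."""
    fields = line.rstrip("\n").split("\t")
    query_span = int(fields[3]) - int(fields[2])
    target_span = int(fields[8]) - int(fields[7])
    # Optional tags start at column 13; look for the cg:Z: tag among them.
    cigar = next((f[5:] for f in fields[12:] if f.startswith("cg:Z:")), None)
    if cigar is None:
        return ["no cg:Z: tag found"]
    query_bases, target_bases = cigar_lengths(cigar)
    problems = []
    if query_bases != query_span:
        problems.append(f"query bases {query_bases} != span {query_span}")
    if target_bases != target_span:
        problems.append(f"target bases {target_bases} != span {target_span}")
    return problems
```

Applying `check_paf_line` to each line of a PAF file and printing the non-empty results should list the records a downstream parser is likely to reject.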

Steps to reproduce:

  • Generate a PAF file using wfmash v0.21 with parameters: -t 96 -4 -p 60
  • Process the resulting PAF file with rustybam

This issue does not occur with wfmash v0.15.

Could you please investigate this CIGAR string inconsistency? It would be helpful to understand if this is a bug or if there have been changes in the CIGAR string format that need to be addressed in downstream tools.

Thank you for your attention to this matter.

Best regards,
Gaorui

ekg commented

The next release will resolve this. Thanks!

ekg commented

Does the current main HEAD resolve this issue? I've now added integration tests for PAF correctness using https://github.com/ekg/pafcheck, which should be equivalent to the existing SAM correctness checks.

Thank you for the quick response!
Unfortunately, I am working on a cluster that lacks some necessary libraries, and I am unable to compile wfmash from source.
Would it be possible for you to provide a precompiled binary of wfmash?

@GRGong You could modify the Dockerfile to build wfmash from HEAD and use the resulting Docker image. If you don't have root access, the Singularity remote builder would help (https://cloud.sylabs.io/builder).

ekg commented

@GRGong here's a wfmash binary. I should probably make a release, but I'd prefer to do that once you've confirmed that this resolves the issues you're seeing. If not, we should resolve them and add some automated tests to prevent future problems. Right now I'm testing the SAM, PAF, and MAF conversion steps using GitHub Actions.

Just gunzip and make sure it's executable: wfmash-v0.21.0-38-gb731e41.gz

@ekg Thanks for the binary. I tested the provided binary using my own genomes, but it still has the CIGAR problem. For your reference, I’ve uploaded the query and target FASTA files, along with the command I used and the error log.

Here is the link:
https://drive.google.com/file/d/18MzFalZhVnKt-hTfTxmxI2KsdZxh6Zsf/view?usp=sharing

Note: The two genomes belong to divergent insect species, but they are still in the same subfamily. The previous version, wfmash v0.15, worked without issues.

Hi,

I'd like to follow up on this issue.
Could you kindly let me know if there have been any updates or insights?

Best,
gr

ekg commented