vibansal/HapCUT2

Column order discrepancy between OS

Problem description

We're using HapCUT2 to phase VCF files and have seen that the phased output swaps alleles 1 and 2 depending on whether osx or linux is used (copy A and B become B and A). Although this means the same thing biologically, we get different md5sums, which can be confounding when trying to develop cross-platform pipelines and software.

We're phasing according to the 10x linked reads specifications.

Command

HAPCUT2 --nf 1 --fragments <input-linked> --vcf <input-vcf> --out <out-phase> --outvcf 1

HAPCUT2 out-phase files

linux

BLOCK: offset: 6 len: 4 phased: 2 SPAN: 23514 fragments 1
6	0	1	chr1mini	24987	G	C	0/1:.:960:124,134:250,238:99	0	.	41.00	1
9	0	1	chr1mini	48501	T	A	0/1:.:903:158,146:246,214:99	0	.	41.00	1

osx

BLOCK: offset: 6 len: 4 phased: 2 SPAN: 23514 fragments 1
6	1	0	chr1mini	24987	G	C	0/1:.:960:124,134:250,238:99	0	.	41.00	1
9	1	0	chr1mini	48501	T	A	0/1:.:903:158,146:246,214:99	0	.	41.00	1

HapCUT2 starts from a random haplotype solution for each BLOCK, which can result in a different order of the two haplotypes. A potential solution would be to always output the haplotype with more '0's in the second column. I will let you know once this has been implemented.
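Until the output order is deterministic, a checksum comparison can be replaced by a comparison that treats a block-wide swap of the two haplotype columns as equal. The following is a minimal sketch, not part of HapCUT2; the function names `parse_block` and `same_phasing` are hypothetical, and it assumes tab-separated variant lines with the two haplotype alleles in columns 2 and 3, as in the output shown above.

```python
def parse_block(lines):
    """Return (hap1, hap2, other_columns) per variant line, skipping headers."""
    variants = []
    for line in lines:
        # Skip BLOCK headers, separator lines and empty lines.
        if line.startswith("BLOCK:") or line.startswith("*") or not line.strip():
            continue
        fields = line.split("\t")
        variants.append((fields[1], fields[2], (fields[0], *fields[3:])))
    return variants


def same_phasing(block_a, block_b):
    """True if both blocks agree, allowing one swap of the haplotype columns."""
    a, b = parse_block(block_a), parse_block(block_b)
    if len(a) != len(b):
        return False
    # Everything except the two haplotype columns must match exactly.
    if any(x[2] != y[2] for x, y in zip(a, b)):
        return False
    identical = all((x[0], x[1]) == (y[0], y[1]) for x, y in zip(a, b))
    swapped = all((x[0], x[1]) == (y[1], y[0]) for x, y in zip(a, b))
    return identical or swapped


# The two variant lines from the linux and osx runs in this issue.
TAIL = "0/1:.:960:124,134:250,238:99\t0\t.\t41.00\t1"
LINUX = ["6\t0\t1\tchr1mini\t24987\tG\tC\t" + TAIL]
OSX = ["6\t1\t0\tchr1mini\t24987\tG\tC\t" + TAIL]

print(same_phasing(LINUX, OSX))
```

A swap that applies to only some variants in a block is a real phasing difference, so the check requires the swap to hold for every variant in the block.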

Sounds good, thanks!

The fix has been implemented in branch 'merging_021820'. For each block, the haplotype with '0' allele for the first variant in the block is output first.
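For output produced before this fix, the same rule can be applied as a post-processing step. This is a rough sketch of the rule described above, not the code in the branch; `canonicalize_blocks` and `_orient` are hypothetical names, and it assumes tab-separated variant lines (without trailing newlines) whose columns 2 and 3 hold the two haplotype alleles. As a further assumption, variants whose column 2 is neither '0' nor '1' are skipped when choosing the orientation.

```python
def _orient(variant_lines):
    """Swap columns 2 and 3 of a block if its first phased variant starts with '1'."""
    flip = False
    for line in variant_lines:
        fields = line.split("\t")
        if len(fields) >= 3 and fields[1] in ("0", "1"):
            flip = fields[1] != "0"
            break
    if not flip:
        return list(variant_lines)
    flipped = []
    for line in variant_lines:
        fields = line.split("\t")
        if len(fields) >= 3:
            fields[1], fields[2] = fields[2], fields[1]
        flipped.append("\t".join(fields))
    return flipped


def canonicalize_blocks(lines):
    """Return the lines with every block oriented so hap1 starts with '0'."""
    out, block = [], []
    for line in lines:
        # A BLOCK header or separator line ends the previous block.
        if line.startswith("BLOCK:") or line.startswith("*"):
            out.extend(_orient(block))
            block = []
            out.append(line)
        else:
            block.append(line)
    out.extend(_orient(block))
    return out


HEADER = "BLOCK: offset: 6 len: 4 phased: 2 SPAN: 23514 fragments 1"
LINUX_BLOCK = [
    HEADER,
    "6\t0\t1\tchr1mini\t24987\tG\tC\t0/1:.:960:124,134:250,238:99\t0\t.\t41.00\t1",
    "9\t0\t1\tchr1mini\t48501\tT\tA\t0/1:.:903:158,146:246,214:99\t0\t.\t41.00\t1",
]
OSX_BLOCK = [
    HEADER,
    "6\t1\t0\tchr1mini\t24987\tG\tC\t0/1:.:960:124,134:250,238:99\t0\t.\t41.00\t1",
    "9\t1\t0\tchr1mini\t48501\tT\tA\t0/1:.:903:158,146:246,214:99\t0\t.\t41.00\t1",
]

print(canonicalize_blocks(OSX_BLOCK) == LINUX_BLOCK)
```

Applied to the two outputs in this issue, the osx block is flipped and the linux block is left unchanged, so both produce the same bytes and therefore the same md5sum.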

Awesome! Thanks!