marbl/MHAP

Output format


It is not very clear what the output values actually represent.

I found in the docs that a record looks like
[A ID] [B ID] [Jaccard score] [# shared min-mers] [0=A fwd, 1=A rc] [A start] [A end] [A length] [0=B fwd, 1=B rc] [B start] [B end] [B length], so I have a few questions.

  • Are end values inclusive or exclusive?
  • If overlap looks like
--------       A fwd
     --------  B rc

what would the approximate values be for read B?
Would start be 7 and end be 10, or would they be 0 and 3, respectively?

I am sorry if this looks very obvious; I am coming from the AMOS world, where everything goes crazy.

  • I think it is now clear that the numbers would be 0 and 3 and not 7 and 10.
  • The second question, about whether the upper limit is inclusive or exclusive, is still open.

Hi,

The values are inclusive, but the bounds are not exact in MHAP because the code does not perform a full alignment.

The reverse coordinates in your example should be 7-10. That is, they're always in the forward strand of the read so you'd take positions 7-10 from the fwd read and reverse complement it to get the overlapping bases.
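
A minimal Python sketch of the coordinate convention described above: parse one record, take the reported positions from the forward read, and reverse complement them when the strand flag is 1. The names `MhapOverlap`, `parse_overlap` and `overlap_bases` are hypothetical and not part of MHAP. The sketch assumes whitespace-separated fields and 0-based positions (the thread only confirms that the end is inclusive), and since the bounds are approximate, the extracted bases are only an estimate of the overlapping region.

```python
from typing import NamedTuple


class MhapOverlap(NamedTuple):
    """One record, fields in the order given in the question above."""
    a_id: str
    b_id: str
    score: float
    shared_minmers: int
    a_rc: bool
    a_start: int
    a_end: int
    a_len: int
    b_rc: bool
    b_start: int
    b_end: int
    b_len: int


def parse_overlap(line: str) -> MhapOverlap:
    """Parse one whitespace-separated record (assumed layout: 12 fields)."""
    f = line.split()
    if len(f) != 12:
        raise ValueError(f"expected 12 fields, got {len(f)}")
    return MhapOverlap(
        f[0], f[1], float(f[2]),
        int(float(f[3])),  # tolerate a count written as e.g. "20.0"
        f[4] == "1", int(f[5]), int(f[6]), int(f[7]),
        f[8] == "1", int(f[9]), int(f[10]), int(f[11]),
    )


_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def overlap_bases(read: str, start: int, end: int, is_rc: bool) -> str:
    """Return the overlapping bases of a read.

    start/end are positions on the forward strand of the read, end
    inclusive (0-based indexing is an assumption of this sketch).
    """
    region = read[start:end + 1]
    return reverse_complement(region) if is_rc else region
```

For example, for a made-up record `1 2 0.05 20 0 5 10 11 1 7 10 11` and a read B of `AAAAAAACCGT`, the region 7-10 is `CCGT` on the forward strand, and `overlap_bases` returns its reverse complement.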

Thank you @skoren for both answers. Since MHAP does no real alignment, I should treat the overlaps just as good candidates for running a real alignment algorithm on, right?

It depends on what your goal is. The overlaps have high specificity (99% or higher), which means that if MHAP outputs an overlap, there is matching sequence there and the positions are within a small error percentage. If you only want to know which positions of reads map to other reads and build layouts, like we do in Celera Assembler for correction, the output is sufficient. Most consensus programs would be able to generate corrected sequences given this input. However, if you need to know the full gapped alignment within an overlap, you would want to run an alignment on the overlapping regions.

I need it for building a layout graph, too. It is just a little tricky to detect containments and transitive edges, since it is not very clear which end of a read an overlap consumes.
Reading the Celera source will probably help.
Thank you for your help.
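
On the question of which end of a read an overlap consumes, one rough approach is to flip the forward-strand coordinates of a reverse-complemented read into the orientation of the overlap and then compare the unaligned overhangs on each side. The sketch below is only an illustration under stated assumptions, not how Celera Assembler does it: `classify_overlap` and the `slop` tolerance are hypothetical, positions are assumed to be 0-based with an inclusive end, and `slop` exists because the MHAP bounds are not exact.

```python
def oriented(start: int, end: int, length: int, is_rc: bool) -> tuple[int, int]:
    """Map forward-strand, end-inclusive coordinates into the orientation
    in which the read takes part in the overlap."""
    if is_rc:
        return length - 1 - end, length - 1 - start
    return start, end


def classify_overlap(a_start: int, a_end: int, a_len: int, a_rc: bool,
                     b_start: int, b_end: int, b_len: int, b_rc: bool,
                     slop: int = 0) -> str:
    """Classify an overlap by its overhangs (hypothetical helper).

    slop is the number of unaligned bases still treated as "reaches the
    end of the read"; a suitable value depends on the data.
    """
    a_s, a_e = oriented(a_start, a_end, a_len, a_rc)
    b_s, b_e = oriented(b_start, b_end, b_len, b_rc)

    # Unaligned bases to the left and right of the overlap in each read.
    a_left, a_right = a_s, a_len - 1 - a_e
    b_left, b_right = b_s, b_len - 1 - b_e

    if a_left <= slop and a_right <= slop:
        return "a_contained"
    if b_left <= slop and b_right <= slop:
        return "b_contained"
    if a_right <= slop and b_left <= slop:
        return "a_suffix_b_prefix"
    if a_left <= slop and b_right <= slop:
        return "b_suffix_a_prefix"
    return "internal"
```

With the picture from the question (both reads of length 8, A fwd overlapping at 5-7, B rc reported at forward positions 5-7), the flipped coordinates of B become 0-2, so the overlap consumes the end of A and the start of the reverse-complemented B.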