phasegenomics/FALCON-Phase

Real deletions or missing homozygous regions?

Closed issue · 1 comment

Hi there!

I ran FALCON-Phase successfully on my 905 primary contigs (~333 Mbp) and 6,933 haplotigs (~410 Mbp). I got 905 contigs in both phase0 and phase1. The total number of bases in each of the two outputs was approximately 356 Mbp, which closely matches the cytologically estimated genome size.

In particular, the software produced 356,652,293 bp for phase0 and 357,975,459 bp for phase1, so the two phases differ by 1,323,166 bp. I think a good part of this difference is due to real deletions between the contigs of the two haplotypes. Does that sound reasonable to you?

However, I aligned random pairs of homologous contigs and in some cases I found long deletions at the beginning or at the end of one of the two contigs. Are these real deletions? Or homozygous regions that may not have been correctly assigned by the program (like a missing "copy and paste" of the homologous region)? Or could they be hemizygous regions?

Thanks in advance for any clarification!

Hi,

Generally speaking, yes, this sounds reasonable. The two phases usually differ in size, and the difference here (0.37%) is small enough to be believable.
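For reference, the 0.37% figure can be reproduced from the two totals in the original post. This is only a quick arithmetic sketch; the function name is made up for illustration, and dividing by the smaller phase is an assumption (dividing by the larger one also rounds to 0.37%).

```python
# Totals reported in the original post (bp).
phase0_bp = 356_652_293
phase1_bp = 357_975_459


def size_difference(a_bp, b_bp):
    """Return (absolute difference in bp, difference as % of the smaller total).

    Hypothetical helper: using the smaller total as the denominator is an
    assumption, not something stated in the thread.
    """
    diff = abs(a_bp - b_bp)
    pct = 100.0 * diff / min(a_bp, b_bp)
    return diff, pct


diff_bp, diff_pct = size_difference(phase0_bp, phase1_bp)
print(f"{diff_bp:,} bp ({diff_pct:.2f}%)")  # 1,323,166 bp (0.37%)
```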

For your specific questions, I would say FALCON-Phase thinks they are real deletions, but that doesn't necessarily guarantee they are real 😊 FALCON-Phase does attempt to be aware of homozygosity and to faithfully put copies of homozygous sequence in both phases. So, it must have had some evidence in the Hi-C data for suspecting these sequences are heterozygous. Hemizygosity is a possibility, but you'd probably want to rule out other explanations first.
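One way to decide which contig pairs to look at first, instead of sampling them at random, is to rank the pairs by how much their lengths differ. The sketch below is not part of FALCON-Phase. It assumes the phase0 and phase1 FASTA files list homologous contigs in the same order, which you should verify for your own output, and the helper names and toy sequences are hypothetical.

```python
import io


def fasta_lengths(handle):
    """Return a list of (name, length) tuples in file order from a FASTA handle."""
    records = []
    name, length = None, 0
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records.append((name, length))
            # Keep only the first whitespace-delimited token of the header.
            name, length = line[1:].split()[0], 0
        elif name is not None:
            length += len(line)
    if name is not None:
        records.append((name, length))
    return records


def rank_by_length_gap(phase0, phase1):
    """Pair records by position (an assumption) and sort by length gap, largest first."""
    pairs = [
        (n0, n1, abs(l0 - l1))
        for (n0, l0), (n1, l1) in zip(phase0, phase1)
    ]
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Toy input standing in for the real phase0/phase1 FASTA files.
p0 = fasta_lengths(io.StringIO(">a_0\nACGTACGT\n>b_0\nACGT\n"))
p1 = fasta_lengths(io.StringIO(">a_1\nACGT\n>b_1\nACGTA\n"))
ranked = rank_by_length_gap(p0, p1)
print(ranked)
```

With real data you would pass open file handles instead of `io.StringIO`, then align the top-ranked pairs to see whether the length gap sits at a contig end.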

How much PacBio and Hi-C coverage did you have? Insufficient coverage could lead to this kind of thing too, but as long as you had enough, that shouldn't be a problem. For Hi-C, we recommend whichever is greater: 150M read pairs, or 100M read pairs per gigabase of genome. I believe current PacBio recommendations are 30X for CCS reads, or 60X for CLR.
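As a rough illustration, here is how those recommendations work out for a genome of about 356 Mbp (the size mentioned in the question). The function names are hypothetical, and the Hi-C rule is read as "the larger of a flat 150M read pairs and 100M read pairs multiplied by genome size in Gb", which is an interpretation of the sentence above rather than an official formula.

```python
def hic_read_pairs(genome_bp):
    """Recommended Hi-C read pairs: the greater of 150M or 100M per gigabase."""
    per_gb = 100_000_000 * genome_bp // 1_000_000_000
    return max(150_000_000, per_gb)


def pacbio_bases(genome_bp, depth):
    """Total sequenced bases needed for a given fold coverage."""
    return genome_bp * depth


genome_bp = 356_000_000  # ~356 Mbp, taken from the original post

print(f"Hi-C read pairs: {hic_read_pairs(genome_bp):,}")          # 150,000,000
print(f"CCS at 30X: {pacbio_bases(genome_bp, 30) / 1e9:.2f} Gbp")  # 10.68 Gbp
print(f"CLR at 60X: {pacbio_bases(genome_bp, 60) / 1e9:.2f} Gbp")  # 21.36 Gbp
```

For a genome this size the flat 150M minimum applies; the per-gigabase term only takes over above 1.5 Gb.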

Thanks,

Shawn