marbl/canu

Triploid Genome using HiCanu

Hello,
I'm using HiCanu in Canu 2.0 to perform a haplotype-resolved assembly of a high-ploidy (triploid) genome.

I have almost 60x coverage of HiFi reads. However, I encountered an issue where some chromosomes were not successfully assembled. This happened because the Hi-C reads we used anchored to two homologous chromosomes, but one of those homologous chromosomes had almost twice the Hi-C read coverage of the other chromosomes.

As a result, I suspect that some chromosomes were not properly represented in my genomic assembly. Can you please advise me on how to adjust the parameters to include these missing chromosomes in the assembly?

Thank you for your assistance.

skoren commented

There's no way to separate sequences any further in a HiFi assembly, as it is essentially already using perfect overlaps. You can confirm this from the report section on unitigging and the selected error rate. The higher-coverage chromosome is likely higher because it is nearly identical between two of the three haplotypes. If that is the case, you'd need to duplicate the sequence of that chromosome in the assembly. While canu will resolve haplotypes, it will introduce switching between them within the long contigs.

You could try verkko in HiFi-only mode, which would give you a graph representation of the assembly and only output fully phased sequences along with their coverages. Its Hi-C support handles only diploid genomes, not triploid, but the graph structure should still show you whether there are places where two chromosomes are identical. Either way, you're not going to get a fully phased/resolved assembly automatically. You'd need to do some manual inspection of the graph and likely select which nodes you want to duplicate between 1/2/3 haplotypes (as appropriate).
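As a rough illustration of choosing how many haplotypes a graph node should be duplicated into, here is a minimal Python sketch. It assumes you have already collected per-node coverage from the graph output yourself; the node names, the function name, and the 20x per-haplotype figure (about 60x total divided by three haplotypes) are illustrative assumptions, not anything verkko produces in this form.

```python
def estimate_copy_number(node_coverage, haplotype_coverage, max_copies=3):
    """Estimate how many haplotypes share each node from its read coverage.

    node_coverage: dict of node name -> observed coverage (hypothetical input).
    haplotype_coverage: expected coverage of a single haplotype (an assumption
    you must supply, e.g. total coverage divided by ploidy).
    """
    copies = {}
    for node, cov in node_coverage.items():
        # Round the coverage ratio to the nearest integer (half rounds up).
        n = int(cov / haplotype_coverage + 0.5)
        # Clamp to between 1 and the ploidy.
        copies[node] = max(1, min(max_copies, n))
    return copies


# Hypothetical nodes and coverages for a triploid at ~20x per haplotype.
coverage = {"utig-1": 19.0, "utig-2": 41.5, "utig-3": 61.0}
print(estimate_copy_number(coverage, haplotype_coverage=20.0))
```

Nodes estimated at 2 or 3 copies would be the candidates to duplicate across haplotypes, though the estimate is only a starting point for the manual inspection described above.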

skoren commented

The correction step is not used for HiFi; only the unitigging step runs, which does its own, more conservative, correction. So none of the mhap parameters you've specified will matter (essentially all of these don't matter: corMhapFilterThreshold=0.0000000002 corMhapOptions="--threshold 0.80 --num-hashes 512 --num-min-matches 3 --ordered-sketch-size 1000 --ordered-kmer-size 14 --min-olap-length 2000 --repeat-idf-scale 50" mhapMemory=500g mhapBlockSize=500 ovlMerDistinct=0.998). The untrimmed option will trim the HiFi data before assembly, which is typically optional, but I don't expect it to make much difference in either case.
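To make the point concrete, the following Python sketch strips the options named above from a command line before it is run. The set contains only the options listed in this comment; the prefix, output directory, genome size, and read file in the example are hypothetical placeholders, and the helper itself is not part of canu.

```python
# Options named in this thread as having no effect on a HiFi assembly.
IGNORED_FOR_HIFI = {
    "corMhapFilterThreshold",
    "corMhapOptions",
    "mhapMemory",
    "mhapBlockSize",
    "ovlMerDistinct",
}


def drop_ignored_options(args):
    """Return the argument list without the name=value options listed above."""
    kept = []
    for arg in args:
        # canu options are written as name=value; compare on the name part.
        name = arg.split("=", 1)[0]
        if name not in IGNORED_FOR_HIFI:
            kept.append(arg)
    return kept


# Hypothetical command; only the mhap/ovl options come from the thread.
command = [
    "canu", "-p", "asm", "-d", "asm_out", "genomeSize=1g",
    "mhapMemory=500g", "mhapBlockSize=500", "ovlMerDistinct=0.998",
    "-pacbio-hifi", "reads.fastq",
]
print(" ".join(drop_ignored_options(command)))
```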

The error rate in practice is much lower than 1%, usually almost 0. This is what I suggested confirming in the assembly.report file, which will list the actual error rate used in your assembly. I'm not sure of the evolution, but couldn't it be that you started with a heterozygous diploid where one set duplicated? The new copy is acquiring mutations but is still very similar to its origin (making these two hard to separate), while the other haplotype was different enough to start with.

For verkko, I would use default parameters. The graph structure would just give you a clearer idea of the size of these shared blocks and it will use 100% identity when building the graph (at least of the k-mers it uses).

Hi Yunyun,

Could you please let me know the commands you used to get your 3/3 sets? I would really appreciate the details here. Also, did you do the Hi-C scaffolding afterwards? If so, please share that part too.

Thanks Yunyun,

Could you please describe your workflow step by step? I have Hi-C data too.
