Triploid Genome using HiCanu

Question


Closed this issue 10 months ago · 9 comments

Hello,
I'm using HiCanu in Canu 2.0 to perform a haplotype-resolved assembly of a high-ploidy (triploid) genome.

I have almost 60x coverage of HiFi reads. However, some chromosomes were not successfully assembled. We found this when anchoring Hi-C reads to two homologous chromosomes: one of the homologous chromosomes had almost twice the Hi-C read coverage of the other chromosomes.

As a result, I suspect that some chromosomes were not properly represented in my genomic assembly. Can you please advise me on how to adjust the parameters to include these missing chromosomes in the assembly?

Thank you for your assistance.

skoren commented 10 months ago

Answer 1 · 2023-09-18T14:08:24.000Z

There's no way to separate sequences further in a HiFi assembly, as it is essentially already using perfect overlaps. You can confirm this from the report section on unitigging and the selected error rate. The higher-coverage chromosome is likely nearly identical between 2/3 haplotypes. If that is the case, you'd need to duplicate the sequence of that chromosome in the assembly. While canu will resolve haplotypes, it will introduce switching between them within the long contigs.

You could try verkko in HiFi-only mode, which would give you a graph representation of the assembly and only output fully phased sequences along with their coverages. Its Hi-C support handles only diploid, not triploid, but the graph structure should still show you whether there are places where two chromosomes are identical or not. Either way, you're not going to get a fully phased/resolved assembly automatically. You'd need to do some manual inspection of the graph and likely select which nodes you want to duplicate between 1/2/3 haplotypes (as appropriate).
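
As a rough illustration of the "duplicate between 1/2/3 haplotypes" step, the sketch below guesses a copy number for each node from its coverage. It assumes that the roughly 60x total HiFi coverage is spread evenly over three haplotypes (about 20x each); the function name, node names, and coverage values are hypothetical and are not output from canu or verkko.

```python
def estimate_copies(node_coverage, haploid_coverage, max_ploidy=3):
    """Guess how many haplotypes a contig or graph node represents.

    Assumption: coverage scales linearly with the number of collapsed
    haplotypes, which ignores repeats and coverage bias.
    """
    if haploid_coverage <= 0:
        raise ValueError("haploid_coverage must be positive")
    copies = round(node_coverage / haploid_coverage)
    # Clamp to the range that makes sense for a triploid genome.
    return max(1, min(max_ploidy, copies))


# ~60x total coverage over three haplotypes -> ~20x per haplotype (assumed).
haploid_cov = 60 / 3

# Hypothetical node names and coverages, for illustration only.
coverages = {"node_a": 19.5, "node_b": 41.0, "node_c": 58.2}
copies = {name: estimate_copies(cov, haploid_cov) for name, cov in coverages.items()}
print(copies)
```

A node assigned 2 here would be a candidate for the kind of duplication described above, but the graph should still be inspected manually before duplicating anything.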

Answer 2 · 2023-09-21T06:47:59.000Z

Thank you for your reply. I have carefully considered your response and I agree that the high coverage of certain chromosomes is likely due to them being almost identical between 2/3 haplotypes. In my previous assembly, I did not use the default HiCanu process and only used the assemble module, which means there was no self-correction step. Based on your response, I have adjusted the parameters as follows:

canu corMhapSensitivity=high corMinCoverage=0 -p out -d out_canu genomeSize=7g -pacbio-hifi ${hifireads} useGrid=false corMhapFilterThreshold=0.0000000002 corMhapOptions="--threshold 0.80 --num-hashes 512 --num-min-matches 3 --ordered-sketch-size 1000 --ordered-kmer-size 14 --min-olap-length 2000 --repeat-idf-scale 50" mhapMemory=500g mhapBlockSize=500 ovlMerDistinct=0.998 -untrimmed

I am not sure if these adjustments will allow me to assemble the sequences that are intertwined with each other. Additionally, I suspect that these two haplotypes cannot be distinguished because some of their chromosome sequences have a similarity of 0.99, so their divergence is no larger than the error rate of HiFi reads (0.01). However, I am puzzled as to why the other 3/3 haplotypes can be assembled. Does this mean that the evolution between subgenomes is sometimes convergent and sometimes divergent? It's very strange!

After I finish assembling with the new parameters, I will try your suggestion to use verkko for assembly. Do you have any suggestions for the parameters of verkko? Thank you.

Answer 3 · 2023-09-21T14:59:21.000Z

Correction is not used for HiFi data; only the unitigging step runs, which does its own, more conservative correction. So none of the mhap parameters you've specified will matter (essentially all of these have no effect: corMhapFilterThreshold=0.0000000002 corMhapOptions="--threshold 0.80 --num-hashes 512 --num-min-matches 3 --ordered-sketch-size 1000 --ordered-kmer-size 14 --min-olap-length 2000 --repeat-idf-scale 50" mhapMemory=500g mhapBlockSize=500 ovlMerDistinct=0.998). The -untrimmed option will trim the HiFi data before assembly, which is typically optional, but I don't expect it to make much difference either way.
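
To see what the command reduces to once those options are dropped, here is a small sketch that filters them out of a command string. The set of ignored keys is taken only from the list in this reply; the helper name, the shortened corMhapOptions value, and the reads.fastq.gz placeholder are assumptions made for illustration, and nothing here calls canu.

```python
import shlex

# Options listed in the reply above as having no effect on a HiFi run.
IGNORED_FOR_HIFI = {
    "corMhapFilterThreshold",
    "corMhapOptions",
    "mhapMemory",
    "mhapBlockSize",
    "ovlMerDistinct",
}


def drop_ignored_options(command):
    """Split a shell command and drop key=value tokens with ignored keys."""
    tokens = shlex.split(command)
    return [t for t in tokens if t.split("=", 1)[0] not in IGNORED_FOR_HIFI]


# Shortened version of the command from the thread; reads.fastq.gz is a
# hypothetical stand-in for ${hifireads}.
command = (
    "canu -p out -d out_canu genomeSize=7g useGrid=false "
    "corMhapFilterThreshold=0.0000000002 "
    'corMhapOptions="--threshold 0.80 --num-hashes 512" '
    "mhapMemory=500g mhapBlockSize=500 ovlMerDistinct=0.998 "
    "-pacbio-hifi reads.fastq.gz"
)
print(shlex.join(drop_ignored_options(command)))
```

The quoted corMhapOptions value is parsed as a single token, so it is removed as a whole rather than leaving stray `--threshold` style arguments behind.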

The error rate in practice is much lower than 1%, usually almost 0. This is what I suggested confirming in the assembly.report file, which lists the actual error rate used in your assembly. I'm not sure of the evolution, but couldn't it be that you started with a heterozygous diploid where one set duplicated? The new copy is acquiring mutations but is still very similar to its origin (making these two hard to separate), whereas the other haplotype was different enough to start with.
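
One way to picture why two recently duplicated haplotypes are hard to separate even with near-zero read error: what matters is how far apart the differences are, not only the average identity. The sketch below lists stretches with no differences that are at least as long as a chosen threshold. The variant positions, the sequence length, and the 15 kb threshold (a stand-in for a typical HiFi read length) are all made-up values for illustration.

```python
def shared_blocks(variant_positions, sequence_length, min_block):
    """Return half-open (start, end) intervals without any difference
    between two haplotypes that are at least min_block bases long."""
    blocks = []
    previous = -1  # position just before the start of the sequence
    # Appending sequence_length closes the final block at the sequence end.
    for pos in sorted(variant_positions) + [sequence_length]:
        start, end = previous + 1, pos
        if end - start >= min_block:
            blocks.append((start, end))
        previous = pos
    return blocks


# Hypothetical positions where the two haplotypes differ (0-based).
variants = [1_000, 1_200, 30_000, 31_500]
print(shared_blocks(variants, sequence_length=50_000, min_block=15_000))
```

Reads that fall entirely inside such a block carry no information about which haplotype they came from, which is consistent with the two copies collapsing into one higher-coverage sequence.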

For verkko, I would use default parameters. The graph structure would just give you a clearer idea of the size of these shared blocks and it will use 100% identity when building the graph (at least of the k-mers it uses).
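
To get a first look at block sizes in an assembly graph, segment lengths can be read from a GFA file. This is a minimal sketch assuming GFA version 1 `S` lines, where the length comes from an `LN:i:` tag or from the sequence itself; the example text and node names are invented, and the sketch does not assume any particular verkko output file name.

```python
def segment_lengths(gfa_text):
    """Map segment name -> length for the S lines of a GFA v1 document."""
    lengths = {}
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3 or fields[0] != "S":
            continue  # skip headers, links, and malformed lines
        name, sequence = fields[1], fields[2]
        ln_tags = [f for f in fields[3:] if f.startswith("LN:i:")]
        if ln_tags:
            lengths[name] = int(ln_tags[0][len("LN:i:"):])
        elif sequence != "*":
            lengths[name] = len(sequence)
    return lengths


# Invented two-node graph, for illustration only.
example = (
    "H\tVN:Z:1.0\n"
    "S\tutig-1\tACGTACGT\n"
    "S\tutig-2\t*\tLN:i:250000\n"
    "L\tutig-1\t+\tutig-2\t+\t0M\n"
)
print(segment_lengths(example))
```

Combining these lengths with per-node coverage would show which long nodes are likely shared between haplotypes, though a graph viewer is usually the more practical tool for the manual inspection suggested earlier in the thread.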

Answer 4 · 2023-09-22T03:08:01.000Z

Thank you for your suggestion. I have assembled the sequences using the default parameters of the Verkko software as you recommended, and I hope to be able to assemble all 3/3 sequences. I would appreciate it if you could keep this issue open for now. When the results are available, I may need to communicate with you again. Thank you very much for your help. Yunyun


Answer 5 · 2023-10-17T11:45:39.000Z

Hi Yunyun,

Could you please let me know the commands you used to get your 3/3 sets? I would really appreciate the details. Also, did you do the Hi-C scaffolding afterwards? If so, please describe that step too.

Answer 6 · 2023-10-17T11:56:58.000Z

Hi, in my species there are 14 chromosomes, each with 3/3 haplotypes (42 haploid chromosomes in total). Unfortunately, only 12 of them were successfully assembled and phased as complete 3/3 sets (36 haploid chromosomes). For the other 2 chromosomes, the 2/3 haplotypes (4 haploid chromosomes) could not be assembled. The reason remains uncertain. Note that these results are based on scaffolding with the Hi-C data.

Answer 7 · 2023-10-17T12:00:09.000Z

Thanks Yunyun,

Could you please describe your workflow step by step? I have Hi-C data too.

Answer 8 · 2023-10-17T12:03:17.000Z

First, I assembled the genome from HiFi reads with HiCanu. Then, I scaffolded the contigs using Hi-C data.