GaetanBenoitDev/metaMDBG

Clarification about rescued circular contigs

Closed this issue · 3 comments

Thanks for the work on metaMDBG!

In the README you mention that the _rc suffix flags "rescued circular" contigs and that the circularity of such sequences is less reliable. However, I couldn't find a more in-depth explanation of what these rescued circular contigs are and why their circularity needs further validation.

Can you provide an explanation of what exactly "rescued circular" stands for?

These are long linear contigs (roughly >1 Mb) whose two ends are very close in the assembly graph (typically separated by a single unitig). This unitig is probably a long repeat. If the repeat is an inter-genomic repeat, the rescued contig is likely to be complete. The problem is that it can also be an intra-genomic repeat, in which case the rescued contig will probably be incomplete.
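To make the description above concrete, here is a toy sketch of the "both ends separated by a single unitig" check. It is not metaMDBG's implementation: the function name `is_rescue_candidate`, the `successors` adjacency dict, the `unitig_length` dict, and the 1 Mb default are all assumptions made for illustration, and summing unitig lengths ignores any overlap between adjacent unitigs.

```python
def is_rescue_candidate(path, successors, unitig_length, min_length=1_000_000):
    """Return True if a linear contig looks like a 'rescued circular' candidate.

    path          -- ordered list of unitig ids forming the linear contig
    successors    -- dict: unitig id -> set of unitig ids that follow it (hypothetical graph)
    unitig_length -- dict: unitig id -> length in bp
    """
    # Only long contigs are considered (overlaps between unitigs are ignored here).
    total = sum(unitig_length[u] for u in path)
    if total < min_length:
        return False

    first, last = path[0], path[-1]
    # Look for one bridging unitig b outside the contig such that last -> b -> first.
    for b in successors.get(last, set()):
        if b not in path and first in successors.get(b, set()):
            return True
    return False
```

In this toy model the bridging unitig `b` plays the role of the suspected repeat. The graph alone cannot tell whether that repeat is inter-genomic or intra-genomic, which is why the circularity of such a contig would need separate validation.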

Thanks @GaetanBenoitDev! This makes sense to me

I have an additional question about circular contigs: I've noticed that I got some circular contigs with long terminal repeats (between 7.5 and 10 kb), which I assumed would be collapsed when the contig was extracted from the assembly graph. I also have some contigs that contain one or more repetitions of the same unit. Below are the average frequencies of 21-mers for 10 circular contigs of an assembly (FASTA file here):

ctg96262_7x_c	1.0019
ctg93556_5x_c	1.0007
ctg90807_4x_c	1.7587
ctg92161_1x_c	1.4486
ctg86413_0x_c	4.2598
ctg85961_0x_c	6.5512
ctg89183_0x_c	1.4074
ctg85018_0x_c	6.3019
ctg91152_1x_c	1.7125
ctg86752_23x_c	4.3003
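For readers who want to reproduce a number like this, here is a minimal sketch. It assumes "average frequency" means the total number of k-mer occurrences in a contig divided by the number of distinct k-mers. The thread does not state the exact definition, so treat that as an assumption; the function name `mean_kmer_frequency` is also made up for this example. The contig is treated as linear and k-mers are not canonicalised (reverse complements count as different k-mers).

```python
from collections import Counter


def mean_kmer_frequency(seq: str, k: int = 21) -> float:
    """Total k-mer occurrences divided by the number of distinct k-mers."""
    seq = seq.upper()
    n_kmers = len(seq) - k + 1
    if n_kmers <= 0:
        raise ValueError("sequence is shorter than k")
    # Count every k-mer along the sequence (linear, forward strand only).
    counts = Counter(seq[i:i + k] for i in range(n_kmers))
    return n_kmers / len(counts)
```

Under this definition a sequence with no repeated k-mers gives 1.0, and a sequence made of one unit followed by a copy of its first 75% gives a value close to 1.75.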

ctg92161_1x_c is a contig with terminal repeats of ~7.5 kb (that is, it ends with a repetition of the first 7.5 kb of the "unit"). ctg90807_4x_c is a contig with a complete unit plus a repetition of the first 75% of the base pairs of that same unit.
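A terminal repeat of this kind can be checked with a short script. The sketch below only finds exact repeats, where the end of the contig is an identical copy of its start. Real contigs may differ by a few bases between the two copies, in which case an alignment-based check would be needed instead. The function name `terminal_repeat_length` and the `min_len` seed size are assumptions for illustration.

```python
def terminal_repeat_length(seq: str, min_len: int = 20) -> int:
    """Length of the longest exact copy of the contig's start found at its end.

    Returns 0 if no terminal repeat of at least min_len bases exists.
    """
    seq = seq.upper()
    if len(seq) <= min_len:
        return 0
    seed = seq[:min_len]
    # Each later occurrence of the seed is a candidate start of the trailing copy;
    # the earliest candidate that matches gives the longest repeat.
    pos = seq.find(seed, 1)
    while pos != -1:
        if seq.startswith(seq[pos:]):
            return len(seq) - pos
        pos = seq.find(seed, pos + 1)
    return 0
```

For a contig like the first one described above, the function would be expected to return roughly 7,500, provided the two copies are identical.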

Is this sort of assembly expected? I don't see these results in the hifiasm-meta assembly.

I have removed the rescued circular concept in v1.0.