Positive and Negative Sense concatenating together for each contig
BABRIGGS opened this issue · 17 comments
When I assemble various bacterial genomes each contig/genome seems to be doubled in size due to the entire contig being repeated and then concatenated together. One repeat seems to be the positive sense strand while the other seems to be the negative sense strand. This has happened to circular and linear contigs ranging from 16Kb to 10Mb.
Is there a way to combat this or correct it?
I have included the entire Flye log with this post.
Hello,
In your email you mentioned a 5Mb duplication, but your assembly size is 1.9 Mb. Is this the same dataset?
Which contigs have duplications and how did you detect this duplication?
Best,
Misha
If you have two bacteria in the sample, it is likely that Flye actually recovers two separate genomes. What exactly do you mean by doubled and how do you quantify this?
Hi Barrett,
Sorry for my late response. I am not sure I 100% understand what you mean by doubling. How do you quantify this? Is there output from a tool that leads you to conclude there is doubling? Can you share it and describe your interpretation?
Flye in general should not produce extensive duplications, and I have not seen anything like this before. Could you please produce dot-plots of the large contigs (1 Mb+) that you think are duplicated, e.g. with tools like MashMap or Gepard (https://cube.univie.ac.at/gepard)?
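As a rough complement to a dot-plot, the "second half is the reverse complement of the first half" pattern described in the original post can be checked with a few lines of Python. This is only a sketch under stated assumptions: the function names and the choice of k are hypothetical, the contig is assumed to be already loaded as a plain string, and shared k-mers are a crude stand-in for a real alignment.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (other characters are left as-is)."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]


def kmer_set(seq, k):
    """Collect every distinct k-mer in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def inverted_duplication_score(contig, k=15):
    """Fraction of distinct k-mers in the first half of the contig whose
    reverse complement occurs somewhere in the second half.

    Hypothetical helper for illustration; not part of Flye.
    """
    mid = len(contig) // 2
    first_kmers = kmer_set(contig[:mid], k)
    if not first_kmers:
        return 0.0
    second_kmers = kmer_set(contig[mid:], k)
    hits = sum(1 for kmer in first_kmers if reverse_complement(kmer) in second_kmers)
    return hits / len(first_kmers)
```

A score close to 1.0 would be consistent with the contig being a sequence followed by its own reverse complement, while a score near 0.0 would argue against it. Sequences rich in self-complementary repeats can also score high, so a dot-plot is still the more informative check.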
Thank you, this info is very helpful!
I would treat duplications in plasmids and in chromosomal sequences separately. Duplication of plasmids is indeed a known issue with Flye. It is mostly relevant to circular sequences under 100 kb. Here is a nice writeup by Ryan Wick that covers plasmids, but has some other useful info: https://rrwick.github.io/2020/10/30/guide-to-bacterial-genome-assembly.html
It is strange however that plasmids are duplicated into the opposite strand. Is it possible that these are linear plasmids, with some kind of inverted terminal repeats? And maybe captured at a weird stage of their replication cycle?
I have the same suspicion about your chromosome. In general, it is definitely unexpected for Flye to make artificial duplications of 100kb+. Could these be inverted terminal repeats?
It would be helpful if you can also upload the Bandage visualization of assembly_graph.gfa and assembly_info.txt file.
Thanks,
Misha
I have uploaded the info txt file; however, GitHub will not let me upload the gfa file.
assembly_info.txt
The info file also marks as circular several plasmids that are supposed to be linear (contigs 29, 35, 34, 36, 32, and 38). Contig 37 should be the only circular one. Contigs 17, 18, 19, 20, 21, and 22 make up a group of highly similar circular plasmids. We expected to have difficulty assembling them, so I am not as concerned with those, but the others should not be reported as circular. Could that be why they are showing up duplicated? If that is the case, is there a way to fix that?
I am highly doubtful that these are ITRs, as there are duplicated genes that should not be at the ends of the DNA strands. It looks more like the entire contig is duplicated/repeated.
I will take a look at RRwick's info.
Thanks so much again,
Barrett
Thanks Barrett,
For gfa - I don't need the actual file, but could you please use this tool to visualize and just post the image? https://github.com/rrwick/Bandage
From the assembly_info, I see that there is definitely a mix of linear and circular. What I would also do is try visualizing read alignments against the assembly in IGV. You can check the alignments around the areas where you expect the chromosome to end. If it's an artifact, you should see no reads, or only a few, spanning these positions. Feel free to post those as well.
Barrett - you'll need to realign the original reads against the assembly and use the resulting BAM file as the IGV input.
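The same check can be made numerically once the reads are realigned. The sketch below counts how many alignments fully span a suspected junction. It assumes the alignment intervals for one contig have already been extracted from the BAM file as (start, end) reference coordinates (for example with a library such as pysam, which is not shown here). The function name and the 500 bp flank are hypothetical choices for illustration.

```python
def count_spanning_reads(alignments, position, flank=500):
    """Count alignments that cover the whole window [position - flank, position + flank].

    alignments: iterable of (start, end) reference coordinates on a single contig,
                0-based and half-open (an assumption of this sketch).
    position:   coordinate of the suspected junction, e.g. the midpoint where the
                forward copy is thought to meet the reverse-complement copy.
    flank:      number of bases a read must extend on each side to count as spanning.
    """
    left = position - flank
    right = position + flank
    return sum(1 for start, end in alignments if start <= left and end >= right)


# Toy intervals, made up for illustration only.
example_alignments = [(0, 3000), (900, 1400), (1200, 5000), (100, 1600)]
spanning = count_spanning_reads(example_alignments, position=1000, flank=500)
```

If the count at the suspected junction is zero or far below the typical coverage elsewhere on the contig, that would point toward an assembly artifact. A count in line with the surrounding coverage would suggest the junction is real.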
I've seen a similar pattern with one of my assemblies as well. The Bandage image is below; edges 2 and 3 are the contigs of interest. I have screenshots of a BLAST dot-plot (self to self) and coverage from Minimap visualized in Geneious. Edge 2 is the first pair of images. Edge 3 is the second pair of images.
Thanks- Adam
Adam - these could be self-complementary repeats (e.g. ATATATATA).
Closing due to inactivity - feel free to follow up if you have more questions!