mikolmogorov/Flye

Positive and Negative Sense concatenating together for each contig

BABRIGGS opened this issue · 17 comments

flye.log

When I assemble various bacterial genomes each contig/genome seems to be doubled in size due to the entire contig being repeated and then concatenated together. One repeat seems to be the positive sense strand while the other seems to be the negative sense strand. This has happened to circular and linear contigs ranging from 16Kb to 10Mb.

Is there a way to combat this or correct it?

I have included the entire Flye log with this post.

Hello,

In your email you mentioned a 5Mb duplication, but your assembly size is 1.9 Mb. Is this the same dataset?

Which contigs have duplications and how did you detect this duplication?

Best,
Misha

If you have two bacteria in the sample, it is likely that Flye actually recovers two separate genomes. What exactly do you mean by doubled and how do you quantify this?

Hi Barrett,

Sorry for my late response. I am not sure I 100% understand what you mean by doubling. How do you quantify this? Is there output from a tool that leads you to conclude there is doubling? Can you share it and describe your interpretation?

Flye in general should not produce extensive duplications and I have not seen anything like this before. Could you please produce dot-plots of the large contigs (1 Mb+) that you think are duplicated? E.g. with tools like MashMap or Gepard (https://cube.univie.ac.at/gepard)?
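Editor's note: before producing a dot-plot, the pattern described in the report (a contig followed by its own reverse complement) can be screened for with a quick check. The sketch below is a minimal illustration and not part of Flye; the function names are hypothetical, and it compares bases position by position, so it only gives a meaningful answer when the two halves have no insertions or deletions relative to each other. Real assemblies will usually need an alignment-based dot-plot as requested above.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (other characters are kept as-is)."""
    complement = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(complement)[::-1]


def inverted_duplication_identity(contig):
    """Fraction of positions at which the second half of the contig matches
    the reverse complement of the first half.

    Hypothetical helper: a value near 1.0 suggests the contig is a sequence
    followed by its own reverse complement. Indels between the halves will
    lower the value sharply, because no alignment is performed.
    """
    half = len(contig) // 2
    if half == 0:
        return 0.0
    first = contig[:half].upper()
    second = contig[len(contig) - half:].upper()
    expected = reverse_complement(first)
    matches = sum(a == b for a, b in zip(expected, second))
    return matches / half


# Toy sequence built for illustration only (not data from this issue).
unit = "ACGGTTCA"
doubled = unit + reverse_complement(unit)
print(inverted_duplication_identity(doubled))
```

A sequence of the form X + reverse_complement(X) is identical to its own reverse complement, which is why such contigs show an anti-diagonal running across the whole self-versus-self dot-plot.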

Hi @BABRIGGS

Unfortunately, I can't see the uploaded images. I see that you responded via email - could you please try to add the images via GitHub (#674)?

Thanks,
Misha

I have attached the images and Flye log below:

EX1:
ex1

EX2:
ex2

EX3:
ex3

Flye Log
flye.log

Thank you, this info is very helpful!

I would treat duplications in plasmids and in chromosomal sequences separately. Duplication of plasmids is indeed a known issue with Flye. It is mostly relevant to circular sequences under 100 kb. Here is a nice writeup by Ryan Wick that covers plasmids, but has some other useful info: https://rrwick.github.io/2020/10/30/guide-to-bacterial-genome-assembly.html

It is strange however that plasmids are duplicated into the opposite strand. Is it possible that these are linear plasmids, with some kind of inverted terminal repeats? And maybe captured at a weird stage of their replication cycle?

I have the same suspicion about your chromosome. In general, it is definitely unexpected for Flye to make artificial duplications of 100 kb+. Could these be inverted terminal repeats?

It would be helpful if you can also upload the Bandage visualization of assembly_graph.gfa and assembly_info.txt file.

Thanks,
Misha

I have uploaded the info txt file; however, GitHub will not let me upload the gfa file.
assembly_info.txt

The info file also reports plasmids that are supposed to be linear as circular (contigs 29, 35, 34, 36, 32, and 38). Contig 37 should be the only circular one. Contigs 17, 18, 19, 20, 21, and 22 make up a group of highly similar circular plasmids. We expected to have difficulty assembling them, so I am not as concerned with those, but the others should not be reported as circular. Could that be why they are showing up duplicated? If that is the case, is there a way to fix that?

I am skeptical that these are ITRs, as the duplicated regions include genes that should not be at the ends of the DNA strands. It looks more like the entire contig is duplicated/repeated.

I will take a look at RRwick's info.

Thanks so much again,
Barrett
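Editor's note: the circular/linear calls discussed above come from the `circ.` column of `assembly_info.txt`. The sketch below shows one way to list the contigs marked circular. It assumes the tab-separated layout with a header line starting `#seq_name`, `length`, `cov.`, `circ.`; the function name is hypothetical, and accepting both `Y` and `+` as the circular marker is an assumption meant to cover differences between Flye versions. The example table is made up for illustration and is not taken from the attached file.

```python
import csv
import io


def circular_contigs(info_text):
    """Return (name, length) for every contig marked circular.

    Assumes a tab-separated table whose header contains the columns
    '#seq_name', 'length' and 'circ.'. Treating 'Y' or '+' as circular
    is an assumption about how different Flye versions write this column.
    """
    reader = csv.DictReader(io.StringIO(info_text), delimiter="\t")
    circular = []
    for row in reader:
        if row["circ."] in ("Y", "+"):
            circular.append((row["#seq_name"], int(row["length"])))
    return circular


# Made-up example table (illustrative values only).
example = (
    "#seq_name\tlength\tcov.\tcirc.\trepeat\tmult.\talt_group\tgraph_path\n"
    "contig_1\t50000\t40\tN\tN\t1\t*\t1\n"
    "contig_2\t16000\t120\tY\tN\t1\t*\t2\n"
)
print(circular_contigs(example))
```

Comparing this list with the replicons expected to be linear makes it easier to see which circular calls are unexpected.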

Thanks Barrett,

For gfa - I don't need the actual file, but could you please use this tool to visualize and just post the image? https://github.com/rrwick/Bandage

From the assembly_info, I see that there is definitely a mix of linear and circular. What I would also do is try visualizing read alignments against the assembly in IGV. You can check the alignments around the positions where you expect the chromosome to end. If the duplication is an artifact, you should see few or no reads spanning these positions. Feel free to post those as well.
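Editor's note: the check described above (do reads span the suspected junction?) can also be expressed as a simple count, in addition to looking at it in IGV. The sketch below is a minimal illustration that works on plain `(start, end)` alignment intervals on one contig; the function name, the 0-based coordinates, and the `flank` default of 1000 bases are all assumptions chosen for the example. Extracting the intervals from a BAM file is not shown.

```python
def count_spanning_reads(alignments, position, flank=1000):
    """Count alignments that cover `position` with at least `flank` bases on both sides.

    `alignments` is an iterable of (start, end) reference coordinates
    (assumed 0-based, end-exclusive) on a single contig. The flank
    requirement avoids counting reads that merely touch the junction.
    """
    return sum(
        1
        for start, end in alignments
        if start <= position - flank and end >= position + flank
    )


# Made-up intervals for illustration; the suspected junction is at 5000.
reads = [(0, 5000), (4000, 9000), (4900, 5100), (6000, 12000)]
print(count_spanning_reads(reads, position=5000))
```

A count close to the average coverage suggests the junction is supported by reads, while a count near zero is what would be expected for an assembly artifact.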

Here is my bandage file for the assembly that I sent the dot plots for.
graph.

Our lab is not familiar with IGV. Besides the assembly file, what files should I be using for the read alignments to visualize this?

Thanks,
Barrett

Barrett - you'll need to realign the original reads against the assembly and use the resulting BAM file as the IGV input.
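Editor's note: one common way to produce such a BAM file is minimap2 followed by samtools. The thread does not say which tools or read type were used, so the sketch below is an assumption throughout: the file names are placeholders, the `map-ont` preset assumes Oxford Nanopore reads (other minimap2 presets exist for PacBio data), and the helper only builds the command strings rather than running them.

```python
import shlex


def alignment_commands(assembly, reads,
                       out_bam="reads_vs_assembly.bam", preset="map-ont"):
    """Build shell commands that align reads to an assembly and index the result.

    Hypothetical helper: it returns command strings and does not execute them.
    The preset is an assumption about the sequencing technology.
    """
    align = ["minimap2", "-a", "-x", preset, assembly, reads]  # SAM output to stdout
    sort = ["samtools", "sort", "-o", out_bam, "-"]            # sorted BAM from stdin
    index = ["samtools", "index", out_bam]                     # writes the .bai index
    return [
        shlex.join(align) + " | " + shlex.join(sort),
        shlex.join(index),
    ]


# Placeholder file names for illustration.
for command in alignment_commands("assembly.fasta", "reads.fastq"):
    print(command)
```

In IGV, the assembly FASTA would then be loaded as the genome and the sorted, indexed BAM as the alignment track.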

I've seen a similar pattern with one of my assemblies as well. The Bandage image is below; edges 2 and 3 are the contigs of interest. I have screenshots of a BLAST dot-plot (self to self) and coverage from Minimap visualized in Geneious. Edge 2 is the first pair of images. Edge 3 is the second pair of images.

contig 2 coverage
Contig 2 dotplot
contig 3 coverage
contig 3 dotplot
graph

Thanks- Adam

Adam - these could be self-complementary repeats (e.g. ATATATATA).

Closing due to inactivity - feel free to follow up if you have more questions!