malonge/RaGOO

Potential chimeric contig persists after RaGOO

gbdias opened this issue · 8 comments

Hi Michael,

  • I've used RaGOO to scaffold my PacBio contigs against a Sanger reference genome. Here is my command:
ragoo.py -b -g 100 -i 0.6 -C "$ASMN" "$REFN"
  • Here is the alignment between the reference I used (y-axis) and the resulting RaGOO scaffolds (x-axis).

s160_2_edited_to_ragoo paf plot

  • Everything looks correct except for the highlighted region at the end of chromosome E. Below is a zoom of that region:

Screen Shot 2019-11-13 at 3 42 52 PM

  • I am confused as to why I still see a difference between the arrangement of my scaffolds and the reference I used.
  • I understand that my assembly has some extra sequence that is not present in the reference, but I don't understand why that inverted segment on the right side is not placed in the correct spot and orientation.

Let me know if you have any tips to help me understand this.

Hi there,

I have 3 recommendations for you:

  1. Try lowering the values for -d and -c. These command line arguments are currently hidden, but I suggest you try -d 100000 -c 100000 to get higher sensitivity with breaking.

  2. You may need to run ragoo a few times iteratively to make this correction, since it only makes one break per contig. Just take the broken contigs from the intermediate output and use those as input into another round of ragoo.

  3. If you have short reads or error-corrected long reads, the best option is to use -T -R to correct misassemblies. This mode generally works better than -b, though improvements to -b are currently on my TODO list so as to avoid these sorts of things in the future.

Thanks for the tips!

  • I couldn't find a description of -d and -c behavior. Could you share some info about their meaning?

  • Running RaGOO a second time on the <prefix>.intra.chimera.broken.fa (without -d and -c) did not change the result.

  • Running RaGOO with -d and -c set to 100000 works and produces the expected outcome, increasing the number of intrachromosomal breaks from 17 to 48. 👍

Yeah these have been hidden parameters, but I am going to add them to the usage message now.

When breaking intrachromosomally chimeric contigs, the distance between consecutive alignments is measured. If that distance exceeds a certain threshold, the contig is deemed chimeric.

This distance can be with respect to the reference or the contig. -d sets the threshold for the distance with respect to the reference and is set to 2 Mbp by default. -c is with respect to the contig, and that is 1 Mbp by default.

In other words, if two consecutive alignments are > 1 Mbp apart with respect to the contig or > 2 Mbp apart with respect to the reference, RaGOO will break the contig.

These are clearly quite large numbers and are conservative by default.
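
For anyone reading along, here is a minimal Python sketch of the rule described above. It is not RaGOO's actual code: the `Alignment` tuple, the `find_chimeric_gaps` function, and the way the distances are measured (end of one alignment to start of the next) are all assumptions made for illustration. Only the two thresholds, their defaults, and the "either one triggers a break" logic come from this thread.

```python
from typing import List, NamedTuple, Tuple


class Alignment(NamedTuple):
    """Hypothetical record for one contig-to-reference alignment."""
    contig_start: int
    contig_end: int
    ref_start: int
    ref_end: int


def find_chimeric_gaps(
    alignments: List[Alignment],
    max_ref_dist: int = 2_000_000,     # plays the role of -d (default 2 Mbp)
    max_contig_dist: int = 1_000_000,  # plays the role of -c (default 1 Mbp)
) -> List[Tuple[Alignment, Alignment]]:
    """Return pairs of consecutive alignments whose gap exceeds either threshold."""
    # Walk the alignments in contig order (an assumption of this sketch).
    ordered = sorted(alignments, key=lambda a: a.contig_start)
    flagged = []
    for left, right in zip(ordered, ordered[1:]):
        contig_dist = right.contig_start - left.contig_end
        ref_dist = abs(right.ref_start - left.ref_end)
        # Exceeding either threshold is enough to flag the gap.
        if contig_dist > max_contig_dist or ref_dist > max_ref_dist:
            flagged.append((left, right))
    return flagged


# Toy alignments for a single contig (made-up coordinates).
toy = [
    Alignment(0, 500_000, 0, 500_000),
    Alignment(600_000, 1_000_000, 900_000, 1_300_000),
    Alignment(1_100_000, 1_500_000, 5_000_000, 5_400_000),
]

default_gaps = find_chimeric_gaps(toy)
sensitive_gaps = find_chimeric_gaps(toy, max_ref_dist=100_000, max_contig_dist=100_000)
print(len(default_gaps), len(sensitive_gaps))
```

In this toy case the 400 kbp jump on the reference between the first two alignments passes under the defaults but is flagged once both thresholds are lowered to 100000, which is the kind of sensitivity increase reported earlier in the thread.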

The parameters are no longer hidden. Thanks for bringing it up.

Hi @malonge, I have one additional question regarding the behaviour of -c and -d.
Do both -c and -d need to be met for the contig to be broken, or is it at least one of them?

Hi there,

Just one of them needs to be met.

Another related question: when the gap between adjacent alignments exceeds the threshold, where does RaGOO introduce the break? At the beginning or end of the gap?

Screen Shot 2019-12-17 at 10 20 21

I believe that it would break at the left breakpoint in your example.