natir/yacrd

Order of Chimera detection and scrubbing

Opened this issue · 2 comments

Hi

I have been given some FASTQ files, demultiplexed with Guppy, and I would like to check them for chimeric reads. This is amplicon data from FMD virus (amplicon size 400 bp). The genome of this mRNA virus is 8.3 kb long (I know you said here that yacrd is only for direct DNA sequencing, and ours is cDNA). I am looking for best practices for analyzing this data, and I would like to understand whether I should run the chimera detection step first, followed by read scrubbing, or vice versa. I saw in an issue here that scrubbing tends to split some chimeric reads. So I was thinking of first performing chimera detection, then splitting those reads as suggested in the issue above, and then performing scrubbing on the resulting FASTQ file?

The issue is that we don't know whether to expect chimeric reads, because we are still experimenting, and we would like to know if there are such reads. So this is an exploratory question.

I tried running the scrubbing and chimera detection commands separately. I can see that scrubbing also detects some chimeras (in the .yacrd file), so I decided to check whether these chimeras are the same as the ones we get when running the chimera detection command. But there is a small overlap.

Chimera_code: reads tagged as chimeric when running the chimera code only
ChimeraFrmScrub: reads tagged as chimeric when running the scrubb code only
ScrubChimNotCov: reads tagged as chimeric plus reads tagged as NotCovered when running the scrubb code only

The numbers in the Venn diagram are read counts.
[Image: Venn diagram of the Chimera_code, ChimeraFrmScrub and ScrubChimNotCov read sets]
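For anyone wanting to reproduce this kind of comparison, here is a minimal sketch in Python. It assumes each .yacrd report is tab-separated, with the read label (for example Chimeric or NotCovered) in the first column and the read ID in the second; please check this against your own files. The report contents below are made up for illustration, and the helper name `read_ids_by_label` is hypothetical, not part of yacrd.

```python
def read_ids_by_label(report_lines, labels):
    """Collect read IDs whose label (first column) is in `labels`."""
    ids = set()
    for line in report_lines:
        fields = line.rstrip("\n").split("\t")
        # Assumed layout: label <TAB> read ID <TAB> length <TAB> bad regions
        if len(fields) >= 2 and fields[0] in labels:
            ids.add(fields[1])
    return ids


# Made-up report contents standing in for the two .yacrd files.
chimera_report = (
    "Chimeric\tread1\t1000\t50,400,450\n"
    "Chimeric\tread2\t1200\t60,500,560\n"
).splitlines()
scrub_report = (
    "Chimeric\tread2\t1200\t60,500,560\n"
    "NotCovered\tread3\t800\t700,0,700\n"
).splitlines()

# The three sets compared in the Venn diagram.
chimera_code = read_ids_by_label(chimera_report, {"Chimeric"})
chimera_from_scrub = read_ids_by_label(scrub_report, {"Chimeric"})
scrub_chim_not_cov = read_ids_by_label(scrub_report, {"Chimeric", "NotCovered"})

shared = chimera_code & chimera_from_scrub
only_chimera_code = chimera_code - scrub_chim_not_cov
print(len(shared), len(only_chimera_code))
```

With real data, the lines would come from `open("report.yacrd")` instead of the inline strings.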

From the Venn diagram, can I conclude that the scrubbing operation, though lossy, is sufficient on its own to take care of both chimeric reads and poor-quality reads? If so, running chimera removal first and then scrubbing would be not only time-consuming but also redundant?

natir commented

Hello,

First of all, thank you for using yacrd.

In fact, the yacrd algorithm is divided into three steps: detection of poor-quality zones (that is, zones with poor coverage), characterization of these zones, and finally processing of these zones.

The characterization of poor-quality zones is based above all on their position: a zone in the middle of a read is considered chimeric, and so is the read.
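To make the position rule concrete, here is a toy sketch in Python. It is not yacrd's actual implementation: the function name `classify_read`, the half-open `(begin, end)` interval convention, and the NotCovered threshold of 0.8 are all assumptions made for illustration.

```python
def classify_read(read_length, bad_zones, not_covered_fraction=0.8):
    """Toy classifier for one read.

    bad_zones is a list of (begin, end) half-open intervals of poor coverage.
    not_covered_fraction is an assumed threshold, not taken from this thread.
    """
    bad_total = sum(end - begin for begin, end in bad_zones)
    # Mostly poor coverage: the whole read is considered not covered.
    if bad_total > not_covered_fraction * read_length:
        return "NotCovered"
    # A bad zone touching neither end sits in the middle of the read.
    for begin, end in bad_zones:
        if begin > 0 and end < read_length:
            return "Chimeric"
    return "NotBad"


print(classify_read(1000, [(400, 450)]))  # internal zone
print(classify_read(1000, [(0, 50)]))     # zone at the start only
```

The point of the sketch is only that the label depends on where the bad zone lies, not on which processing step is run afterwards.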

Between scrubbing and chimera management, it's only the last step that changes. The .yacrd file is produced by the second step.

If you manage chimeras by splitting, the reads will be cut by yacrd. If you use scrubbing, these same reads will also be cut, and in addition the poor-quality ends of these reads will be deleted, since scrubbing removes all poor-quality regions.
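The difference between the two modes can be illustrated with a small Python sketch. This is a simplified model under stated assumptions, not yacrd's code: `kept_segments` is a hypothetical helper, bad zones are half-open `(begin, end)` intervals, and "split" is modelled as cutting only at zones that touch neither end of the read.

```python
def kept_segments(read_length, bad_zones, mode):
    """Return the (begin, end) intervals kept for mode 'split' or 'scrub'."""
    if mode not in ("split", "scrub"):
        raise ValueError("mode must be 'split' or 'scrub'")
    cuts = sorted(bad_zones)
    if mode == "split":
        # Assumed behaviour: only zones in the middle of the read are cut.
        cuts = [(b, e) for b, e in cuts if b > 0 and e < read_length]
    kept = []
    cursor = 0
    for begin, end in cuts:
        if begin > cursor:
            kept.append((cursor, begin))
        cursor = max(cursor, end)
    if cursor < read_length:
        kept.append((cursor, read_length))
    return kept


# One read with poor-quality zones at both ends and in the middle.
zones = [(0, 40), (480, 520), (950, 1000)]
print(kept_segments(1000, zones, "split"))  # cut at the middle zone only
print(kept_segments(1000, zones, "scrub"))  # middle zone and both ends removed
```

In both modes the read is cut at the same internal zone; scrubbing additionally trims the ends.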

To answer your question, which, if I've understood correctly, is whether to run chimera splitting followed by scrubbing or whether scrubbing alone is enough: scrubbing cuts out the chimeras, so we can dispense with chimera splitting. However, remapping after chimera removal could result in better-quality scrubbing. It's up to you to decide whether this potential increase in quality justifies the extra mapping and analysis time.

If you have any interesting results, I'd be happy to add your recommendations for use to the yacrd readme, indicating that it's your contribution and citing any publications.

Thanks again for your interest and contribution.