HumanCellAtlas/skylab

Running Optimus on Read1 with length greater than 26 base pairs

jychien opened this issue · 1 comment

I am looking to run Optimus on a set of fastq files downloaded from the SRA. This particular dataset has Read1 and Read2 lengths of 150bp. Does Optimus require that the fastq files have extra base pairs clipped off? Does Read1 have to be 26 base pairs in length for 10X v2?

Thank you!

This resulted in a conversation via email. To follow up for future reference, the general answer was:

It is very unusual for a 10X dataset, like the ones that Optimus processes, to have a different read structure than the standard one. There are ways in which this can happen (for example, if the sample is multiplexed with other libraries that themselves require longer sequencing), but it is still unusual. I would double-check that the data are indeed what you think they are.

Regarding Optimus handling these reads: Optimus should only use the bases it needs from Read1 according to the chemistry settings, and the entire length of Read2. It's possible that we have put in a check that the overall read length is as expected, in which case Optimus will fail, but if that happens it's straightforward to add a flag to allow longer reads.
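To illustrate what "only the bases it needs" means for Read1, here is a minimal Python sketch. It assumes the standard 10X v2 layout of a 16 bp cell barcode followed by a 10 bp UMI (the 26 bp mentioned in the question). The function name `split_read1` is hypothetical and this is not Optimus's actual code; it only shows that any bases beyond position 26 can simply be ignored.

```python
# Assumed 10X v2 Read1 layout: 16 bp cell barcode, then 10 bp UMI (26 bp total).
CELL_BARCODE_LENGTH = 16
UMI_LENGTH = 10


def split_read1(sequence, barcode_length=CELL_BARCODE_LENGTH, umi_length=UMI_LENGTH):
    """Return (cell_barcode, umi) from the start of Read1, ignoring extra bases."""
    needed = barcode_length + umi_length
    if len(sequence) < needed:
        raise ValueError(
            f"Read1 has {len(sequence)} bases but at least {needed} are required"
        )
    cell_barcode = sequence[:barcode_length]
    umi = sequence[barcode_length:needed]
    return cell_barcode, umi


# A 150 bp Read1: only the first 26 bases are used.
read1 = "ACGTACGTACGTACGT" + "TTTTGGGGCC" + "A" * 124
cell_barcode, umi = split_read1(read1)
print(cell_barcode, umi)
```

A read shorter than 26 bp raises an error here, while a longer one is accepted. Whether Optimus itself rejects longer reads is, as noted above, not confirmed.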

Regarding Read2: to the best of my knowledge, the underlying aligner (STAR) uses soft-clipping, and I just checked that Optimus doesn't turn that off. If your reads are longer than your insert size, you will probably end up reading into the adaptor, and STAR will soft-clip those sequences. Keep in mind, however, that too much adaptor sequence can lower alignment scores in general, so you can either use TrimGalore to remove the adaptor or just hard-clip the reads.
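For the hard-clipping option, a minimal Python sketch follows. It assumes plain four-line FASTQ records (no wrapped sequence lines). The function name `hard_clip_fastq` and the cutoff of 5 bases in the example are purely illustrative; the thread does not recommend a specific length, and this is not part of Optimus.

```python
import io


def hard_clip_fastq(in_handle, out_handle, max_length):
    """Truncate the sequence and quality lines of each FASTQ record to max_length."""
    while True:
        header = in_handle.readline()
        if not header:
            break  # end of file
        sequence = in_handle.readline().rstrip("\n")
        separator = in_handle.readline()
        quality = in_handle.readline().rstrip("\n")
        out_handle.write(header)
        out_handle.write(sequence[:max_length] + "\n")
        out_handle.write(separator)
        # The quality string must be clipped to the same length as the sequence.
        out_handle.write(quality[:max_length] + "\n")


# Small in-memory example; for real files, pass handles from open() or gzip.open(..., "rt").
source = io.StringIO("@r1\nACGTACGT\n+\nIIIIIIII\n")
clipped = io.StringIO()
hard_clip_fastq(source, clipped, max_length=5)
print(clipped.getvalue())
```

Hard-clipping removes bases regardless of content, whereas an adaptor trimmer such as TrimGalore removes only the bases that match the adaptor sequence.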