epi2me-labs/wf-single-cell

Concerns with logic behind stringtie step

Closed this issue · 2 comments


I have been using this tool and reading the documentation thoroughly, and I have become concerned about the stringtie step. My understanding from the documentation and the source code is that stringtie is run to generate a de novo transcriptome from the long-read single-cell data, and known transcripts are then named using gffcompare. While I agree that it is important to account for transcripts that haven't been annotated, I have several concerns with this method.

  1. This strategy gives each sample its own transcriptome, which makes between-sample comparisons impossible. For example, if a novel transcript is found in two samples, it will not be given the same name in both, because it isn't in the supplied GTF. How can you then compare these transcripts downstream? To me, this is best illustrated by best practice in ATAC-seq analysis: you call peaks on each sample individually, but before downstream analysis you build a consensus peak set and recount the reads for every sample against that consensus set. The current strategy (especially with the stringtie GTF file being hidden) makes this impossible.
  2. Single-cell RNA-seq is very sparse and will likely contain artifacts from the library prep. To get around this, our group has previously generated a GTF file from a bulk long-read RNA-seq experiment. This should identify the same de novo transcripts as the single-cell data but will likely have fewer artifacts. Another option would be to generate a GTF file from all single-cell libraries combined (after collapsing UMIs and finding consensus sequences) and then use that GTF file for transcriptome mapping. In either case, the stringtie step would not be required, because a GTF file containing the de novo transcripts would already exist. All samples would then use the same GTF file, and downstream analysis would not be affected.
  3. Stringtie is very computationally expensive and, judging by #22, has been failing for many users of this pipeline. Repeating this step independently for each sample is a huge waste of computational resources.
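
To make the first point concrete, here is a minimal Python sketch of the consensus idea, not anything taken from the workflow itself. It assumes each transcript can be reduced to a (chromosome, strand, intron chain) tuple and that two transcripts are "the same" only when those tuples match exactly; real tools are more tolerant of small boundary differences. The sample-local IDs, the `CONS.` prefix, and the function names are all hypothetical.

```python
from collections import defaultdict

# Toy per-sample assemblies. Keys are sample-local transcript IDs, as a
# per-sample de novo run would produce; values are (chrom, strand, introns),
# where each intron is a (start, end) pair.
sample_a = {
    "STRG.1.1": ("chr1", "+", ((100, 200), (300, 400))),
    "STRG.2.1": ("chr1", "+", ((500, 600),)),
}
sample_b = {
    "STRG.7.1": ("chr1", "+", ((100, 200), (300, 400))),  # same structure as STRG.1.1 in A
    "STRG.9.1": ("chr2", "-", ((50, 80),)),
}


def build_consensus(samples):
    """Give each distinct structure one shared ID and record which samples contain it."""
    consensus_ids = {}
    support = defaultdict(set)
    for sample_name, transcripts in sorted(samples.items()):
        for _local_id, structure in sorted(transcripts.items()):
            if structure not in consensus_ids:
                consensus_ids[structure] = f"CONS.{len(consensus_ids) + 1}"
            support[consensus_ids[structure]].add(sample_name)
    return consensus_ids, dict(support)


def rename(transcripts, consensus_ids):
    """Translate sample-local IDs into the shared consensus IDs."""
    return {local_id: consensus_ids[structure]
            for local_id, structure in transcripts.items()}


samples = {"A": sample_a, "B": sample_b}
consensus_ids, support = build_consensus(samples)
renamed_a = rename(sample_a, consensus_ids)
renamed_b = rename(sample_b, consensus_ids)

# The shared novel transcript now carries one ID in both samples.
print(renamed_a["STRG.1.1"], renamed_b["STRG.7.1"])
print(sorted(support[renamed_a["STRG.1.1"]]))
```

Without a step like this, the two local IDs for the shared transcript cannot be matched by name, which is the comparison problem described in point 1. With a single GTF supplied up front, as in point 2, the renaming step is unnecessary because every sample is quantified against the same IDs from the start.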

I would suggest either removing the stringtie step or making it optional and off by default. You can provide documentation recommending stringtie, but given the problems listed above, it would be best if users had to turn this option on deliberately, so that they are aware that each sample will have a unique transcriptome and cannot be compared directly with the others.

We are already considering removing the stringtie steps. This will be done after a current batch of work on simplifying and optimising other steps of the workflow.

That's great! I'm glad to hear it. Thanks