kharchenkolab/dropEst

Add support for stranded libraries?

Opened this issue · 8 comments

I'm wondering how feasible it would be to create gene x cell matrices separated by strand for stranded libraries. In other words, dropEst currently summarizes coverage over a given gene using any read that aligned there (I think), rather than only those in the sense direction. If we are interested in antisense expression or want more confident gene-level summaries, having strand capabilities would be nice.

Do you think it would be doable to add that functionality?

In general, a read can align to either of the two strands, and a GTF file records the strand of each gene. So you want to split the counts by alignment direction, right? It's possible to add such functionality to the pipeline, though I'm not sure how many people would use it.
Anyway, at the moment the easiest way to do that is to split your BAM file in two by strand and run dropEst on each part with a low threshold on the minimal number of genes per cell. That would give you two strand-specific count matrices.
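As a rough illustration of the split described above, here is a minimal Python sketch that partitions SAM text records by strand using the reverse-strand bit (0x10) of the FLAG field. It assumes single-end reads and plain SAM text as input; the function name `split_sam_by_strand` is hypothetical and is not part of dropEst.

```python
def split_sam_by_strand(sam_lines):
    """Partition SAM text records into header, forward- and reverse-strand lists.

    Assumes single-end reads: the strand is taken from FLAG bit 0x10 only.
    Unmapped reads (FLAG bit 0x4) have no strand and are dropped.
    """
    header, forward, reverse = [], [], []
    for line in sam_lines:
        if line.startswith("@"):
            # Header lines belong in both output files.
            header.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # FLAG is the second SAM column
        if flag & 0x4:
            continue
        if flag & 0x10:
            reverse.append(line)
        else:
            forward.append(line)
    return header, forward, reverse


records = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "r2\t16\tchr1\t200\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
]
header, forward, reverse = split_sam_by_strand(records)
```

On real BAM files the same filter is usually applied with `samtools view`, using `-F 16` to keep forward-strand alignments and `-f 16` to keep reverse-strand ones.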

Our lab would be very interested in having this functionality added to the tool. Do you think it might be possible to account for strandedness?

@jggatter, it shouldn't be that hard to add: a new flag in the dropest CLI, a new field in the resulting rds file, and some ifs in the BAM parsing code. Still, at the moment we're busy publishing two papers, so, to be honest, I won't have time to do it myself. I'm ready to provide any help and advice if you'd like to implement it, though.

I'd be interested in taking a look but currently I have several other priorities for my project! I may or may not reach out to you in the distant future. Thanks!

I just wanted to chime in on this matter, if you don't mind. I recently discovered that when using the -V option to produce the matrices for velocyto, some genes that are nested in an intron of another gene (but on a different strand) were not counted correctly for intronic sequences. So if I understood the above conversation correctly, the current best way is to split the bam file and run dropEst twice, with the transcriptome annotation also split into two? (I was using InDrops)

> I'd be interested in taking a look but currently I have several other priorities for my project! I may or may not reach out to you in the distant future. Thanks!

Sure. Feel free to contact me whenever you decide.

> So if I understood the above conversation correctly, the current best way is to split the bam file and run dropEst twice, with the transcriptome annotation also split into two?

Yep. And you also need to split the annotation in two.
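A minimal Python sketch of splitting an annotation by strand, assuming a standard tab-separated GTF where column 7 holds `+` or `-`. The function name `split_gtf_by_strand` is hypothetical and is not part of dropEst. Which strand-specific annotation should be paired with which half of the alignments depends on the library protocol, so that pairing is left to the user here.

```python
def split_gtf_by_strand(gtf_lines):
    """Split GTF lines into plus-strand and minus-strand lists.

    Comment lines (starting with '#') are copied to both outputs.
    Records with a strand other than '+' or '-' are skipped.
    """
    plus, minus = [], []
    for line in gtf_lines:
        if line.startswith("#"):
            plus.append(line)
            minus.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            # Not a well-formed GTF record; ignore it.
            continue
        strand = fields[6]  # strand is the seventh GTF column
        if strand == "+":
            plus.append(line)
        elif strand == "-":
            minus.append(line)
    return plus, minus


annotation = [
    "#!genome-build example",
    'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "A";',
    'chr1\tsrc\texon\t150\t180\t.\t-\t.\tgene_id "B";',
    'chr1\tsrc\texon\t300\t400\t.\t.\t.\tgene_id "C";',
]
plus, minus = split_gtf_by_strand(annotation)
```

Each list can then be written to its own file and passed to a separate dropEst run together with the matching half of the alignments.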

It seems that all single-cell technologies should be strand-specific because of the library preparation process, so it's strange not to use that fact in gene counting. Or am I wrong?