kharchenkolab/dropEst

"Can't find chromosome" warning

ajwilk opened this issue · 4 comments

I'm trying to identify virus-derived reads in a scRNA-seq dataset. To do this, I've built a dummy chromosome containing the virus sequence to use for alignment. While I can see in the .bam that some of my samples have virus-aligned reads, the final count matrix output of dropEst does not contain any of these reads. I'm guessing it has something to do with this error in the dropEst stage:

Start parsing of bams: 16:06:44. WARNING: Can't find chromosome 'F'

F being my fake chromosome. I did refer to the correct GTF containing the dummy chromosome using the -g flag. How do I get around this?

Can you please double-check if there are entries with type "exon" or "intron" in the 3'rd column for your chromosome? And that they have "transcript_id" parameter in the last column. Or if you can share your file, I can look by myself.

Yup, looks like it does:

$ head h37.gtf #!genebuild-last-updated 2013-09
#!genome-build GRCh37.p13
#!genome-build-accession NCBI:GCA_000001405.14
#!genome-date 2009-02
#!genome-version GRCh37
F       protein_coding  exon    1       2341    .       +       .       gene_name "Influenza_PB1"; transcript_id "Influenza_PB1"; gene_id "Influenza_PB1"; transcript_name "Influenza_PB1"
1       lincRNA exon    29554   30039   .       +       .       gene_id "ENSG00000243485"; transcript_id "ENST00000473358"; exon_number "1"; gene_name "MIR1302-10"; gene_source "ensembl_havana"; gene_biotype "lincRNA"; transcript_name "MIR1302-10-001"; transcript_source "havana"; exon_id "ENSE00001947070";
1       lincRNA transcript      29554   31097   .       +       .       gene_id "ENSG00000243485"; transcript_id "ENST00000473358"; gene_name "MIR1302-10"; gene_source "ensembl_havana"; gene_biotype "lincRNA"; transcript_name "MIR1302-10-001"; transcript_source "havana";
1       lincRNA gene    29554   31109   .       +       .       gene_id "ENSG00000243485"; gene_name "MIR1302-10"; gene_source "ensembl_havana"; gene_biotype "lincRNA";
1       lincRNA exon    30267   30667   .       +       .       gene_id "ENSG00000243485"; transcript_id "ENST00000469289"; exon_number "1"; gene_name "MIR1302-10"; gene_source "ensembl_havana"; gene_biotype "lincRNA"; transcript_name "MIR1302-10-002"; transcript_source "havana"; exon_id "ENSE00001841699";

Let me know if you'd like me to share the whole file, thanks!

Hm, it's actually strange. Error you posted mean that the pipeline can't parse chromosome "F" from gtf. It doesn't depend on bam file. But when I pass just this header to the gtf parser, which is used in the pipeline, it parses this chromosome successfully.
Btw, line you sent me by e-mail differs from what you show here in exactly 3'rd column:

F              protein_coding  transcript             1              2341      .               +              .               gene_name "Influenza_PB1"

and when I change exon to transcript, the parser doesn't see the chromosome anymore (as expected). Are you sure that you've run the pipeline with exactly this version of the file?

Looked like from my notes that I was experiencing the error with the GTF I posted above. But just tried it again and it's working. My mistake, thanks for your help