"Can't find chromosome" warning

Question

"Can't find chromosome" warning

ajwilk opened this issue 5 years ago · 4 comments

I'm trying to identify virus-derived reads in a scRNA-seq dataset. To do this, I've built a dummy chromosome containing the virus sequence to use for alignment. While I can see in the .bam that some of my samples have virus-aligned reads, the final count matrix output of dropEst does not contain any of these reads. I'm guessing it has something to do with this error in the dropEst stage:

Start parsing of bams: 16:06:44. WARNING: Can't find chromosome 'F'

F being my fake chromosome. I did refer to the correct GTF containing the dummy chromosome using the -g flag. How do I get around this?

Answer 1 · 2019-04-18T00:17:26.000Z

Can you please double-check if there are entries with type "exon" or "intron" in the 3'rd column for your chromosome? And that they have "transcript_id" parameter in the last column. Or if you can share your file, I can look by myself.

Answer 2 · 2019-04-18T01:05:48.000Z

Yup, looks like it does:

$ head h37.gtf #!genebuild-last-updated 2013-09
#!genome-build GRCh37.p13
#!genome-build-accession NCBI:GCA_000001405.14
#!genome-date 2009-02
#!genome-version GRCh37
F       protein_coding  exon    1       2341    .       +       .       gene_name "Influenza_PB1"; transcript_id "Influenza_PB1"; gene_id "Influenza_PB1"; transcript_name "Influenza_PB1"
1       lincRNA exon    29554   30039   .       +       .       gene_id "ENSG00000243485"; transcript_id "ENST00000473358"; exon_number "1"; gene_name "MIR1302-10"; gene_source "ensembl_havana"; gene_biotype "lincRNA"; transcript_name "MIR1302-10-001"; transcript_source "havana"; exon_id "ENSE00001947070";
1       lincRNA transcript      29554   31097   .       +       .       gene_id "ENSG00000243485"; transcript_id "ENST00000473358"; gene_name "MIR1302-10"; gene_source "ensembl_havana"; gene_biotype "lincRNA"; transcript_name "MIR1302-10-001"; transcript_source "havana";
1       lincRNA gene    29554   31109   .       +       .       gene_id "ENSG00000243485"; gene_name "MIR1302-10"; gene_source "ensembl_havana"; gene_biotype "lincRNA";
1       lincRNA exon    30267   30667   .       +       .       gene_id "ENSG00000243485"; transcript_id "ENST00000469289"; exon_number "1"; gene_name "MIR1302-10"; gene_source "ensembl_havana"; gene_biotype "lincRNA"; transcript_name "MIR1302-10-002"; transcript_source "havana"; exon_id "ENSE00001841699";

Let me know if you'd like me to share the whole file, thanks!

Answer 3 · 2019-04-18T02:25:18.000Z

Hm, it's actually strange. Error you posted mean that the pipeline can't parse chromosome "F" from gtf. It doesn't depend on bam file. But when I pass just this header to the gtf parser, which is used in the pipeline, it parses this chromosome successfully.
Btw, line you sent me by e-mail differs from what you show here in exactly 3'rd column:

F              protein_coding  transcript             1              2341      .               +              .               gene_name "Influenza_PB1"

and when I change exon to transcript, the parser doesn't see the chromosome anymore (as expected). Are you sure that you've run the pipeline with exactly this version of the file?

Answer 4 · 2019-04-18T16:31:37.000Z

Looked like from my notes that I was experiencing the error with the GTF I posted above. But just tried it again and it's working. My mistake, thanks for your help