broadinstitute/Drop-seq

ConvertToRefFlat skips many transcripts

Opened this issue · 14 comments

auesro commented

Affected tool(s)

ConvertToRefFlat

Affected version(s)

  • Latest public release version [2.5.1]
  • Latest development/master branch as of [date of test?]

Description

I am analyzing a public dataset generated with Drop-seq (GSE103892). In order to merge the resulting expression matrix with other 10x Genomics datasets, including my own, I have used the genome and GTF files provided by the Pool Lab, which represent an improved version of the 10x Genomics-provided reference files.

When running the provided script to generate the required Drop-seq metadata files, I observe that many transcripts and some exons are skipped at the ConvertToRefFlat step as can be observed in the attached log file.

Then, during the next step, ReduceGtf, the same skipping warnings are printed and, in addition, many warnings of the following form are output:
WARNING 2023-03-14 09:14:56 EnhanceGTFRecords gene GTFRecord(chrX:103356476-103396092 + . [Chic1 gene]) != GeneFromGTF(chrX:103356476-103409092 + Chic1) -- skipping

Both types of warnings can be observed in the attached logfile: logfile.txt
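For anyone trying to reproduce the second warning outside the toolkit: one plausible reading of it is that the span on the gene record differs from the span implied by the gene's exons. The sketch below checks only that. It is not Drop-seq code, `span_mismatches` is a hypothetical helper, it assumes `key "value";` attributes in column 9, and it keys on gene_name alone (ignoring contig and strand) to stay short.

```python
import re

# Assumes GENCODE/Ensembl-style attributes: key "value"; key "value"; ...
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')

def span_mismatches(gtf_lines):
    """Return {gene_name: (gene_record_span, exon_derived_span)} where the two differ."""
    gene_span = {}
    exon_span = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        name = dict(ATTR_RE.findall(cols[8])).get("gene_name")
        if name is None:
            continue
        start, end = int(cols[3]), int(cols[4])
        if cols[2] == "gene":
            gene_span[name] = (start, end)
        elif cols[2] == "exon":
            # Grow the exon-derived span to cover every exon seen so far.
            lo, hi = exon_span.get(name, (start, end))
            exon_span[name] = (min(lo, start), max(hi, end))
    return {
        name: (gene_span[name], exon_span[name])
        for name in gene_span
        if name in exon_span and gene_span[name] != exon_span[name]
    }
```

Passing an open file handle for the GTF (or a list of lines) returns only the genes whose two spans disagree, which gives a short list to inspect by hand.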

The resulting refFlat file is 10 times smaller than the one provided by you in the Cookbook.

Is this normal?

Expected behavior

GTF to refFlat conversion includes all transcripts in the GTF

Actual behavior

Many transcripts are removed from refFlat.

Hi,

GTF files are poorly defined, so parsing files from different sources can be tricky. This is a good example.

Let’s take your first gene where you are warned the gene is skipped.

Here’s some junky R code so you can replicate, with the interesting output:

a=read.table("mouse_mm10_optimized_v2.gtf.gz", header=F, stringsAsFactors=F, sep="\t")
a=a[grep ("Chic1", a$V9),]
aa=a[a$V3=="transcript",]
aa

          V1     V2         V3        V4        V5 V6 V7 V8
1955023 chrX HAVANA transcript 103356476 103396092  .  +  .
1955040 chrX HAVANA transcript 103356476 103396092  .  +  .
        V9
1955023 gene_id ENSMUSG00000031327; gene_version 10; gene_type protein_coding; gene_name Chic1; level 2; mgi_id MGI:1344694; havana_gene OTTMUSG00000018247.1; transcript_id ENSMUST00000116547; transcript_version 2; transcript_type protein_coding; transcript_name Chic1-201; transcript_support_level 1; havana_transcript OTTMUST00000044104.1; protein_id ENSMUSP00000112246.2; tag CCDS; ccdsid CCDS53165.1; ID 1747380
1955040 gene_id ENSMUSG00000031327; gene_version 10; gene_type protein_coding; gene_name Chic1; level 2; mgi_id MGI:1344694; havana_gene OTTMUSG00000018247.1; transcript_id ENSMUST00000116547; transcript_version 2; transcript_type protein_coding; transcript_name Chic1-201; transcript_support_level 1; havana_transcript OTTMUST00000044104.1; protein_id ENSMUSP00000112246.2; tag CCDS; ccdsid CCDS53165.1; ID 2308611

The only difference in these two records is the ID field at the end. All of the standard identifiers (transcript name, transcript ID, etc) are the same. Since this lab has come up with their own unique field to define that these two records are unique, it's not at all surprising that software not written by them would not know how to interpret this field.

To use this GTF with our toolkit you'd have to modify it so the transcript_id field is unique for these two records. Perhaps concatenating the existing transcript ID + new ID would work? I'd start by isolating this one gene in its own GTF and trying to parse it before and after your modifications to see if you have a reasonable solution.
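As a rough starting point for that isolation step, here is a minimal Python sketch that pulls one gene's lines out of a GTF and counts how many "transcript" records share each transcript_id. The helper names (`parse_attributes`, `gene_lines`, `transcript_record_counts`) are made up for illustration and are not part of Drop-seq; the attribute parsing assumes the `key "value";` layout in column 9.

```python
import re
from collections import Counter

# Assumes GENCODE/Ensembl-style attributes: key "value"; key "value"; ...
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')

def parse_attributes(field):
    """Return the column-9 attributes as a dict (last value wins for repeated keys)."""
    return dict(ATTR_RE.findall(field))

def gene_lines(gtf_lines, gene_name):
    """Yield the GTF lines whose gene_name attribute equals gene_name."""
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9 and parse_attributes(cols[8]).get("gene_name") == gene_name:
            yield line

def transcript_record_counts(gtf_lines):
    """Count 'transcript' feature records per transcript_id."""
    counts = Counter()
    for line in gtf_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9 and cols[2] == "transcript":
            counts[parse_attributes(cols[8]).get("transcript_id")] += 1
    return counts
```

Writing the output of `gene_lines` to a small file gives a single-gene GTF to test with, and any transcript_id with a count above 1 in `transcript_record_counts` is a candidate for renaming. Running the same count after editing the file is a quick way to confirm the IDs are unique before handing it to ConvertToRefFlat.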

As for your expected behavior: I have yet to see a GTF where all transcripts are included in the refFlat file - there are a handful of genes that have the same gene symbol and appear on many chromosomes (and are thus not unique) that are filtered from the standard Ensembl GTF file. Perhaps reprocessing the 10x GTF file would include (almost?) all transcripts as they have spent some effort cleaning their GTF file before release.

Any idea why there are two records for this transcript, given that they have the same coordinates and information? Perhaps they are defining alternate sets of used exons but happen to use the same first and last exon? All the exons point to the same transcript_id, so I'm really not sure what's going on here.

auesro commented

Hi James,

Thanks for your quick reply.

I was afraid I could not "easily" use the same GTF for the Drop-seq samples but I had to give it a try!

Agreed on my hyperbole regarding the expected behavior, typing too fast.

I have no idea why those transcripts are duplicated. I went and checked the files they provide detailing the changes to the GTF, but I didn't find the explanation there (caveat: I'm a biologist, not a bioinformatician!).

I will give your solution of editing the GTF file a try, but it would be nice to know why they went to the trouble of duplicating transcripts and assigning new IDs in the first place...

In addition, and probably more important, do you think the massive skipping of duplicated records during the generation of the refFlat will have any impact on DigitalExpression? I am assuming that 1 of the duplicated transcripts is retained in the refFlat file, right?

auesro commented

Yea...that's what I suspected when I saw the results of DGE this morning and that's why I started looking around...

It seems even more complicated. The file you looked at (mouse_mm10_optimized_v2.gtf.gz) is the new version of that GTF. I am using the previous version (mouse_mm10_optimized_v1.gtf.gz) where, if I search for Chic1, just to follow your example:

(in202109) [lab_eh_2@NeuroServer PoolLab]$ grep -w Chic1 PoolLab.gtf
chrX	HAVANA	gene	103356476	103396092	.	+	.	gene_id "ENSMUSG00000031327"; gene_version "10"; gene_type "protein_coding"; gene_name "Chic1"; level "2"; mgi_id "MGI:1344694"; havana_gene "OTTMUSG00000018247.1"; ID "1747379";
chrX	HAVANA	transcript	103356476	103396092	.	+	.	gene_id "ENSMUSG00000031327"; gene_version "10"; transcript_id "ENSMUST00000116547"; transcript_version "2"; gene_type "protein_coding"; gene_name "Chic1"; transcript_type "protein_coding"; transcript_name "Chic1-201"; level "2"; transcript_support_level "1"; mgi_id "MGI:1344694"; havana_gene "OTTMUSG00000018247.1"; havana_transcript "OTTMUST00000044104.1"; protein_id "ENSMUSP00000112246.2"; tag "CCDS"; ccdsid "CCDS53165.1"; ID "1747380";
chrX	HAVANA	exon	103356476	103356889	.	+	.	gene_id "ENSMUSG00000031327"; gene_version "10"; transcript_id "ENSMUST00000116547"; transcript_version "2"; gene_type "protein_coding"; gene_name "Chic1"; transcript_type "protein_coding"; transcript_name "Chic1-201"; level "2"; transcript_support_level "1"; mgi_id "MGI:1344694"; havana_gene "OTTMUSG00000018247.1"; havana_transcript "OTTMUST00000044104.1"; protein_id "ENSMUSP00000112246.2"; tag "CCDS"; ccdsid "CCDS53165.1"; ID "1747381";
chrX	HAVANA	CDS	103356585	103356889	.	+	0	gene_id "ENSMUSG00000031327"; gene_version "10"; transcript_id "ENSMUST00000116547"; transcript_version "2"; gene_type "protein_coding"; gene_name "Chic1"; transcript_type "protein_coding"; transcript_name "Chic1-201"; level "2"; transcript_support_level "1"; mgi_id "MGI:1344694"; havana_gene "OTTMUSG00000018247.1"; havana_transcript "OTTMUST00000044104.1"; protein_id "ENSMUSP00000112246.2"; tag "CCDS"; ccdsid "CCDS53165.1"; ID "1747382";
chrX	HAVANA	start_codon	103356585	103356587	.	+	0	gene_id "ENSMUSG00000031327"; gene_version "10"; transcript_id "ENSMUST00000116547"; transcript_version "2"; gene_type "protein_coding"; gene_name "Chic1"; transcript_type "protein_coding"; transcript_name "Chic1-201"; level "2"; transcript_support_level "1"; mgi_id "MGI:1344694"; havana_gene "OTTMUSG00000018247.1"; havana_transcript "OTTMUST00000044104.1"; protein_id "ENSMUSP00000112246.2"; tag "CCDS"; ccdsid "CCDS53165.1"; ID "1747383";
chrX	HAVANA	exon	103366214	103366268	.	+	.	gene_id "ENSMUSG00000031327"; gene_version "10"; transcript_id "ENSMUST00000116547"; transcript_version "2"; gene_type "protein_coding"; gene_name "Chic1"; transcript_type "protein_coding"; transcript_name "Chic1-201"; level "2"; transcript_support_level "1"; mgi_id "MGI:1344694"; havana_gene "OTTMUSG00000018247.1"; havana_transcript "OTTMUST00000044104.1"; protein_id "ENSMUSP00000112246.2"; tag "CCDS"; ccdsid "CCDS53165.1"; ID "1747384";
chrX	HAVANA	CDS	103366214	103366268	.	+	1	gene_id "ENSMUSG00000031327"; gene_version "10"; transcript_id "ENSMUST00000116547"; transcript_version "2"; gene_type "protein_coding"; gene_name "Chic1"; transcript_type "protein_coding"; transcript_name "Chic1-201"; level "2"; transcript_support_level "1"; mgi_id "MGI:1344694"; havana_gene "OTTMUSG00000018247.1"; havana_transcript "OTTMUST00000044104.1"; protein_id "ENSMUSP00000112246.2"; tag "CCDS"; ccdsid "CCDS53165.1"; ID "1747385";
chrX	HAVANA	exon	103373599	103373754	.	+	.	gene_id "ENSMUSG00000031327"; gene_version "10"; transcript_id "ENSMUST00000116547"; transcript_version "2"; gene_type "protein_coding"; gene_name "Chic1"; transcript_type "protein_coding"; transcript_name "Chic1-201"; level "2"; transcript_support_level "1"; mgi_id "MGI:1344694"; havana_gene "OTTMUSG00000018247.1"; havana_transcript "OTTMUST00000044104.1"; protein_id "ENSMUSP00000112246.2"; tag "CCDS"; ccdsid "CCDS53165.1"; ID "1747386";
chrX	HAVANA	CDS	103373599	103373754	.	+	0	gene_id "ENSMUSG00000031327"; gene_version "10"; transcript_id "ENSMUST00000116547"; transcript_version "2"; gene_type "protein_coding"; gene_name "Chic1"; transcript_type "protein_coding"; transcript_name "Chic1-201"; level "2"; transcript_support_level "1"; mgi_id "MGI:1344694"; havana_gene "OTTMUSG00000018247.1"; havana_transcript "OTTMUST00000044104.1"; protein_id "ENSMUSP00000112246.2"; tag "CCDS"; ccdsid "CCDS53165.1"; ID "1747387";
chrX	HAVANA	exon	103387255	103387311	.	+	.	gene_id "ENSMUSG00000031327"; gene_version "10"; transcript_id "ENSMUST00000116547"; transcript_version "2"; gene_type "protein_coding"; gene_name "Chic1"; transcript_type "protein_coding"; transcript_name "Chic1-201"; level "2"; transcript_support_level "1"; mgi_id "MGI:1344694"; havana_gene "OTTMUSG00000018247.1"; havana_transcript "OTTMUST00000044104.1"; protein_id "ENSMUSP00000112246.2"; tag "CCDS"; ccdsid "CCDS53165.1"; ID "1747388";
chrX	HAVANA	CDS	103387255	103387311	.	+	0	gene_id "ENSMUSG00000031327"; gene_version "10"; transcript_id "ENSMUST00000116547"; transcript_version "2"; gene_type "protein_coding"; gene_name "Chic1"; transcript_type "protein_coding"; transcript_name "Chic1-201"; level "2"; transcript_support_level "1"; mgi_id "MGI:1344694"; havana_gene "OTTMUSG00000018247.1"; havana_transcript "OTTMUST00000044104.1"; protein_id "ENSMUSP00000112246.2"; tag "CCDS"; ccdsid "CCDS53165.1"; ID "1747389";
chrX	HAVANA	exon	103387987	103388046	.	+	.	gene_id "ENSMUSG00000031327"; gene_version "10"; transcript_id "ENSMUST00000116547"; transcript_version "2"; gene_type "protein_coding"; gene_name "Chic1"; transcript_type "protein_coding"; transcript_name "Chic1-201"; level "2"; transcript_support_level "1"; mgi_id "MGI:1344694"; havana_gene "OTTMUSG00000018247.1"; havana_transcript "OTTMUST00000044104.1"; protein_id "ENSMUSP00000112246.2"; tag "CCDS"; ccdsid "CCDS53165.1"; ID "1747390";
chrX	HAVANA	CDS	103387987	103388046	.	+	0	gene_id "ENSMUSG00000031327"; gene_version "10"; transcript_id "ENSMUST00000116547"; transcript_version "2"; gene_type "protein_coding"; gene_name "Chic1"; transcript_type "protein_coding"; transcript_name "Chic1-201"; level "2"; transcript_support_level "1"; mgi_id "MGI:1344694"; havana_gene "OTTMUSG00000018247.1"; havana_transcript "OTTMUST00000044104.1"; protein_id "ENSMUSP00000112246.2"; tag "CCDS"; ccdsid "CCDS53165.1"; ID "1747391";
chrX	HAVANA	exon	103389460	103409092	.	+	.	gene_id "ENSMUSG00000031327"; gene_version "10"; transcript_id "ENSMUST00000116547"; transcript_version "2"; gene_type "protein_coding"; gene_name "Chic1"; transcript_type "protein_coding"; transcript_name "Chic1-201"; level "2"; transcript_support_level "1"; mgi_id "MGI:1344694"; havana_gene "OTTMUSG00000018247.1"; havana_transcript "OTTMUST00000044104.1"; protein_id "ENSMUSP00000112246.2"; tag "CCDS"; ccdsid "CCDS53165.1"; ID "1747392";
chrX	HAVANA	CDS	103389460	103389507	.	+	0	gene_id "ENSMUSG00000031327"; gene_version "10"; transcript_id "ENSMUST00000116547"; transcript_version "2"; gene_type "protein_coding"; gene_name "Chic1"; transcript_type "protein_coding"; transcript_name "Chic1-201"; level "2"; transcript_support_level "1"; mgi_id "MGI:1344694"; havana_gene "OTTMUSG00000018247.1"; havana_transcript "OTTMUST00000044104.1"; protein_id "ENSMUSP00000112246.2"; tag "CCDS"; ccdsid "CCDS53165.1"; ID "1747393";
chrX	HAVANA	stop_codon	103389508	103389510	.	+	0	gene_id "ENSMUSG00000031327"; gene_version "10"; transcript_id "ENSMUST00000116547"; transcript_version "2"; gene_type "protein_coding"; gene_name "Chic1"; transcript_type "protein_coding"; transcript_name "Chic1-201"; level "2"; transcript_support_level "1"; mgi_id "MGI:1344694"; havana_gene "OTTMUSG00000018247.1"; havana_transcript "OTTMUST00000044104.1"; protein_id "ENSMUSP00000112246.2"; tag "CCDS"; ccdsid "CCDS53165.1"; ID "1747394";
chrX	HAVANA	UTR	103356476	103356584	.	+	.	gene_id "ENSMUSG00000031327"; gene_version "10"; transcript_id "ENSMUST00000116547"; transcript_version "2"; gene_type "protein_coding"; gene_name "Chic1"; transcript_type "protein_coding"; transcript_name "Chic1-201"; level "2"; transcript_support_level "1"; mgi_id "MGI:1344694"; havana_gene "OTTMUSG00000018247.1"; havana_transcript "OTTMUST00000044104.1"; protein_id "ENSMUSP00000112246.2"; tag "CCDS"; ccdsid "CCDS53165.1"; ID "1747395";
chrX	HAVANA	UTR	103389508	103396092	.	+	.	gene_id "ENSMUSG00000031327"; gene_version "10"; transcript_id "ENSMUST00000116547"; transcript_version "2"; gene_type "protein_coding"; gene_name "Chic1"; transcript_type "protein_coding"; transcript_name "Chic1-201"; level "2"; transcript_support_level "1"; mgi_id "MGI:1344694"; havana_gene "OTTMUSG00000018247.1"; havana_transcript "OTTMUST00000044104.1"; protein_id "ENSMUSP00000112246.2"; tag "CCDS"; ccdsid "CCDS53165.1"; ID "1747396";
chrX	HAVANA	exon	103356476	103396092	.	+	.	gene_id "ENSMUSG00000031327"; gene_version "10"; transcript_id "115431"; transcript_version "2"; gene_type "protein_coding"; gene_name "Chic1"; transcript_type "protein_coding"; transcript_name "Chic1-201"; level "2"; transcript_support_level "1"; mgi_id "MGI:1344694"; havana_gene "OTTMUSG00000018247.1"; havana_transcript "OTTMUST00000044104.1"; protein_id "ENSMUSP00000112246.2"; tag "CCDS"; ccdsid "CCDS53165.1"; ID "11543110";

Only 1 transcript appears, right? Why is that a problem when parsing?

You'll need to give me a link to the GTF file you're parsing.

auesro commented

I thought so, but the Pool Lab has removed access to version 1. No problem: here you go

auesro commented

I assume that the GTFParser called by ConvertToRefFlat looks at the "transcript_name" field, right? However, I also built the metadata using another 10x reference (mm10 1.2.0, which I happened to have sitting around), where the search for Chic1 gives:

(base) auesro@AER-PC1-LM:~/Desktop$ grep -w Chic1 mm10_1.2.0.gtf
X	ensembl_havana	gene	103356476	103396092	.	+	.	gene_id "ENSMUSG00000031327"; gene_version "10"; gene_name "Chic1"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; havana_gene "OTTMUSG00000018247"; havana_gene_version "1";
X	ensembl_havana	transcript	103356476	103396092	.	+	.	gene_id "ENSMUSG00000031327"; gene_version "10"; transcript_id "ENSMUST00000116547"; transcript_version "2"; gene_name "Chic1"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; havana_gene "OTTMUSG00000018247"; havana_gene_version "1"; transcript_name "Chic1-001"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS53165"; havana_transcript "OTTMUST00000044104"; havana_transcript_version "1"; tag "basic"; transcript_support_level "1";
X	ensembl_havana	exon	103356476	103356889	.	+	.	gene_id "ENSMUSG00000031327"; gene_version "10"; transcript_id "ENSMUST00000116547"; transcript_version "2"; exon_number "1"; gene_name "Chic1"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; havana_gene "OTTMUSG00000018247"; havana_gene_version "1"; transcript_name "Chic1-001"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS53165"; havana_transcript "OTTMUST00000044104"; havana_transcript_version "1"; exon_id "ENSMUSE00000386478"; exon_version "5"; tag "basic"; transcript_support_level "1";
X	ensembl_havana	CDS	103356585	103356889	.	+	0	gene_id "ENSMUSG00000031327"; gene_version "10"; transcript_id "ENSMUST00000116547"; transcript_version "2"; exon_number "1"; gene_name "Chic1"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; havana_gene "OTTMUSG00000018247"; havana_gene_version "1"; transcript_name "Chic1-001"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS53165"; havana_transcript "OTTMUST00000044104"; havana_transcript_version "1"; protein_id "ENSMUSP00000112246"; protein_version "2"; tag "basic"; transcript_support_level "1";
X	ensembl_havana	start_codon	103356585	103356587	.	+	0	gene_id "ENSMUSG00000031327"; gene_version "10"; transcript_id "ENSMUST00000116547"; transcript_version "2"; exon_number "1"; gene_name "Chic1"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; havana_gene "OTTMUSG00000018247"; havana_gene_version "1"; transcript_name "Chic1-001"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS53165"; havana_transcript "OTTMUST00000044104"; havana_transcript_version "1"; tag "basic"; transcript_support_level "1";
X	ensembl_havana	exon	103366214	103366268	.	+	.	gene_id "ENSMUSG00000031327"; gene_version "10"; transcript_id "ENSMUST00000116547"; transcript_version "2"; exon_number "2"; gene_name "Chic1"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; havana_gene "OTTMUSG00000018247"; havana_gene_version "1"; transcript_name "Chic1-001"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS53165"; havana_transcript "OTTMUST00000044104"; havana_transcript_version "1"; exon_id "ENSMUSE00000284206"; exon_version "1"; tag "basic"; transcript_support_level "1";
X	ensembl_havana	CDS	103366214	103366268	.	+	1	gene_id "ENSMUSG00000031327"; gene_version "10"; transcript_id "ENSMUST00000116547"; transcript_version "2"; exon_number "2"; gene_name "Chic1"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; havana_gene "OTTMUSG00000018247"; havana_gene_version "1"; transcript_name "Chic1-001"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS53165"; havana_transcript "OTTMUST00000044104"; havana_transcript_version "1"; protein_id "ENSMUSP00000112246"; protein_version "2"; tag "basic"; transcript_support_level "1";
X	ensembl_havana	exon	103373599	103373754	.	+	.	gene_id "ENSMUSG00000031327"; gene_version "10"; transcript_id "ENSMUST00000116547"; transcript_version "2"; exon_number "3"; gene_name "Chic1"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; havana_gene "OTTMUSG00000018247"; havana_gene_version "1"; transcript_name "Chic1-001"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS53165"; havana_transcript "OTTMUST00000044104"; havana_transcript_version "1"; exon_id "ENSMUSE00000336993"; exon_version "1"; tag "basic"; transcript_support_level "1";
X	ensembl_havana	CDS	103373599	103373754	.	+	0	gene_id "ENSMUSG00000031327"; gene_version "10"; transcript_id "ENSMUST00000116547"; transcript_version "2"; exon_number "3"; gene_name "Chic1"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; havana_gene "OTTMUSG00000018247"; havana_gene_version "1"; transcript_name "Chic1-001"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS53165"; havana_transcript "OTTMUST00000044104"; havana_transcript_version "1"; protein_id "ENSMUSP00000112246"; protein_version "2"; tag "basic"; transcript_support_level "1";
X	ensembl_havana	exon	103387255	103387311	.	+	.	gene_id "ENSMUSG00000031327"; gene_version "10"; transcript_id "ENSMUST00000116547"; transcript_version "2"; exon_number "4"; gene_name "Chic1"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; havana_gene "OTTMUSG00000018247"; havana_gene_version "1"; transcript_name "Chic1-001"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS53165"; havana_transcript "OTTMUST00000044104"; havana_transcript_version "1"; exon_id "ENSMUSE00000364324"; exon_version "1"; tag "basic"; transcript_support_level "1";
X	ensembl_havana	CDS	103387255	103387311	.	+	0	gene_id "ENSMUSG00000031327"; gene_version "10"; transcript_id "ENSMUST00000116547"; transcript_version "2"; exon_number "4"; gene_name "Chic1"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; havana_gene "OTTMUSG00000018247"; havana_gene_version "1"; transcript_name "Chic1-001"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS53165"; havana_transcript "OTTMUST00000044104"; havana_transcript_version "1"; protein_id "ENSMUSP00000112246"; protein_version "2"; tag "basic"; transcript_support_level "1";
X	ensembl_havana	exon	103387987	103388046	.	+	.	gene_id "ENSMUSG00000031327"; gene_version "10"; transcript_id "ENSMUST00000116547"; transcript_version "2"; exon_number "5"; gene_name "Chic1"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; havana_gene "OTTMUSG00000018247"; havana_gene_version "1"; transcript_name "Chic1-001"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS53165"; havana_transcript "OTTMUST00000044104"; havana_transcript_version "1"; exon_id "ENSMUSE00000404929"; exon_version "1"; tag "basic"; transcript_support_level "1";
X	ensembl_havana	CDS	103387987	103388046	.	+	0	gene_id "ENSMUSG00000031327"; gene_version "10"; transcript_id "ENSMUST00000116547"; transcript_version "2"; exon_number "5"; gene_name "Chic1"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; havana_gene "OTTMUSG00000018247"; havana_gene_version "1"; transcript_name "Chic1-001"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS53165"; havana_transcript "OTTMUST00000044104"; havana_transcript_version "1"; protein_id "ENSMUSP00000112246"; protein_version "2"; tag "basic"; transcript_support_level "1";
X	ensembl_havana	exon	103389460	103396092	.	+	.	gene_id "ENSMUSG00000031327"; gene_version "10"; transcript_id "ENSMUST00000116547"; transcript_version "2"; exon_number "6"; gene_name "Chic1"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; havana_gene "OTTMUSG00000018247"; havana_gene_version "1"; transcript_name "Chic1-001"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS53165"; havana_transcript "OTTMUST00000044104"; havana_transcript_version "1"; exon_id "ENSMUSE00000364216"; exon_version "7"; tag "basic"; transcript_support_level "1";
X	ensembl_havana	CDS	103389460	103389507	.	+	0	gene_id "ENSMUSG00000031327"; gene_version "10"; transcript_id "ENSMUST00000116547"; transcript_version "2"; exon_number "6"; gene_name "Chic1"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; havana_gene "OTTMUSG00000018247"; havana_gene_version "1"; transcript_name "Chic1-001"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS53165"; havana_transcript "OTTMUST00000044104"; havana_transcript_version "1"; protein_id "ENSMUSP00000112246"; protein_version "2"; tag "basic"; transcript_support_level "1";
X	ensembl_havana	stop_codon	103389508	103389510	.	+	0	gene_id "ENSMUSG00000031327"; gene_version "10"; transcript_id "ENSMUST00000116547"; transcript_version "2"; exon_number "6"; gene_name "Chic1"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; havana_gene "OTTMUSG00000018247"; havana_gene_version "1"; transcript_name "Chic1-001"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS53165"; havana_transcript "OTTMUST00000044104"; havana_transcript_version "1"; tag "basic"; transcript_support_level "1";
X	ensembl_havana	five_prime_utr	103356476	103356584	.	+	.	gene_id "ENSMUSG00000031327"; gene_version "10"; transcript_id "ENSMUST00000116547"; transcript_version "2"; gene_name "Chic1"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; havana_gene "OTTMUSG00000018247"; havana_gene_version "1"; transcript_name "Chic1-001"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS53165"; havana_transcript "OTTMUST00000044104"; havana_transcript_version "1"; tag "basic"; transcript_support_level "1";
X	ensembl_havana	three_prime_utr	103389511	103396092	.	+	.	gene_id "ENSMUSG00000031327"; gene_version "10"; transcript_id "ENSMUST00000116547"; transcript_version "2"; gene_name "Chic1"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; havana_gene "OTTMUSG00000018247"; havana_gene_version "1"; transcript_name "Chic1-001"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS53165"; havana_transcript "OTTMUST00000044104"; havana_transcript_version "1"; tag "basic"; transcript_support_level "1";

This GTF doesn't produce any warnings regarding Chic1...

Can you also give me the sequence dictionary you used to call ConvertToRefFlat? My mm10 reference does not have matching contig names.

auesro commented

Sure, here

So, I at least have a sense of the problem now for the Chic1 example.

Most features have the same transcript ID and transcript name:

transcript_name "Chic1-201"
transcript_id "ENSMUST00000116547"

One exon is a weird outlier and confuses parsing, having the same transcript name, but a new transcript ID.

chrX	HAVANA	exon	103356476	103396092	.	+	.	gene_id "ENSMUSG00000031327"; gene_version "10"; transcript_id "115431"; transcript_version "2"; gene_type "protein_coding"; gene_name "Chic1"; transcript_type "protein_coding"; transcript_name "Chic1-201"; level "2"; transcript_support_level "1"; mgi_id "MGI:1344694"; havana_gene "OTTMUSG00000018247.1"; havana_transcript "OTTMUST00000044104.1"; protein_id "ENSMUSP00000112246.2"; tag "CCDS"; ccdsid "CCDS53165.1"; ID "11543110";

In your examples above, this was the last line in the problematic example. The exon spans the entire gene, which seems a bit unusual.

Our parser is clustering GTF lines on the transcript ID, then adding them to the gene model by the transcript name (which is usually 1:1 unique with the transcript ID!) What transcript does that exon belong to? Is it a new transcript for the gene, or the same transcript as the other features? Maybe it's time to consider having a discussion with the author of the GTF file to decide exactly what this exon is.
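To find every case like this in a GTF, a small check along the following lines may help. It is only a sketch of the idea described above and not the toolkit's parser: it maps each transcript_name to the set of transcript_ids that carry it and reports the names that are not 1:1. The function name `names_with_multiple_ids` is hypothetical, and the attribute regex assumes `key "value";` pairs in column 9.

```python
import re
from collections import defaultdict

# Assumes GENCODE/Ensembl-style attributes: key "value"; key "value"; ...
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')

def names_with_multiple_ids(gtf_lines):
    """Return {transcript_name: sorted transcript_ids} for names tied to more than one ID."""
    ids_by_name = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        attrs = dict(ATTR_RE.findall(cols[8]))
        name = attrs.get("transcript_name")
        tid = attrs.get("transcript_id")
        if name and tid:
            ids_by_name[name].add(tid)
    # Keep only the names that break the usual one-name-to-one-ID relationship.
    return {name: sorted(ids) for name, ids in ids_by_name.items() if len(ids) > 1}
```

On the Chic1 lines quoted earlier in this thread, this would flag "Chic1-201" as carrying both "ENSMUST00000116547" and "115431". The full list of flagged names could be useful to send to the GTF authors along with the question.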

auesro commented

Let's see if I get a reply.

In the meantime, would my problem be solved if the parser looked only at the transcript_name? Do you think that would be easy to implement?

Maybe @jamesnemesh you can somehow integrate AGAT in your pipeline for uniform GTF processing: https://github.com/NBISweden/AGAT?

We don't distribute R tools, but one might imagine pre-processing your GTF file with this tool, and having the output be a valid input for Drop-Seq tools.

Does this not work currently? Do you have an example GTF output that you've processed this way and throws errors?

@jamesnemesh: I have a GTF file that was created by AGAT that throws warnings (specifically: mostly Chromosome/Strand disagreement ones) when running ConvertToRefFlat. If there is anything I can do to help trouble-shooting this issue, let me know.