Failed to parse GTF: non-unique Exon ID

Question

Failed to parse GTF: non-unique Exon ID

Closed this issue 4 years ago · 27 comments

I was trying to run the command:

rnaseqc.v2.3.5.linux ${REFDIR}/Macaca_mulatta.Mmul_8.0.1.97.chr.gtf ${ID}.transcript.sorted.bam ${QCDIR}

But get the following error:
"Failed to parse the GTF: Detected non-unique Exon ID: ENSMMUE00000311984"

I grep'ed out the Exon ID, and this is what I see:

1 ensembl exon 25432 25503 . + . gene_id "ENSMMUG00000005947"; gene_version "3"; transcript_id "ENSMMUT00000063154"; transcript_version "1"; exon_number "1"; gene_name "SAMD11"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "SAMD11-201"; transcript_source "ensembl"; transcript_biotype "protein_coding"; exon_id "ENSMMUE00000311984"; exon_version "2";

1 ensembl exon 25432 25503 . + . gene_id "ENSMMUG00000005947"; gene_version "3"; transcript_id "ENSMMUT00000076984"; transcript_version "1"; exon_number "1"; gene_name "SAMD11"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "SAMD11-202"; transcript_source "ensembl"; transcript_biotype "protein_coding"; exon_id "ENSMMUE00000311984"; exon_version "2";

1 ensembl exon 25432 25503 . + . gene_id "ENSMMUG00000005947"; gene_version "3"; transcript_id "ENSMMUT00000054447"; transcript_version "1"; exon_number "1"; gene_name "SAMD11"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "SAMD11-203"; transcript_source "ensembl"; transcript_biotype "protein_coding"; exon_id "ENSMMUE00000311984"; exon_version "2";
.
.
.
Although the exon id is the same, the transcript id & name is different.

How should this be handled?

many thanks!

Answer 1 · 2020-02-27T13:12:27.000Z

Is there any other information that I can provide that might help?

thanks!

Answer 2 · 2020-02-27T13:56:34.000Z

Sorry for not getting back to you sooner. rnaseqc does not support full annotation gtfs. Please use this script to collapse your annotation, which effectively produces the union of transcripts for each gene. Let me know if there are additional problems or of you have any other questions

Answer 3 · 2020-02-27T23:37:28.000Z

Thanks!
I get the following error:

python3 collapse_annotation.py ../../Macaca_mulatta.Mmul_8.0.1.97.chr.gtf ../../Macaca_mulatta.Mmul_8.0.1.97.chr.uniq.exonid.gtf
Traceback (most recent call last):
File "collapse_annotation.py", line 6, in
from bx.intervals.intersection import IntervalTree
ModuleNotFoundError: No module named 'bx'

Answer 4 · 2020-02-28T05:19:39.000Z

pip install bx-python

Answer 5 · 2020-02-28T19:06:08.000Z

Thanks! I still seem to get an error:

/usr/bin/python3.6 collapse_annotation.py ../../Macaca_mulatta.Mmul_8.0.1.97.chr.gtf ../../Macaca_mulatta.Mmul_8.0.1.97.chr.uniq.exonid.gtf
Traceback (most recent call last):
File "collapse_annotation.py", line 249, in
annotation = Annotation(args.transcript_gtf)
File "collapse_annotation.py", line 68, in init
g = Gene(gene_id, attributes['gene_name'], attributes['gene_type'], chrom, strand, start_pos, end_pos)
KeyError: 'gene_type'

Answer 6 · 2020-02-28T20:14:57.000Z

Make sure you're using the V9 branch that I linked above. The version of that file in the V9 branch can handle GTFs which use gene_biotype instead of gene_type

Answer 7 · 2020-03-08T19:35:46.000Z

Hi, I'm getting a similar error KeyError:'transcript_type' after using the collapse_annotation.py script from the link that you shared. Any help would be appreciated.

Answer 8 · 2020-03-09T14:04:39.000Z

Does the GTF you're collapsing contain the transcript_type or transcript_biotype fields?

Answer 9 · 2020-03-09T14:33:58.000Z

Thanks for your response; I just checked- it contains neither.

Answer 10 · 2020-03-09T19:14:02.000Z

Okay, both the collapsing script and RNA-SeQC require one of those fields to be populated (which gets converted to transcript_type during the collapsing process).

Answer 11 · 2020-03-10T06:31:18.000Z

Thanks again for your response. Sorry, I'm not clear as to how I should proceed here. My gff file is from NCBI (exact link here: https://www.ncbi.nlm.nih.gov/genome/?term=txid8128[orgn]) and I ran "gffread -E GCF_001858045.2_O_niloticus_UMD_NMBU_genomic.gff -T -o GCF_001858045.2_O_niloticus_UMD_NMBU_genomic.gtf". I'm trying to run the collapsing script on GCF_001858045.2_O_niloticus_UMD_NMBU_genomic.gtf so that I can eventually get RNA-SeQC to work. Please let me know how I can modify further my GCF_001858045.2_O_niloticus_UMD_NMBU_genomic.gtf file.

Answer 12 · 2020-03-12T14:43:42.000Z

@Anto007, sorry for the delay. We're working on a patch for the collapse_annotation.py script and will update you here when it's ready

Answer 13 · 2020-03-12T15:02:52.000Z

Great news! Thank you so much.

Answer 14 · 2020-03-13T23:54:12.000Z

The problem with your GTF is that it doesn't follow the specification at https://www.gencodegenes.org/pages/data_format.html. Our script requires that the top-level feature is a gene. You can either patch the GTF file, or try using a different converter.

Answer 15 · 2020-03-14T22:05:09.000Z

Thank you for your response @francois-a; I was wondering if could suggest me a gff to gtf converter other than gffread using which I might have success at running rnaseqc?

Answer 16 · 2020-08-19T17:21:29.000Z

Hello. I got the same error message when using RNAseQC 2.3.6. as the follow:

cmd
[1] "rnaseqc /data/test/gencode.v28.primary_assembly.annotation.gtf /data/test/aligned/star_ZAM1Aligned.sortedByCoord_added.bam /data/test/qualitycontrol/rna_seqc/test.bed"
system(cmd)
Failed to parse the GTF: Detected non-unique Exon ID: ENSE00001957285.1

I wanted to try agraubert's script (https://github.com/broadinstitute/gtex-pipeline/tree/v9/gene_model) to collapse the annotation, but the link goes to page error 404. Any suggestions? Thank you very much!

Answer 17 · 2020-08-19T17:23:31.000Z

Hi,
Please use this link: https://github.com/broadinstitute/gtex-pipeline/tree/master/gene_model

Answer 18 · 2020-08-19T21:52:38.000Z

Hi,
Please use this link: https://github.com/broadinstitute/gtex-pipeline/tree/master/gene_model

Thank you, Francois. It works.

Answer 19 · 2021-09-22T19:54:03.000Z

Apologies for unearthing a dead issue.

I was attempting to use v2.3.5 on TAIR10 annotation, but ran into an issue with collapsing the annotation, since some gene_name/gene_id fields include the ";" character, which was causing inappropriate splitting. To fix, I reran, but edited the collapse script to include a slightly more robust parser for attribute fields like so-

for a in re.finditer(r'(.+?) "(.+?)";', row[8].replace('_biotype', '_type')):  
    if a.group(1).strip() != 'tag':   
        attributes[a.group(1).strip()] = a.group(2)   
    else:   
         attributes['tags'].append(a.group(2))

That seemed to work, and collapse was completed successfully (?). However, I ran into another issue while running rnaseqc- "Failed to parse the GTF: Detected non-unique Gene ID: AT1G01020".

Grepping through the annotation, I can see the gene id corresponds with two different isoforms, but those have separate transcript_ids. This shouldn't be an uncommon occurrence, so I'm not sure why that's a hindrance to this. Looking through it all, it seems like the exon_id seems to have just taken f"{gene_id}_{exon_num}" rather than using f"{transcript_id}_{exon_num}".

Can you recommend a solution to this please?

Answer 20 · 2021-09-22T22:04:41.000Z

The two different gene isoforms should be listed as separate transcript entries, not genes. There should be one gene entry with a unique id, and a transcript entry for each isoform. The transcripts should have unique transcript ids, but share the same gene id.

Answer 21 · 2021-09-22T23:39:54.000Z

hi @agraubert - just to clarify; is the implication here that the issue is stemming from there being multiple lines in the annotation with the third column set to 'gene' (for the same gene_id)? If so, that should give me some direction on how to re-process this annotation. Thank you!

Answer 22 · 2021-09-24T02:00:28.000Z

Yes, that's exactly right! If an entry in the GTF declares itself as a gene, then RNA-SeQC expects it to have a unique gene_id entry.

Answer 23 · 2021-09-29T06:08:08.000Z

gtex-pipeline/gene_model/collapse_annotation.py", line 294, in <module> annotation = Annotation(args.transcript_gtf) gtex-pipeline/gene_model/collapse_annotation.py", line 67, in __init__ attributes[kv[0]] = kv[1] IndexError: list index out of range

Its throwing an error like this

Answer 24 · 2021-10-01T03:19:04.000Z

Hmmm. It appears to be caused by an attribute of some entry in the GTF not containing an actual key-value pair. This script expects all attributes to be a space-separated key-value pairs

Answer 25 · 2021-10-01T07:03:01.000Z

I have downloaded the gff3 for TAIR10 from Ensembl and converted gff3 to gtf using agat_convert_sp_gff2gtf.pl from agat. Then I am using gtex-pipeline/gene_model/collapse_annotation.py to make it compatible with rna_seqc. So would you please suggest me that how would we process the gff to make it compatible with this script also without losing information?

Answer 26 · 2021-10-01T15:49:12.000Z

@agraubert @ramsainanduri if this is the same annotation I was working on- the issue is that the conversion converts the gtf to include ";" characters inside the value attribute. For instance this line-

gene_source "ensembl"; gene_id "AT1G14040"; gene_biotype "protein_coding"; tss_id "TSS29975"; gene_version "1"; gene_name "PHO1;H3";

is an issue because it yields a value H3 that can't be broken down into key-value pairs. That's the whole reason I needed to rewrite that section of the parser using re (code in my first comment above). (In interests of full disclosure, I was using gffread from cufflinks for conversion, so I expect some parts to be different between the gtf files we have on hand)

Answer 27 · 2021-10-07T11:39:12.000Z

Thanks @p-vishwanath , I will try your code and see if it works for me.