getzlab/rnaseqc

Failed to parse GTF: non-unique Exon ID

Closed this issue · 27 comments

I was trying to run the command:

rnaseqc.v2.3.5.linux ${REFDIR}/Macaca_mulatta.Mmul_8.0.1.97.chr.gtf ${ID}.transcript.sorted.bam ${QCDIR}

But get the following error:
"Failed to parse the GTF: Detected non-unique Exon ID: ENSMMUE00000311984"

I grep'ed out the Exon ID, and this is what I see:

1 ensembl exon 25432 25503 . + . gene_id "ENSMMUG00000005947"; gene_version "3"; transcript_id "ENSMMUT00000063154"; transcript_version "1"; exon_number "1"; gene_name "SAMD11"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "SAMD11-201"; transcript_source "ensembl"; transcript_biotype "protein_coding"; exon_id "ENSMMUE00000311984"; exon_version "2";

1 ensembl exon 25432 25503 . + . gene_id "ENSMMUG00000005947"; gene_version "3"; transcript_id "ENSMMUT00000076984"; transcript_version "1"; exon_number "1"; gene_name "SAMD11"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "SAMD11-202"; transcript_source "ensembl"; transcript_biotype "protein_coding"; exon_id "ENSMMUE00000311984"; exon_version "2";

1 ensembl exon 25432 25503 . + . gene_id "ENSMMUG00000005947"; gene_version "3"; transcript_id "ENSMMUT00000054447"; transcript_version "1"; exon_number "1"; gene_name "SAMD11"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_name "SAMD11-203"; transcript_source "ensembl"; transcript_biotype "protein_coding"; exon_id "ENSMMUE00000311984"; exon_version "2";
.
.
.
Although the exon id is the same, the transcript id & name is different.

How should this be handled?

many thanks!

Is there any other information that I can provide that might help?

thanks!

Sorry for not getting back to you sooner. rnaseqc does not support full annotation gtfs. Please use this script to collapse your annotation, which effectively produces the union of transcripts for each gene. Let me know if there are additional problems or of you have any other questions

Thanks!
I get the following error:

python3 collapse_annotation.py ../../Macaca_mulatta.Mmul_8.0.1.97.chr.gtf ../../Macaca_mulatta.Mmul_8.0.1.97.chr.uniq.exonid.gtf
Traceback (most recent call last):
File "collapse_annotation.py", line 6, in
from bx.intervals.intersection import IntervalTree
ModuleNotFoundError: No module named 'bx'

pip install bx-python

Thanks! I still seem to get an error:

/usr/bin/python3.6 collapse_annotation.py ../../Macaca_mulatta.Mmul_8.0.1.97.chr.gtf ../../Macaca_mulatta.Mmul_8.0.1.97.chr.uniq.exonid.gtf
Traceback (most recent call last):
File "collapse_annotation.py", line 249, in
annotation = Annotation(args.transcript_gtf)
File "collapse_annotation.py", line 68, in init
g = Gene(gene_id, attributes['gene_name'], attributes['gene_type'], chrom, strand, start_pos, end_pos)
KeyError: 'gene_type'

Make sure you're using the V9 branch that I linked above. The version of that file in the V9 branch can handle GTFs which use gene_biotype instead of gene_type

Hi, I'm getting a similar error KeyError:'transcript_type' after using the collapse_annotation.py script from the link that you shared. Any help would be appreciated.

Does the GTF you're collapsing contain the transcript_type or transcript_biotype fields?

Thanks for your response; I just checked- it contains neither.

Okay, both the collapsing script and RNA-SeQC require one of those fields to be populated (which gets converted to transcript_type during the collapsing process).

Thanks again for your response. Sorry, I'm not clear as to how I should proceed here. My gff file is from NCBI (exact link here: https://www.ncbi.nlm.nih.gov/genome/?term=txid8128[orgn]) and I ran "gffread -E GCF_001858045.2_O_niloticus_UMD_NMBU_genomic.gff -T -o GCF_001858045.2_O_niloticus_UMD_NMBU_genomic.gtf". I'm trying to run the collapsing script on GCF_001858045.2_O_niloticus_UMD_NMBU_genomic.gtf so that I can eventually get RNA-SeQC to work. Please let me know how I can modify further my GCF_001858045.2_O_niloticus_UMD_NMBU_genomic.gtf file.

@Anto007, sorry for the delay. We're working on a patch for the collapse_annotation.py script and will update you here when it's ready

Great news! Thank you so much.

The problem with your GTF is that it doesn't follow the specification at https://www.gencodegenes.org/pages/data_format.html. Our script requires that the top-level feature is a gene. You can either patch the GTF file, or try using a different converter.

Thank you for your response @francois-a; I was wondering if could suggest me a gff to gtf converter other than gffread using which I might have success at running rnaseqc?

Hello. I got the same error message when using RNAseQC 2.3.6. as the follow:

cmd
[1] "rnaseqc /data/test/gencode.v28.primary_assembly.annotation.gtf /data/test/aligned/star_ZAM1Aligned.sortedByCoord_added.bam /data/test/qualitycontrol/rna_seqc/test.bed"
system(cmd)
Failed to parse the GTF: Detected non-unique Exon ID: ENSE00001957285.1

I wanted to try agraubert's script (https://github.com/broadinstitute/gtex-pipeline/tree/v9/gene_model) to collapse the annotation, but the link goes to page error 404. Any suggestions? Thank you very much!

Hi,
Please use this link: https://github.com/broadinstitute/gtex-pipeline/tree/master/gene_model

Thank you, Francois. It works.

Apologies for unearthing a dead issue.

I was attempting to use v2.3.5 on TAIR10 annotation, but ran into an issue with collapsing the annotation, since some gene_name/gene_id fields include the ";" character, which was causing inappropriate splitting. To fix, I reran, but edited the collapse script to include a slightly more robust parser for attribute fields like so-

for a in re.finditer(r'(.+?) "(.+?)";', row[8].replace('_biotype', '_type')):  
    if a.group(1).strip() != 'tag':   
        attributes[a.group(1).strip()] = a.group(2)   
    else:   
         attributes['tags'].append(a.group(2))   

That seemed to work, and collapse was completed successfully (?). However, I ran into another issue while running rnaseqc- "Failed to parse the GTF: Detected non-unique Gene ID: AT1G01020".

Grepping through the annotation, I can see the gene id corresponds with two different isoforms, but those have separate transcript_ids. This shouldn't be an uncommon occurrence, so I'm not sure why that's a hindrance to this. Looking through it all, it seems like the exon_id seems to have just taken f"{gene_id}_{exon_num}" rather than using f"{transcript_id}_{exon_num}".

Can you recommend a solution to this please?

The two different gene isoforms should be listed as separate transcript entries, not genes. There should be one gene entry with a unique id, and a transcript entry for each isoform. The transcripts should have unique transcript ids, but share the same gene id.

hi @agraubert - just to clarify; is the implication here that the issue is stemming from there being multiple lines in the annotation with the third column set to 'gene' (for the same gene_id)? If so, that should give me some direction on how to re-process this annotation. Thank you!

Yes, that's exactly right! If an entry in the GTF declares itself as a gene, then RNA-SeQC expects it to have a unique gene_id entry.

gtex-pipeline/gene_model/collapse_annotation.py", line 294, in <module> annotation = Annotation(args.transcript_gtf) gtex-pipeline/gene_model/collapse_annotation.py", line 67, in __init__ attributes[kv[0]] = kv[1] IndexError: list index out of range

Its throwing an error like this

Hmmm. It appears to be caused by an attribute of some entry in the GTF not containing an actual key-value pair. This script expects all attributes to be a space-separated key-value pairs

I have downloaded the gff3 for TAIR10 from Ensembl and converted gff3 to gtf using agat_convert_sp_gff2gtf.pl from agat. Then I am using gtex-pipeline/gene_model/collapse_annotation.py to make it compatible with rna_seqc. So would you please suggest me that how would we process the gff to make it compatible with this script also without losing information?

@agraubert @ramsainanduri if this is the same annotation I was working on- the issue is that the conversion converts the gtf to include ";" characters inside the value attribute. For instance this line-

gene_source "ensembl"; gene_id "AT1G14040"; gene_biotype "protein_coding"; tss_id "TSS29975"; gene_version "1"; gene_name "PHO1;H3";

is an issue because it yields a value H3 that can't be broken down into key-value pairs. That's the whole reason I needed to rewrite that section of the parser using re (code in my first comment above). (In interests of full disclosure, I was using gffread from cufflinks for conversion, so I expect some parts to be different between the gtf files we have on hand)

Thanks @p-vishwanath , I will try your code and see if it works for me.