danilotat/UTR_add_extend_GTF

Issue with gtf_advanced_parser.py and Ensembl GTF

chvandijk opened this issue · 7 comments

Hi,

I would like to use this script. However, when I want to run the advanced gtf parser script on the esembl gtf (Homo_sapiens.GRCh38.98.gtf) I get the following error:

python3 gtf_advanced_parser.py --input $gtf --output $out

Traceback (most recent call last):
  File "gtf_advanced_parser.py", line 102, in fill_missing_utrs
    utr_three=three_prime_utrs_records[transcript_id]
KeyError: '"ENST00000478729"'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "gtf_advanced_parser.py", line 194, in <module>
    fill_missing_utrs(gene_id_records)
  File "gtf_advanced_parser.py", line 106, in fill_missing_utrs
    last_cds= GTF_record(*cdss_records[transcript_id][-1].split('\t')) 
KeyError: '"ENST00000478729"'

This transcripts happens to be a lncRNA, maybe this has to do something with it? This is what the lines of this transcript look like in the gtf:

1	havana	transcript	940346	942173	.	+	.	gene_id "ENSG00000187634"; gene_version "12"; transcript_id "ENST00000478729"; transcript_version "1"; gene_name "SAMD11"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "SAMD11-209"; transcript_source "havana"; transcript_biotype "lncRNA"; transcript_support_level "5";
1	havana	exon	940346	940462	.	+	.	gene_id "ENSG00000187634"; gene_version "12"; transcript_id "ENST00000478729"; transcript_version "1"; exon_number "1"; gene_name "SAMD11"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "SAMD11-209"; transcript_source "havana"; transcript_biotype "lncRNA"; exon_id "ENSE00001953327"; exon_version "1"; transcript_support_level "5";
1	havana	exon	941144	941306	.	+	.	gene_id "ENSG00000187634"; gene_version "12"; transcript_id "ENST00000478729"; transcript_version "1"; exon_number "2"; gene_name "SAMD11"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "SAMD11-209"; transcript_source "havana"; transcript_biotype "lncRNA"; exon_id "ENSE00003686968"; exon_version "1"; transcript_support_level "5";
1	havana	exon	942136	942173	.	+	.	gene_id "ENSG00000187634"; gene_version "12"; transcript_id "ENST00000478729"; transcript_version "1"; exon_number "3"; gene_name "SAMD11"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "SAMD11-209"; transcript_source "havana"; transcript_biotype "lncRNA"; exon_id "ENSE00001881351"; exon_version "1"; transcript_support_level "5";

Hi @chvandijk , sorry for such a long delay. I was aware of errors like yours' and planned long time ago to fix them, but had literally no time. I've completely rewritten the code with the last commit (7340123) fixing known errors like this. Feel free to reopen the issue if you spot any other problems.

Hello @danilotat. I'm getting a similar error but it isn't clear what the problem is; it's starting from the very beginning. Here is the command and error:

client.py --input braker_annotations_anysupport.names.exonUTRs_10Xdata.cleaned.gtf --output braker_annotations_anysupport.names.exonUTRs_10Xdata.cleaned.extended.gtf --length 1000 --min_dist 20 --logs extension.log

Processing chromosome BsChromosome1
Traceback (most recent call last):
File "/home/jgoodheart/mendel-nas1/programs/UTR_add_extend_GTF/client.py", line 18, in
chromosome_wise(args.input, args.output, args.length, int(args.min_dist), args.logs)
File "/mendel-nas1/jgoodheart/programs/UTR_add_extend_GTF/gtftools/workers.py", line 343, in chromosome_wise
other_entries(entry, transcripts)
File "/mendel-nas1/jgoodheart/programs/UTR_add_extend_GTF/gtftools/workers.py", line 36, in other_entries
transcript_collection[transcript_id]._add_exon(entry)
KeyError: 'jg22355.t1'

And this is what the gtf file looks like:
BsChromosome1 AUGUSTUS exon 613869 613957 0.84 + 0 transcript_id "jg22355.t1"; gene_id "jg22355"; cds_type "initial"; BsChromosome1 AUGUSTUS exon 616908 616923 0.95 + 1 transcript_id "jg22355.t1"; gene_id "jg22355"; cds_type "internal"; BsChromosome1 AUGUSTUS exon 624805 624846 0.95 + 0 transcript_id "jg22355.t1"; gene_id "jg22355"; cds_type "internal"; BsChromosome1 AUGUSTUS exon 625694 625780 1 + 0 transcript_id "jg22355.t1"; gene_id "jg22355"; cds_type "internal"; BsChromosome1 AUGUSTUS exon 626621 626679 1 + 0 transcript_id "jg22355.t1"; gene_id "jg22355"; cds_type "internal"; BsChromosome1 AUGUSTUS exon 628091 628258 1 + 1 transcript_id "jg22355.t1"; gene_id "jg22355"; cds_type "internal"; BsChromosome1 AUGUSTUS exon 629261 629363 1 + 1 transcript_id "jg22355.t1"; gene_id "jg22355"; cds_type "internal"; BsChromosome1 AUGUSTUS exon 635892 636044 1 + 0 transcript_id "jg22355.t1"; gene_id "jg22355"; cds_type "internal"; BsChromosome1 AUGUSTUS exon 636422 636508 1 + 0 transcript_id "jg22355.t1"; gene_id "jg22355"; cds_type "internal"; BsChromosome1 AUGUSTUS exon 638788 638910 1 + 0 transcript_id "jg22355.t1"; gene_id "jg22355"; cds_type "internal";

Hi @goodgodric28 , is the gtf only composed of exons (referring to the 3rd column)? If not, please attach a more comprehensive example spanning multiple genes.

BsChromosome1	AUGUSTUS	exon	613869	613957	0.84	+	0	transcript_id "jg22355.t1"; gene_id "jg22355"; cds_type "initial";
BsChromosome1	AUGUSTUS	exon	616908	616923	0.95	+	1	transcript_id "jg22355.t1"; gene_id "jg22355"; cds_type "internal";
BsChromosome1	AUGUSTUS	exon	624805	624846	0.95	+	0	transcript_id "jg22355.t1"; gene_id "jg22355"; cds_type "internal";
BsChromosome1	AUGUSTUS	exon	625694	625780	1	+	0	transcript_id "jg22355.t1"; gene_id "jg22355"; cds_type "internal";
BsChromosome1	AUGUSTUS	exon	626621	626679	1	+	0	transcript_id "jg22355.t1"; gene_id "jg22355"; cds_type "internal";
BsChromosome1	AUGUSTUS	exon	628091	628258	1	+	1	transcript_id "jg22355.t1"; gene_id "jg22355"; cds_type "internal";
BsChromosome1	AUGUSTUS	exon	629261	629363	1	+	1	transcript_id "jg22355.t1"; gene_id "jg22355"; cds_type "internal";
BsChromosome1	AUGUSTUS	exon	635892	636044	1	+	0	transcript_id "jg22355.t1"; gene_id "jg22355"; cds_type "internal";
BsChromosome1	AUGUSTUS	exon	636422	636508	1	+	0	transcript_id "jg22355.t1"; gene_id "jg22355"; cds_type "internal";
BsChromosome1	AUGUSTUS	exon	638788	638910	1	+	0	transcript_id "jg22355.t1"; gene_id "jg22355"; cds_type "internal";
BsChromosome1	AUGUSTUS	exon	640263	640298	1	+	0	transcript_id "jg22355.t1"; gene_id "jg22355"; cds_type "terminal";
BsChromosome1	peaks2utr	exon	640299	644258	.	+	.	transcript_id "jg22355.t1"; gene_id "jg22355"; colour "3";
BsChromosome1	AUGUSTUS	intron	616924	624804	0.95	+	1	transcript_id "jg22355.t1"; gene_id "jg22355"; supported "True";
BsChromosome1	AUGUSTUS	intron	624847	625693	0.95	+	0	transcript_id "jg22355.t1"; gene_id "jg22355"; supported "True";
BsChromosome1	AUGUSTUS	intron	625781	626620	1	+	0	transcript_id "jg22355.t1"; gene_id "jg22355"; supported "True";
BsChromosome1	AUGUSTUS	intron	626680	628090	1	+	0	transcript_id "jg22355.t1"; gene_id "jg22355"; supported "True";
BsChromosome1	AUGUSTUS	intron	628259	629260	1	+	1	transcript_id "jg22355.t1"; gene_id "jg22355"; supported "True";
BsChromosome1	AUGUSTUS	intron	629364	635891	1	+	1	transcript_id "jg22355.t1"; gene_id "jg22355"; supported "True";
BsChromosome1	AUGUSTUS	intron	636045	636421	1	+	0	transcript_id "jg22355.t1"; gene_id "jg22355"; supported "True";
BsChromosome1	AUGUSTUS	intron	636509	638787	1	+	0	transcript_id "jg22355.t1"; gene_id "jg22355"; supported "True";
BsChromosome1	AUGUSTUS	intron	638911	640262	1	+	0	transcript_id "jg22355.t1"; gene_id "jg22355"; supported "True";
BsChromosome1	AUGUSTUS	intron	613958	616907	0.84	+	0	transcript_id "jg22355.t1"; gene_id "jg22355"; supported "False";
BsChromosome1	gffutils_derived	transcript	613869	644500	.	+	.     transcript_id "jg22355.t1"; gene_id "jg22355,jg22355.t1";
BsChromosome1	AUGUSTUS	start_codon	613869	613871	.	+	0	transcript_id "jg22355.t1"; gene_id "jg22355"; supported "False";
BsChromosome1	AUGUSTUS	stop_codon	640296	640298	.	+	0	transcript_id "jg22355.t1"; gene_id "jg22355"; supported "False";
BsChromosome1	peaks2utr	three_prime_UTR	640299	644258	.	+	.	transcript_id "jg22355.t1"; gene_id "jg22355"; colour "3";
BsChromosome1	peaks2utr	exon	640299	644258	.	+	.	transcript_id "jg22355.t1"; gene_id "jg22355"; colour "3";
BsChromosome1	peaks2utr	three_prime_UTR	644259	644500	.	+	.	gene_id "jg22355"; transcript_id "jg22355.t1"; colour "4";
BsChromosome1	peaks2utr	exon	644259	644500	.	+	.	gene_id "jg22355"; transcript_id "jg22355.t1"; colour "4";
BsChromosome1	AUGUSTUS	exon	741527	741652	1	+	0	transcript_id "jg22362.t1"; gene_id "jg22362"; cds_type "initial";
BsChromosome1	AUGUSTUS	exon	743281	743392	1	+	0	transcript_id "jg22362.t1"; gene_id "jg22362"; cds_type "internal";
BsChromosome1	AUGUSTUS	exon	744558	744601	1	+	2	transcript_id "jg22362.t1"; gene_id "jg22362"; cds_type "internal";
BsChromosome1	AUGUSTUS	exon	744870	744940	1	+	0	transcript_id "jg22362.t1"; gene_id "jg22362"; cds_type "internal";
BsChromosome1	AUGUSTUS	exon	745652	745701	1	+	1	transcript_id "jg22362.t1"; gene_id "jg22362"; cds_type "internal";
BsChromosome1	AUGUSTUS	exon	746565	746638	1	+	2	transcript_id "jg22362.t1"; gene_id "jg22362"; cds_type "internal";
BsChromosome1	AUGUSTUS	exon	747605	747655	1	+	0	transcript_id "jg22362.t1"; gene_id "jg22362"; cds_type "internal";
BsChromosome1	AUGUSTUS	exon	749195	749330	1	+	0	transcript_id "jg22362.t1"; gene_id "jg22362"; cds_type "internal";
BsChromosome1	AUGUSTUS	exon	749684	749778	1	+	2	transcript_id "jg22362.t1"; gene_id "jg22362"; cds_type "internal";
BsChromosome1	AUGUSTUS	exon	751239	751418	1	+	0	transcript_id "jg22362.t1"; gene_id "jg22362"; cds_type "terminal";
BsChromosome1	AUGUSTUS	intron	741653	743280	1	+	0	transcript_id "jg22362.t1"; gene_id "jg22362"; supported "True";
BsChromosome1	AUGUSTUS	intron	743393	744557	1	+	0	transcript_id "jg22362.t1"; gene_id "jg22362"; supported "True";
BsChromosome1	AUGUSTUS	intron	744602	744869	1	+	2	transcript_id "jg22362.t1"; gene_id "jg22362"; supported "True";
BsChromosome1	AUGUSTUS	intron	744941	745651	1	+	0	transcript_id "jg22362.t1"; gene_id "jg22362"; supported "True";
BsChromosome1	AUGUSTUS	intron	745702	746564	1	+	1	transcript_id "jg22362.t1"; gene_id "jg22362"; supported "True";
BsChromosome1	AUGUSTUS	intron	746639	747604	1	+	2	transcript_id "jg22362.t1"; gene_id "jg22362"; supported "True";
BsChromosome1	AUGUSTUS	intron	747656	749194	1	+	0	transcript_id "jg22362.t1"; gene_id "jg22362"; supported "True";
BsChromosome1	AUGUSTUS	intron	749331	749683	1	+	0	transcript_id "jg22362.t1"; gene_id "jg22362"; supported "True";
BsChromosome1	AUGUSTUS	intron	749779	751238	1	+	2	transcript_id "jg22362.t1"; gene_id "jg22362"; supported "True";
BsChromosome1	gffutils_derived	transcript	741527	751634	.	+	.	transcript_id "jg22362.t1"; gene_id "jg22362.t1,jg22362";
BsChromosome1	AUGUSTUS	start_codon	741527	741529	.	+	0	transcript_id "jg22362.t1"; gene_id "jg22362"; supported "False";

Spotted some issues:

  • The provided example lacks of entries regarding genes, as expected in a standard Ensembl GTF file. This is key because the tool works by creating a hierarchical representation of the GTF made by genes-transcripts-features. Is this expected?
  • By default the extension is computed only on protein_coding genes, to resemble the biological meaning of an untranslated region. This tag is extrapolated here
    if feat_dict["gene_biotype"] == 'protein_coding':
    return True
    else:
    return False
    using the "gene_biotype" attribute of the GTF. In your case, giving the absence of this key, you need to adjust it.. An idea could be to replace turning both to True , meaning that the extension would be tried over all the possible entries.
  • Last, your GTF is not sorted, as it'd expected to be. I couldn't check the sorting at runtime so it's hard to raise a specific error, sorry for that. You could sort the input file by using

awk '$1 ~ /^#/ {print $0;next} {print $0 | "sort -k1,1 -k4,4n -k5,5n"}' in.gtf > out_sorted.gtf

Ok. I fixed those problems but I'm getting the same error. Here is the new file:

BsChromosome1   AGAT    gene    4613    8271    .       +       .       gene_id "jg22326"; transcript_id "jg22326.t1"; ID "jg22326"; gene_biotype "protein_coding";
BsChromosome1   gffutils_derived        transcript      4613    8271    .       +       .       gene_id "jg22326"; transcript_id "jg22326.t1"; ID "jg22326.t1"; Parent "jg22326";
BsChromosome1   AUGUSTUS        exon    4613    4723    0.77    +       0       gene_id "jg22326"; transcript_id "jg22326.t1"; ID "agat-exon-1"; Parent "jg22326.t1"; cds_type "initial";
BsChromosome1   AUGUSTUS        exon    5597    5717    1       +       0       gene_id "jg22326"; transcript_id "jg22326.t1"; ID "agat-exon-2"; Parent "jg22326.t1"; cds_type "internal";
BsChromosome1   AUGUSTUS        exon    6653    6788    1       +       2       gene_id "jg22326"; transcript_id "jg22326.t1"; ID "agat-exon-3"; Parent "jg22326.t1"; cds_type "internal";
BsChromosome1   AUGUSTUS        exon    7612    7880    0.97    +       1       gene_id "jg22326"; transcript_id "jg22326.t1"; ID "agat-exon-4"; Parent "jg22326.t1"; cds_type "internal";
BsChromosome1   AUGUSTUS        exon    8174    8271    0.55    +       2       gene_id "jg22326"; transcript_id "jg22326.t1"; ID "agat-exon-5"; Parent "jg22326.t1"; cds_type "terminal";
BsChromosome1   AUGUSTUS        intron  4724    5596    0.77    +       0       gene_id "jg22326"; transcript_id "jg22326.t1"; ID "agat-intron-1"; Parent "jg22326.t1"; supported "True";
BsChromosome1   AUGUSTUS        intron  5718    6652    1       +       0       gene_id "jg22326"; transcript_id "jg22326.t1"; ID "agat-intron-2"; Parent "jg22326.t1"; supported "True";
BsChromosome1   AUGUSTUS        intron  6789    7611    1       +       2       gene_id "jg22326"; transcript_id "jg22326.t1"; ID "agat-intron-3"; Parent "jg22326.t1"; supported "True";
BsChromosome1   AUGUSTUS        intron  7881    8173    0.97    +       1       gene_id "jg22326"; transcript_id "jg22326.t1"; ID "agat-intron-4"; Parent "jg22326.t1"; supported "False";
Bschromosome1   AUGUSTUS        start_codon     4613    4615    .       +       0       gene_id "jg22326"; transcript_id "jg22326.t1"; ID "agat-start_codon-1"; Parent "jg22326.t1"; supported "False";
BsChromosome1   AUGUSTUS        stop_codon      8269    8271    .       +       0       gene_id "jg22326"; transcript_id "jg22326.t1"; ID "agat-stop_codon-1"; Parent "jg22326.t1"; supported "False";
BsChromosome1   AGAT    gene    42756   56893   .       -       .       gene_id "jg22328"; transcript_id "jg22328.t1"; ID "jg22328"; gene_biotype "protein_coding";
BsChromosome1   gffutils_derived        transcript      42756   56893   .       -       .       gene_id "jg22328"; transcript_id "jg22328.t1"; ID "jg22328.t1"; Parent "jg22328";
BsChromosome1   AUGUSTUS        exon    42756   42896   0.65    -       0       gene_id "jg22328"; transcript_id "jg22328.t1"; ID "agat-exon-6"; Parent "jg22328.t1"; cds_type "terminal";
BsChromosome1   AUGUSTUS        exon    49721   49887   0.68    -       2       gene_id "jg22328"; transcript_id "jg22328.t1"; ID "agat-exon-7"; Parent "jg22328.t1"; cds_type "internal";
BsChromosome1   AUGUSTUS        exon    53912   54028   1       -       2       gene_id "jg22328"; transcript_id "jg22328.t1"; ID "agat-exon-8"; Parent "jg22328.t1"; cds_type "internal";
BsChromosome1   AUGUSTUS        exon    56884   56893   1       -       0       gene_id "jg22328"; transcript_id "jg22328.t1"; ID "agat-exon-9"; Parent "jg22328.t1"; cds_type "initial";
BsChromosome1   AUGUSTUS        intron  42897   49720   0.65    -       0       gene_id "jg22328"; transcript_id "jg22328.t1"; ID "agat-intron-5"; Parent "jg22328.t1"; supported "False";
BsChromosome1   AUGUSTUS        intron  49888   53911   0.68    -       2       gene_id "jg22328"; transcript_id "jg22328.t1"; ID "agat-intron-6"; Parent "jg22328.t1"; supported "True";
BsChromosome1   AUGUSTUS        intron  54029   56883   1       -       2       gene_id "jg22328"; transcript_id "jg22328.t1"; ID "agat-intron-7"; Parent "jg22328.t1"; supported "True";
BsChromosome1   AUGUSTUS        start_codon     56891   56893   .       -       0       gene_id "jg22328"; transcript_id "jg22328.t1"; ID "agat-start_codon-2"; Parent "jg22328.t1"; supported "False";
BsChromosome1   AUGUSTUS        stop_codon      42756   42758   .       -       0       gene_id "jg22328"; transcript_id "jg22328.t1"; ID "agat-stop_codon-2"; Parent "jg22328.t1"; supported "False";
BsChromosome1   AGAT    gene    71952   107431  .       -       .       gene_id "jg22330"; transcript_id "jg22330.t1"; ID "jg22330"; gene_biotype "protein_coding";
BsChromosome1   gffutils_derived        transcript      71952   107431  .       -       .       gene_id "jg22330"; transcript_id "jg22330.t1"; ID "jg22330.t1"; Parent "jg22330";
BsChromosome1   AUGUSTUS        exon    71952   72215   0.75    -       0       gene_id "jg22330"; transcript_id "jg22330.t1"; ID "agat-exon-10"; Parent "jg22330.t1"; cds_type "terminal";
BsChromosome1   AUGUSTUS        exon    77113   77212   1       -       1       gene_id "jg22330"; transcript_id "jg22330.t1"; ID "agat-exon-11"; Parent "jg22330.t1"; cds_type "internal";
BsChromosome1   AUGUSTUS        exon    80366   80593   1       -       1       gene_id "jg22330"; transcript_id "jg22330.t1"; ID "agat-exon-12"; Parent "jg22330.t1"; cds_type "internal";
BsChromosome1   AUGUSTUS        exon    81821   81917   1       -       2       gene_id "jg22330"; transcript_id "jg22330.t1"; ID "agat-exon-13"; Parent "jg22330.t1"; cds_type "internal";
BsChromosome1   AUGUSTUS        exon    86028   86224   1       -       1       gene_id "jg22330"; transcript_id "jg22330.t1"; ID "agat-exon-14"; Parent "jg22330.t1"; cds_type "internal";

Any idea where the new issue is?

The provided input has typos and uncorrect number of spaces.

In the first start_codon entry

BsChromosome1   AUGUSTUS        intron  7881    8173    0.97    +       1       gene_id "jg22326"; transcript_id "jg22326.t1"; ID "agat-intron-4"; Parent "jg22326.t1"; supported "False";
Bschromosome1   AUGUSTUS        start_codon     4613    4615    .       +       0       gene_id "jg22326"; transcript_id "jg22326.t1"; ID "agat-start_codon-1"; Parent "jg22326.t1"; supported "False";
BsChromosome1   AUGUSTUS        stop_codon      8269    8271    .       +       0       gene_id "jg22326"; transcript_id "jg22326.t1"; ID "agat-stop_codon-1"; Parent "jg22326.t1"; supported "False";

The chromosome name is different (from BsChr to Bschr)
After fixing the typo and the usage of spaces to reformat into a valid tab separated file with

sed -i 's/ \{2,\}/\t/g' filename

it runs correctly on your input.
Let me know if you encountered any other errors or if I could proceed closing the issue.