Issue with gtf_advanced_parser.py and Ensembl GTF
chvandijk opened this issue · 7 comments
Hi,
I would like to use this script. However, when I want to run the advanced gtf parser script on the esembl gtf (Homo_sapiens.GRCh38.98.gtf) I get the following error:
python3 gtf_advanced_parser.py --input $gtf --output $out
Traceback (most recent call last):
File "gtf_advanced_parser.py", line 102, in fill_missing_utrs
utr_three=three_prime_utrs_records[transcript_id]
KeyError: '"ENST00000478729"'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "gtf_advanced_parser.py", line 194, in <module>
fill_missing_utrs(gene_id_records)
File "gtf_advanced_parser.py", line 106, in fill_missing_utrs
last_cds= GTF_record(*cdss_records[transcript_id][-1].split('\t'))
KeyError: '"ENST00000478729"'
This transcripts happens to be a lncRNA, maybe this has to do something with it? This is what the lines of this transcript look like in the gtf:
1 havana transcript 940346 942173 . + . gene_id "ENSG00000187634"; gene_version "12"; transcript_id "ENST00000478729"; transcript_version "1"; gene_name "SAMD11"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "SAMD11-209"; transcript_source "havana"; transcript_biotype "lncRNA"; transcript_support_level "5";
1 havana exon 940346 940462 . + . gene_id "ENSG00000187634"; gene_version "12"; transcript_id "ENST00000478729"; transcript_version "1"; exon_number "1"; gene_name "SAMD11"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "SAMD11-209"; transcript_source "havana"; transcript_biotype "lncRNA"; exon_id "ENSE00001953327"; exon_version "1"; transcript_support_level "5";
1 havana exon 941144 941306 . + . gene_id "ENSG00000187634"; gene_version "12"; transcript_id "ENST00000478729"; transcript_version "1"; exon_number "2"; gene_name "SAMD11"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "SAMD11-209"; transcript_source "havana"; transcript_biotype "lncRNA"; exon_id "ENSE00003686968"; exon_version "1"; transcript_support_level "5";
1 havana exon 942136 942173 . + . gene_id "ENSG00000187634"; gene_version "12"; transcript_id "ENST00000478729"; transcript_version "1"; exon_number "3"; gene_name "SAMD11"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "SAMD11-209"; transcript_source "havana"; transcript_biotype "lncRNA"; exon_id "ENSE00001881351"; exon_version "1"; transcript_support_level "5";
Hi @chvandijk , sorry for such a long delay. I was aware of errors like yours' and planned long time ago to fix them, but had literally no time. I've completely rewritten the code with the last commit (7340123) fixing known errors like this. Feel free to reopen the issue if you spot any other problems.
Hello @danilotat. I'm getting a similar error but it isn't clear what the problem is; it's starting from the very beginning. Here is the command and error:
client.py --input braker_annotations_anysupport.names.exonUTRs_10Xdata.cleaned.gtf --output braker_annotations_anysupport.names.exonUTRs_10Xdata.cleaned.extended.gtf --length 1000 --min_dist 20 --logs extension.log
Processing chromosome BsChromosome1
Traceback (most recent call last):
File "/home/jgoodheart/mendel-nas1/programs/UTR_add_extend_GTF/client.py", line 18, in
chromosome_wise(args.input, args.output, args.length, int(args.min_dist), args.logs)
File "/mendel-nas1/jgoodheart/programs/UTR_add_extend_GTF/gtftools/workers.py", line 343, in chromosome_wise
other_entries(entry, transcripts)
File "/mendel-nas1/jgoodheart/programs/UTR_add_extend_GTF/gtftools/workers.py", line 36, in other_entries
transcript_collection[transcript_id]._add_exon(entry)
KeyError: 'jg22355.t1'
And this is what the gtf file looks like:
BsChromosome1 AUGUSTUS exon 613869 613957 0.84 + 0 transcript_id "jg22355.t1"; gene_id "jg22355"; cds_type "initial"; BsChromosome1 AUGUSTUS exon 616908 616923 0.95 + 1 transcript_id "jg22355.t1"; gene_id "jg22355"; cds_type "internal"; BsChromosome1 AUGUSTUS exon 624805 624846 0.95 + 0 transcript_id "jg22355.t1"; gene_id "jg22355"; cds_type "internal"; BsChromosome1 AUGUSTUS exon 625694 625780 1 + 0 transcript_id "jg22355.t1"; gene_id "jg22355"; cds_type "internal"; BsChromosome1 AUGUSTUS exon 626621 626679 1 + 0 transcript_id "jg22355.t1"; gene_id "jg22355"; cds_type "internal"; BsChromosome1 AUGUSTUS exon 628091 628258 1 + 1 transcript_id "jg22355.t1"; gene_id "jg22355"; cds_type "internal"; BsChromosome1 AUGUSTUS exon 629261 629363 1 + 1 transcript_id "jg22355.t1"; gene_id "jg22355"; cds_type "internal"; BsChromosome1 AUGUSTUS exon 635892 636044 1 + 0 transcript_id "jg22355.t1"; gene_id "jg22355"; cds_type "internal"; BsChromosome1 AUGUSTUS exon 636422 636508 1 + 0 transcript_id "jg22355.t1"; gene_id "jg22355"; cds_type "internal"; BsChromosome1 AUGUSTUS exon 638788 638910 1 + 0 transcript_id "jg22355.t1"; gene_id "jg22355"; cds_type "internal";
Hi @goodgodric28 , is the gtf only composed of exons (referring to the 3rd column)? If not, please attach a more comprehensive example spanning multiple genes.
BsChromosome1 AUGUSTUS exon 613869 613957 0.84 + 0 transcript_id "jg22355.t1"; gene_id "jg22355"; cds_type "initial";
BsChromosome1 AUGUSTUS exon 616908 616923 0.95 + 1 transcript_id "jg22355.t1"; gene_id "jg22355"; cds_type "internal";
BsChromosome1 AUGUSTUS exon 624805 624846 0.95 + 0 transcript_id "jg22355.t1"; gene_id "jg22355"; cds_type "internal";
BsChromosome1 AUGUSTUS exon 625694 625780 1 + 0 transcript_id "jg22355.t1"; gene_id "jg22355"; cds_type "internal";
BsChromosome1 AUGUSTUS exon 626621 626679 1 + 0 transcript_id "jg22355.t1"; gene_id "jg22355"; cds_type "internal";
BsChromosome1 AUGUSTUS exon 628091 628258 1 + 1 transcript_id "jg22355.t1"; gene_id "jg22355"; cds_type "internal";
BsChromosome1 AUGUSTUS exon 629261 629363 1 + 1 transcript_id "jg22355.t1"; gene_id "jg22355"; cds_type "internal";
BsChromosome1 AUGUSTUS exon 635892 636044 1 + 0 transcript_id "jg22355.t1"; gene_id "jg22355"; cds_type "internal";
BsChromosome1 AUGUSTUS exon 636422 636508 1 + 0 transcript_id "jg22355.t1"; gene_id "jg22355"; cds_type "internal";
BsChromosome1 AUGUSTUS exon 638788 638910 1 + 0 transcript_id "jg22355.t1"; gene_id "jg22355"; cds_type "internal";
BsChromosome1 AUGUSTUS exon 640263 640298 1 + 0 transcript_id "jg22355.t1"; gene_id "jg22355"; cds_type "terminal";
BsChromosome1 peaks2utr exon 640299 644258 . + . transcript_id "jg22355.t1"; gene_id "jg22355"; colour "3";
BsChromosome1 AUGUSTUS intron 616924 624804 0.95 + 1 transcript_id "jg22355.t1"; gene_id "jg22355"; supported "True";
BsChromosome1 AUGUSTUS intron 624847 625693 0.95 + 0 transcript_id "jg22355.t1"; gene_id "jg22355"; supported "True";
BsChromosome1 AUGUSTUS intron 625781 626620 1 + 0 transcript_id "jg22355.t1"; gene_id "jg22355"; supported "True";
BsChromosome1 AUGUSTUS intron 626680 628090 1 + 0 transcript_id "jg22355.t1"; gene_id "jg22355"; supported "True";
BsChromosome1 AUGUSTUS intron 628259 629260 1 + 1 transcript_id "jg22355.t1"; gene_id "jg22355"; supported "True";
BsChromosome1 AUGUSTUS intron 629364 635891 1 + 1 transcript_id "jg22355.t1"; gene_id "jg22355"; supported "True";
BsChromosome1 AUGUSTUS intron 636045 636421 1 + 0 transcript_id "jg22355.t1"; gene_id "jg22355"; supported "True";
BsChromosome1 AUGUSTUS intron 636509 638787 1 + 0 transcript_id "jg22355.t1"; gene_id "jg22355"; supported "True";
BsChromosome1 AUGUSTUS intron 638911 640262 1 + 0 transcript_id "jg22355.t1"; gene_id "jg22355"; supported "True";
BsChromosome1 AUGUSTUS intron 613958 616907 0.84 + 0 transcript_id "jg22355.t1"; gene_id "jg22355"; supported "False";
BsChromosome1 gffutils_derived transcript 613869 644500 . + . transcript_id "jg22355.t1"; gene_id "jg22355,jg22355.t1";
BsChromosome1 AUGUSTUS start_codon 613869 613871 . + 0 transcript_id "jg22355.t1"; gene_id "jg22355"; supported "False";
BsChromosome1 AUGUSTUS stop_codon 640296 640298 . + 0 transcript_id "jg22355.t1"; gene_id "jg22355"; supported "False";
BsChromosome1 peaks2utr three_prime_UTR 640299 644258 . + . transcript_id "jg22355.t1"; gene_id "jg22355"; colour "3";
BsChromosome1 peaks2utr exon 640299 644258 . + . transcript_id "jg22355.t1"; gene_id "jg22355"; colour "3";
BsChromosome1 peaks2utr three_prime_UTR 644259 644500 . + . gene_id "jg22355"; transcript_id "jg22355.t1"; colour "4";
BsChromosome1 peaks2utr exon 644259 644500 . + . gene_id "jg22355"; transcript_id "jg22355.t1"; colour "4";
BsChromosome1 AUGUSTUS exon 741527 741652 1 + 0 transcript_id "jg22362.t1"; gene_id "jg22362"; cds_type "initial";
BsChromosome1 AUGUSTUS exon 743281 743392 1 + 0 transcript_id "jg22362.t1"; gene_id "jg22362"; cds_type "internal";
BsChromosome1 AUGUSTUS exon 744558 744601 1 + 2 transcript_id "jg22362.t1"; gene_id "jg22362"; cds_type "internal";
BsChromosome1 AUGUSTUS exon 744870 744940 1 + 0 transcript_id "jg22362.t1"; gene_id "jg22362"; cds_type "internal";
BsChromosome1 AUGUSTUS exon 745652 745701 1 + 1 transcript_id "jg22362.t1"; gene_id "jg22362"; cds_type "internal";
BsChromosome1 AUGUSTUS exon 746565 746638 1 + 2 transcript_id "jg22362.t1"; gene_id "jg22362"; cds_type "internal";
BsChromosome1 AUGUSTUS exon 747605 747655 1 + 0 transcript_id "jg22362.t1"; gene_id "jg22362"; cds_type "internal";
BsChromosome1 AUGUSTUS exon 749195 749330 1 + 0 transcript_id "jg22362.t1"; gene_id "jg22362"; cds_type "internal";
BsChromosome1 AUGUSTUS exon 749684 749778 1 + 2 transcript_id "jg22362.t1"; gene_id "jg22362"; cds_type "internal";
BsChromosome1 AUGUSTUS exon 751239 751418 1 + 0 transcript_id "jg22362.t1"; gene_id "jg22362"; cds_type "terminal";
BsChromosome1 AUGUSTUS intron 741653 743280 1 + 0 transcript_id "jg22362.t1"; gene_id "jg22362"; supported "True";
BsChromosome1 AUGUSTUS intron 743393 744557 1 + 0 transcript_id "jg22362.t1"; gene_id "jg22362"; supported "True";
BsChromosome1 AUGUSTUS intron 744602 744869 1 + 2 transcript_id "jg22362.t1"; gene_id "jg22362"; supported "True";
BsChromosome1 AUGUSTUS intron 744941 745651 1 + 0 transcript_id "jg22362.t1"; gene_id "jg22362"; supported "True";
BsChromosome1 AUGUSTUS intron 745702 746564 1 + 1 transcript_id "jg22362.t1"; gene_id "jg22362"; supported "True";
BsChromosome1 AUGUSTUS intron 746639 747604 1 + 2 transcript_id "jg22362.t1"; gene_id "jg22362"; supported "True";
BsChromosome1 AUGUSTUS intron 747656 749194 1 + 0 transcript_id "jg22362.t1"; gene_id "jg22362"; supported "True";
BsChromosome1 AUGUSTUS intron 749331 749683 1 + 0 transcript_id "jg22362.t1"; gene_id "jg22362"; supported "True";
BsChromosome1 AUGUSTUS intron 749779 751238 1 + 2 transcript_id "jg22362.t1"; gene_id "jg22362"; supported "True";
BsChromosome1 gffutils_derived transcript 741527 751634 . + . transcript_id "jg22362.t1"; gene_id "jg22362.t1,jg22362";
BsChromosome1 AUGUSTUS start_codon 741527 741529 . + 0 transcript_id "jg22362.t1"; gene_id "jg22362"; supported "False";
Spotted some issues:
- The provided example lacks of entries regarding
genes
, as expected in a standard Ensembl GTF file. This is key because the tool works by creating a hierarchical representation of the GTF made bygenes
-transcripts
-features
. Is this expected? - By default the extension is computed only on protein_coding genes, to resemble the biological meaning of an untranslated region. This tag is extrapolated here
UTR_add_extend_GTF/gtftools/objects.py
Lines 86 to 89 in 7340123
True
, meaning that the extension would be tried over all the possible entries. - Last, your GTF is not sorted, as it'd expected to be. I couldn't check the sorting at runtime so it's hard to raise a specific error, sorry for that. You could sort the input file by using
awk '$1 ~ /^#/ {print $0;next} {print $0 | "sort -k1,1 -k4,4n -k5,5n"}' in.gtf > out_sorted.gtf
Ok. I fixed those problems but I'm getting the same error. Here is the new file:
BsChromosome1 AGAT gene 4613 8271 . + . gene_id "jg22326"; transcript_id "jg22326.t1"; ID "jg22326"; gene_biotype "protein_coding";
BsChromosome1 gffutils_derived transcript 4613 8271 . + . gene_id "jg22326"; transcript_id "jg22326.t1"; ID "jg22326.t1"; Parent "jg22326";
BsChromosome1 AUGUSTUS exon 4613 4723 0.77 + 0 gene_id "jg22326"; transcript_id "jg22326.t1"; ID "agat-exon-1"; Parent "jg22326.t1"; cds_type "initial";
BsChromosome1 AUGUSTUS exon 5597 5717 1 + 0 gene_id "jg22326"; transcript_id "jg22326.t1"; ID "agat-exon-2"; Parent "jg22326.t1"; cds_type "internal";
BsChromosome1 AUGUSTUS exon 6653 6788 1 + 2 gene_id "jg22326"; transcript_id "jg22326.t1"; ID "agat-exon-3"; Parent "jg22326.t1"; cds_type "internal";
BsChromosome1 AUGUSTUS exon 7612 7880 0.97 + 1 gene_id "jg22326"; transcript_id "jg22326.t1"; ID "agat-exon-4"; Parent "jg22326.t1"; cds_type "internal";
BsChromosome1 AUGUSTUS exon 8174 8271 0.55 + 2 gene_id "jg22326"; transcript_id "jg22326.t1"; ID "agat-exon-5"; Parent "jg22326.t1"; cds_type "terminal";
BsChromosome1 AUGUSTUS intron 4724 5596 0.77 + 0 gene_id "jg22326"; transcript_id "jg22326.t1"; ID "agat-intron-1"; Parent "jg22326.t1"; supported "True";
BsChromosome1 AUGUSTUS intron 5718 6652 1 + 0 gene_id "jg22326"; transcript_id "jg22326.t1"; ID "agat-intron-2"; Parent "jg22326.t1"; supported "True";
BsChromosome1 AUGUSTUS intron 6789 7611 1 + 2 gene_id "jg22326"; transcript_id "jg22326.t1"; ID "agat-intron-3"; Parent "jg22326.t1"; supported "True";
BsChromosome1 AUGUSTUS intron 7881 8173 0.97 + 1 gene_id "jg22326"; transcript_id "jg22326.t1"; ID "agat-intron-4"; Parent "jg22326.t1"; supported "False";
Bschromosome1 AUGUSTUS start_codon 4613 4615 . + 0 gene_id "jg22326"; transcript_id "jg22326.t1"; ID "agat-start_codon-1"; Parent "jg22326.t1"; supported "False";
BsChromosome1 AUGUSTUS stop_codon 8269 8271 . + 0 gene_id "jg22326"; transcript_id "jg22326.t1"; ID "agat-stop_codon-1"; Parent "jg22326.t1"; supported "False";
BsChromosome1 AGAT gene 42756 56893 . - . gene_id "jg22328"; transcript_id "jg22328.t1"; ID "jg22328"; gene_biotype "protein_coding";
BsChromosome1 gffutils_derived transcript 42756 56893 . - . gene_id "jg22328"; transcript_id "jg22328.t1"; ID "jg22328.t1"; Parent "jg22328";
BsChromosome1 AUGUSTUS exon 42756 42896 0.65 - 0 gene_id "jg22328"; transcript_id "jg22328.t1"; ID "agat-exon-6"; Parent "jg22328.t1"; cds_type "terminal";
BsChromosome1 AUGUSTUS exon 49721 49887 0.68 - 2 gene_id "jg22328"; transcript_id "jg22328.t1"; ID "agat-exon-7"; Parent "jg22328.t1"; cds_type "internal";
BsChromosome1 AUGUSTUS exon 53912 54028 1 - 2 gene_id "jg22328"; transcript_id "jg22328.t1"; ID "agat-exon-8"; Parent "jg22328.t1"; cds_type "internal";
BsChromosome1 AUGUSTUS exon 56884 56893 1 - 0 gene_id "jg22328"; transcript_id "jg22328.t1"; ID "agat-exon-9"; Parent "jg22328.t1"; cds_type "initial";
BsChromosome1 AUGUSTUS intron 42897 49720 0.65 - 0 gene_id "jg22328"; transcript_id "jg22328.t1"; ID "agat-intron-5"; Parent "jg22328.t1"; supported "False";
BsChromosome1 AUGUSTUS intron 49888 53911 0.68 - 2 gene_id "jg22328"; transcript_id "jg22328.t1"; ID "agat-intron-6"; Parent "jg22328.t1"; supported "True";
BsChromosome1 AUGUSTUS intron 54029 56883 1 - 2 gene_id "jg22328"; transcript_id "jg22328.t1"; ID "agat-intron-7"; Parent "jg22328.t1"; supported "True";
BsChromosome1 AUGUSTUS start_codon 56891 56893 . - 0 gene_id "jg22328"; transcript_id "jg22328.t1"; ID "agat-start_codon-2"; Parent "jg22328.t1"; supported "False";
BsChromosome1 AUGUSTUS stop_codon 42756 42758 . - 0 gene_id "jg22328"; transcript_id "jg22328.t1"; ID "agat-stop_codon-2"; Parent "jg22328.t1"; supported "False";
BsChromosome1 AGAT gene 71952 107431 . - . gene_id "jg22330"; transcript_id "jg22330.t1"; ID "jg22330"; gene_biotype "protein_coding";
BsChromosome1 gffutils_derived transcript 71952 107431 . - . gene_id "jg22330"; transcript_id "jg22330.t1"; ID "jg22330.t1"; Parent "jg22330";
BsChromosome1 AUGUSTUS exon 71952 72215 0.75 - 0 gene_id "jg22330"; transcript_id "jg22330.t1"; ID "agat-exon-10"; Parent "jg22330.t1"; cds_type "terminal";
BsChromosome1 AUGUSTUS exon 77113 77212 1 - 1 gene_id "jg22330"; transcript_id "jg22330.t1"; ID "agat-exon-11"; Parent "jg22330.t1"; cds_type "internal";
BsChromosome1 AUGUSTUS exon 80366 80593 1 - 1 gene_id "jg22330"; transcript_id "jg22330.t1"; ID "agat-exon-12"; Parent "jg22330.t1"; cds_type "internal";
BsChromosome1 AUGUSTUS exon 81821 81917 1 - 2 gene_id "jg22330"; transcript_id "jg22330.t1"; ID "agat-exon-13"; Parent "jg22330.t1"; cds_type "internal";
BsChromosome1 AUGUSTUS exon 86028 86224 1 - 1 gene_id "jg22330"; transcript_id "jg22330.t1"; ID "agat-exon-14"; Parent "jg22330.t1"; cds_type "internal";
Any idea where the new issue is?
The provided input has typos and uncorrect number of spaces.
In the first start_codon
entry
BsChromosome1 AUGUSTUS intron 7881 8173 0.97 + 1 gene_id "jg22326"; transcript_id "jg22326.t1"; ID "agat-intron-4"; Parent "jg22326.t1"; supported "False";
Bschromosome1 AUGUSTUS start_codon 4613 4615 . + 0 gene_id "jg22326"; transcript_id "jg22326.t1"; ID "agat-start_codon-1"; Parent "jg22326.t1"; supported "False";
BsChromosome1 AUGUSTUS stop_codon 8269 8271 . + 0 gene_id "jg22326"; transcript_id "jg22326.t1"; ID "agat-stop_codon-1"; Parent "jg22326.t1"; supported "False";
The chromosome name is different (from BsChr
to Bschr
)
After fixing the typo and the usage of spaces to reformat into a valid tab separated file with
sed -i 's/ \{2,\}/\t/g' filename
it runs correctly on your input.
Let me know if you encountered any other errors or if I could proceed closing the issue.