salzman-lab/SICILIAN

Index Error when trying to run create_annotator.py

Opened this issue · 4 comments

Hi! I'm trying to generate the annotation files from the hg38 Ensembl GTF file.

I'm running it this way:

python3 create_annotator.py -g Homo_sapiens.GRCh38.97.gtf -a hg38_ensembl

And I'm getting this error message:

sys:1: DtypeWarning: Columns (0) have mixed types.Specify dtype option on import or set low_memory=False.
dumped exon boundary annotator to /share/ScratchGeneral/gabrod/GTF/hg38_ensembl_exon_bounds.pkl
Traceback (most recent call last):
  File "/home/gabrod/bin/SICILIAN/scripts/create_annotator.py", line 74, in <module>
    main()
  File "/home/gabrod/bin/SICILIAN/scripts/create_annotator.py", line 67, in main
    splices = get_splices(gtf_df)
  File "/home/gabrod/bin/SICILIAN/scripts/create_annotator.py", line 55, in get_splices
    splices[name1].add(tuple(sorted([group2[group2["exon_number"] == i].iloc[0]["end"],group2[group2["exon_number"] == i + 1].iloc[0]["start"]])))
  File "/share/ClusterShare/biodata/contrib/gabrod/anaconda2/envs/py39/lib/python3.9/site-packages/pandas/core/indexing.py", line 879, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "/share/ClusterShare/biodata/contrib/gabrod/anaconda2/envs/py39/lib/python3.9/site-packages/pandas/core/indexing.py", line 1496, in _getitem_axis
    self._validate_integer(key, axis)
  File "/share/ClusterShare/biodata/contrib/gabrod/anaconda2/envs/py39/lib/python3.9/site-packages/pandas/core/indexing.py", line 1437, in _validate_integer
    raise IndexError("single positional indexer is out-of-bounds")
IndexError: single positional indexer is out-of-bounds

I'm guessing that the gtf_df passed to splices = get_splices(gtf_df) is not in the expected format, and that is why I'm getting the IndexError. How can I fix this?
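For reference, the exception in the traceback is what pandas raises when `.iloc[0]` is called on an empty selection. The sketch below reproduces it in isolation. It assumes, without confirmation from the SICILIAN source, that `exon_number` ends up as strings while the lookup compares against an integer; the toy DataFrame and the helper `first_row_or_none` are hypothetical and not part of create_annotator.py.

```python
import pandas as pd

# Toy stand-in for one transcript's exon rows. exon_number is kept as strings,
# as it would be if read straight from quoted GTF attributes (an assumption).
group = pd.DataFrame({
    "exon_number": ["1", "2", "3"],
    "start": [11869, 12613, 13221],
    "end": [12227, 12721, 14409],
})

def first_row_or_none(df, exon_number):
    """Hypothetical helper: first row matching exon_number, or None if none match."""
    hits = df[df["exon_number"] == exon_number]
    return None if hits.empty else hits.iloc[0]

# Comparing string values with the integer 1 matches no rows.
empty = group[group["exon_number"] == 1]

try:
    empty.iloc[0]
    error_message = None
except IndexError as exc:
    error_message = str(exc)

print(error_message)

# After casting the column to int, the same lookup finds the row.
group["exon_number"] = group["exon_number"].astype(int)
row = first_row_or_none(group, 1)
print(row["end"])
```

A missing or non-consecutive exon number within a transcript would leave the selection empty in the same way, so a dtype mismatch is only one possible explanation.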

Could you share a link to the gtf file you're using (or the first five lines or so of the gtf)?

Hi! Sorry for my late reply.

Here are the first lines of the GTF:

#!genome-build GRCh38.p12
#!genome-version GRCh38
#!genome-date 2013-12
#!genome-build-accession NCBI:GCA_000001405.27
#!genebuild-last-updated 2019-03
1 havana gene 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene";
1 havana transcript 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "lncRNA"; tag "basic"; transcript_support_level "1";
1 havana exon 11869 12227 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "1"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "lncRNA"; exon_id "ENSE00002234944"; exon_version "1"; tag "basic"; transcript_support_level "1";
1 havana exon 12613 12721 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "lncRNA"; exon_id "ENSE00003582793"; exon_version "1"; tag "basic"; transcript_support_level "1";
1 havana exon 13221 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "3"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "lncRNA"; exon_id "ENSE00002312635"; exon_version "1"; tag "basic"; transcript_support_level "1";

That GTF file was downloaded from Ensembl. I was able to run create_annotator.py using the GTF downloaded from UCSC.
Thanks!
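To make the Ensembl attribute column easier to inspect, here is a minimal sketch of parsing it with the standard library. It assumes tab-separated fields (the excerpt above shows spaces, probably from pasting) and uses shortened copies of two records from that excerpt; `parse_attributes` is a hypothetical helper, not a function from SICILIAN.

```python
import re

# Matches one `key "value";` pair in the ninth GTF column.
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')

def parse_attributes(attr_field):
    """Turn 'key "value"; key "value";' into a dict of strings."""
    return dict(ATTR_RE.findall(attr_field))

# Shortened copies of two records from the excerpt, rebuilt with tabs.
exon_line = "\t".join([
    "1", "havana", "exon", "11869", "12227", ".", "+", ".",
    'gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "1";',
])
gene_line = "\t".join([
    "1", "havana", "gene", "11869", "14409", ".", "+", ".",
    'gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1";',
])

exon_attrs = parse_attributes(exon_line.split("\t")[8])
gene_attrs = parse_attributes(gene_line.split("\t")[8])

print(repr(exon_attrs["exon_number"]))  # the value is a string, not an int
print(gene_attrs.get("exon_number"))    # gene rows carry no exon_number
```

Two things stand out in this format: exon_number is quoted text, and gene and transcript rows have no exon_number at all. Whether either matters for create_annotator.py would need checking against its parsing code.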

Hi, have you solved this problem? I get the same error with my GTF file. Thanks a lot!

Here are the first five lines of my GTF file:

2 Gnomon gene 6661 7133 . - . gene_id "Dpse_Denovo_1.1"; gene "LOC117183278"; gene_type "lncRNA";
2 Gnomon transcript 6661 7133 . - . gene_id "Dpse_Denovo_1.1"; transcript_id "Dpse_Denovo_1.1.1"; gene "LOC117183278";
2 Gnomon exon 6953 7133 . - . gene_id "Dpse_Denovo_1.1"; transcript_id "Dpse_Denovo_1.1.1"; gene "LOC117183278"; exon_number "1";
2 Gnomon exon 6841 6904 . - . gene_id "Dpse_Denovo_1.1"; transcript_id "Dpse_Denovo_1.1.1"; gene "LOC117183278"; exon_number "2";
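A quick way to check a GTF for the kind of gap that would make a lookup of exon i + 1 come back empty is to confirm that every transcript's exon numbers run exactly 1..n. This is a diagnostic sketch only, assuming tab-separated fields and integer-valued exon_number attributes; the function names are hypothetical, and the sample lines are the four records shown above.

```python
import re
from collections import defaultdict

GTF_LINES = [
    '2\tGnomon\tgene\t6661\t7133\t.\t-\t.\tgene_id "Dpse_Denovo_1.1"; gene "LOC117183278"; gene_type "lncRNA";',
    '2\tGnomon\ttranscript\t6661\t7133\t.\t-\t.\tgene_id "Dpse_Denovo_1.1"; transcript_id "Dpse_Denovo_1.1.1"; gene "LOC117183278";',
    '2\tGnomon\texon\t6953\t7133\t.\t-\t.\tgene_id "Dpse_Denovo_1.1"; transcript_id "Dpse_Denovo_1.1.1"; gene "LOC117183278"; exon_number "1";',
    '2\tGnomon\texon\t6841\t6904\t.\t-\t.\tgene_id "Dpse_Denovo_1.1"; transcript_id "Dpse_Denovo_1.1.1"; gene "LOC117183278"; exon_number "2";',
]

def exon_numbers_by_transcript(lines):
    """Collect integer exon_number values per transcript_id from exon rows."""
    numbers = defaultdict(list)
    for line in lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        attrs = dict(re.findall(r'(\S+) "([^"]*)";', fields[8]))
        # Exon rows lacking either attribute are skipped here, not reported.
        if "transcript_id" in attrs and "exon_number" in attrs:
            numbers[attrs["transcript_id"]].append(int(attrs["exon_number"]))
    return numbers

def transcripts_with_gaps(numbers):
    """Return transcripts whose exon numbers are not exactly 1..n."""
    return {
        tid: sorted(nums)
        for tid, nums in numbers.items()
        if sorted(nums) != list(range(1, len(nums) + 1))
    }

numbers = exon_numbers_by_transcript(GTF_LINES)
print(dict(numbers))
print(transcripts_with_gaps(numbers))
```

To run it on a full file, pass an open file handle instead of GTF_LINES. The four lines shown here pass the check, so if this file triggers the error, the problematic records are likely further down or the cause is something else.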