NAL-i5K/GFF3toolkit

SeqID does not end with a number.

tbazilegith opened this issue · 6 comments

Hello,
I ran gff3_sort using the command below and got the error that follows:
gff3_sort --gff_file mysample_results20220802/annot.gff --output_gff mysample_sort.gff3

ERROR [SeqID] SeqID does not end with a number.

  • Line 6: 1 Local region 1 3396752 . + . ID=1:1..3396752;Dbxref=taxon:1386;Is_circular=true;Name=ANONYMOUS;gbkey=Src;genome=chromosome;mol_type=genomic DNA;strain=replaceme
    Adding argument -r like " gff3_sort -g example_file/example.gff3 -og example-sorted.gff3 -r " can handle this situation.

I went ahead and added the -r flag:
gff3_sort --gff_file mysample_results20220802/annot.gff --output_gff mysample_sort.gff3 -r
But then I got this:

Traceback (most recent call last):
  File "/apps/gff3toolkit/2.0.3/bin/gff3_sort", line 8, in <module>
    sys.exit(script_main())
  File "/apps/gff3toolkit/2.0.3/lib/python3.9/site-packages/gff3tool/bin/gff3_sort.py", line 437, in script_main
    main(args.gff_file, output=args.output_gff, isoform_sort=args.isoform_sort, sorting_order=sorting_order, logger=logger_stderr, reference=args.reference)
  File "/apps/gff3toolkit/2.0.3/lib/python3.9/site-packages/gff3tool/bin/gff3_sort.py", line 223, in main
    sequence_regions[sequence_region['seqid']] = (sequence_region['start'], sequence_region['end'])
KeyError: 'end'

It seems to me that "Line 6" above must be skipped in the file annot.gff.

Any thoughts on that?
Thanks,
TJ

@tbazilegith this error looks similar to the one reported in #125. Can you post some examples of the sequence directive lines? Do they all have a number as the end coordinate?

Hello MPoelchau,
Here is what I have
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
##sequence-region 1 3396752
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=1386
1 Local region 1 3396752 . + . ID=1:1..3396752;Dbxref=taxon:1386;Is_circular=true;Name=ANONYMOUS;gbkey=Src;genome=chromosome;mol_type=genomic DNA;strain=replaceme
1 . pseudogene 1 144 . - . ID=gene-tmp_000001;Name=tmp_000001;gbkey=Gene;gene_biotype=pseudogene;locus_tag=tmp_000001;pseudo=true

Thanks,
TJ


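@tbazilegith it looks like the sequence-region directive is missing a '1' (representing either the chromosome or the start coordinate). The format is ##sequence-region seqid start end, but your header has only two values after the directive, so no end coordinate is found; that is probably where the KeyError: 'end' comes from. It should instead be
##sequence-region 1 1 3396752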
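
If there are many such lines, a small script can find and patch them before running gff3_sort. The following is only a sketch, not part of GFF3toolkit: the function names check_sequence_regions and fix_missing_start are made up for illustration, and the fix assumes that a directive with only two values is "seqid end" with the start coordinate 1 left out. Check that this assumption holds for your file before relying on it.

```python
# Hypothetical helpers for illustration; these are not part of GFF3toolkit.
DIRECTIVE = "##sequence-region"


def check_sequence_regions(lines):
    """Return (line_number, text) for each malformed ##sequence-region line."""
    problems = []
    for lineno, line in enumerate(lines, start=1):
        if not line.startswith(DIRECTIVE):
            continue
        fields = line.split()[1:]  # expected: seqid, start, end
        ok = len(fields) == 3 and fields[1].isdigit() and fields[2].isdigit()
        if not ok:
            problems.append((lineno, line.rstrip("\n")))
    return problems


def fix_missing_start(line):
    """Insert start=1 into a two-value directive.

    Assumption: the two values are 'seqid end' and the missing start is 1.
    Any other line is returned unchanged.
    """
    fields = line.split()
    if len(fields) == 3 and fields[0] == DIRECTIVE and fields[2].isdigit():
        ending = "\n" if line.endswith("\n") else ""
        return f"{DIRECTIVE} {fields[1]} 1 {fields[2]}{ending}"
    return line
```

To apply it, read annot.gff into a list of lines, print what check_sequence_regions reports, then write out fix_missing_start(line) for every line to a new file and run gff3_sort on that copy so the original stays untouched.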

@tbazilegith just following up, did fixing the sequence region directive work for you?

I'll close this issue but feel free to re-open if that didn't help.