A Perl Script to sort gff3 files and produce suitable results for tabix tools
gff3sort.pl [input GFF3 file] >output.sort.gff3
Optional Parameters:
--precise Run in precise mode, about 2X~3X slower than the default mode.
Only needed to be used if your original GFF3 files have parent
features appearing behind their children features.
--chr_order Select how the chromosome IDs should be sorted.
Acceptable values are: alphabet, natural, original
[Default: alphabet]
--extract_FASTA If the input GFF3 file contains FASTA sequence at the end, use this
option to extract the FASTA sequence and place in a separate file
with the extention '.fasta'. By default, the FASTA sequences would be
discarded.
Zhu T, Liang C, Meng Z, Guo S, Zhang R: GFF3sort: A novel tool to sort GFF3 files for tabix indexing. BMC Bioinformatics 2017, 18:482, https://doi.org/10.1186/s12859-017-1930-3
The tabix tool from htslib requires files sorted by their chromosomes and positions. For GFF3 files, they would be sorted by column 1 (chromosomes) and 4 (start positions) as:
sort -k1,1 -k4,4n myfile.gff > myfile.sorted.gff
(OR)
gt gff3 -sortlines -tidy -retainids myfile.gff > myfile.sorted.gff
Then, the sorted GFF3 file could be indexed by:
bgzip myfile.sorted.gff
tabix -p gff myfile.sorted.gff.gz
However, either the GNU sort or the gt tool has a bug: Lines with the same chromosomes and start positions would be placed randomly. Therefore, parent feature lines might sometimes be placed after their children lines. For example, the following features:
##gff-version 3
###
A01 Cufflinks mRNA 473 6154 . - . ID=XLOC_001154.41;description=Novel: Intergenic transcript
A01 Cufflinks exon 473 814 . - . Parent=XLOC_001154.41
A01 Cufflinks exon 1626 2574 . - . Parent=XLOC_001154.41
A01 Cufflinks exon 2695 2721 . - . Parent=XLOC_001154.41
A01 Cufflinks exon 3637 3726 . - . Parent=XLOC_001154.41
A01 Cufflinks exon 5329 5408 . - . Parent=XLOC_001154.41
A01 Cufflinks exon 5994 6154 . - . Parent=XLOC_001154.41
###
A01 Cufflinks mRNA 473 6386 . - . ID=XLOC_001154.42;description=Novel: Intergenic transcript
A01 Cufflinks exon 473 2024 . - . Parent=XLOC_001154.42
A01 Cufflinks exon 2615 2721 . - . Parent=XLOC_001154.42
A01 Cufflinks exon 3637 3726 . - . Parent=XLOC_001154.42
A01 Cufflinks exon 5329 6386 . - . Parent=XLOC_001154.42
would be sorted as:
##gff-version 3
##sequence-region A01 473 6386
A01 Cufflinks exon 473 814 . - . Parent=XLOC_001154.41
A01 Cufflinks exon 473 2024 . - . Parent=XLOC_001154.42
A01 Cufflinks mRNA 473 6154 . - . ID=XLOC_001154.41;description=Novel: Intergenic transcript
A01 Cufflinks mRNA 473 6386 . - . ID=XLOC_001154.42;description=Novel: Intergenic transcript
A01 Cufflinks exon 1626 2574 . - . Parent=XLOC_001154.41
A01 Cufflinks exon 2615 2721 . - . Parent=XLOC_001154.42
A01 Cufflinks exon 2695 2721 . - . Parent=XLOC_001154.41
A01 Cufflinks exon 3637 3726 . - . Parent=XLOC_001154.42
A01 Cufflinks exon 3637 3726 . - . Parent=XLOC_001154.41
A01 Cufflinks exon 5329 5408 . - . Parent=XLOC_001154.41
A01 Cufflinks exon 5329 6386 . - . Parent=XLOC_001154.42
A01 Cufflinks exon 5994 6154 . - . Parent=XLOC_001154.41
###
That is, the two mRNA
lines start with pos 473
would be "randomly" placed after the two exon
lines which also start with pos 473
. These would encount bugs such as GMOD/jbrowse#780
This script would adjust lines with the same start positions. It would move lines with "Parent="
attributes (case insensitive) behind lines without "Parent="
attributes. The result would be:
A01 Cufflinks mRNA 473 6386 . - . ID=XLOC_001154.42;description=Novel: Intergenic transcript
A01 Cufflinks mRNA 473 6154 . - . ID=XLOC_001154.41;description=Novel: Intergenic transcript
A01 Cufflinks exon 473 814 . - . Parent=XLOC_001154.41
A01 Cufflinks exon 473 2024 . - . Parent=XLOC_001154.42
A01 Cufflinks exon 1626 2574 . - . Parent=XLOC_001154.41
A01 Cufflinks exon 2615 2721 . - . Parent=XLOC_001154.42
A01 Cufflinks exon 2695 2721 . - . Parent=XLOC_001154.41
A01 Cufflinks exon 3637 3726 . - . Parent=XLOC_001154.41
A01 Cufflinks exon 3637 3726 . - . Parent=XLOC_001154.42
A01 Cufflinks exon 5329 5408 . - . Parent=XLOC_001154.41
A01 Cufflinks exon 5329 6386 . - . Parent=XLOC_001154.42
A01 Cufflinks exon 5994 6154 . - . Parent=XLOC_001154.41