Add command to expand xrefs section in GFF3 files
I am 97% sure this is out of scope for sssom-py and that it should be either its own tool or part of a general GFF package. But this seems like a good place to seed the idea.
GFF allows various kinds of annotations in column 9, many of which are CURIEs. It's often useful to expand these: for example, a gene annotated with an EC number by prokka could be expanded to a GO annotation using a GO SSSOM file.
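To make the idea concrete, here is a minimal sketch of the expansion step using only the Python standard library. It assumes the SSSOM file is a TSV with the standard subject_id / predicate_id / object_id columns, that GO terms are the subjects and EC numbers the objects, and that header lines start with "#". The function names (load_mappings, expand) are hypothetical, and the GO identifier in the sample row is a placeholder, not a real mapping.

```python
import csv
import io

# Tiny in-memory stand-in for an SSSOM TSV file.
# ASSUMPTION: GO:0000000 is a placeholder, not a real GO-to-EC mapping.
SSSOM_TSV = (
    "# illustrative metadata header line\n"
    "subject_id\tpredicate_id\tobject_id\n"
    "GO:0000000\tskos:exactMatch\tEC:3.2.2.21\n"
)


def load_mappings(handle):
    """Index mappings by object_id so EC CURIEs can be looked up directly."""
    data_lines = (line for line in handle if not line.startswith("#"))
    index = {}
    for row in csv.DictReader(data_lines, delimiter="\t"):
        index.setdefault(row["object_id"], []).append(row["subject_id"])
    return index


def expand(curies, index):
    """Return every mapped subject for the given CURIEs (unmapped ones are skipped)."""
    expanded = []
    for curie in curies:
        expanded.extend(index.get(curie, []))
    return expanded


index = load_mappings(io.StringIO(SSSOM_TSV))
go_terms = expand(["EC:3.2.2.21"], index)
```

A real tool would also need to decide which predicates are acceptable for expansion (e.g. only exact matches) and what to do with partial EC numbers such as EC:6.1.1.-, which this sketch simply leaves unmapped.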
Can you link to an example of a GFF file please?
Here are the first few lines of the output of prokka run on a metagenomic sample (downloaded from here in NMDC).
Ga0495479_0000001 GeneMark.hmm-2 v1.05 CDS 18 563 72.59 - 0 ID=Ga0495479_0000001_18_563;translation_table=11;start_type=ATG;product=5-methylcytosine-specific restriction endonuclease McrA;product_source=COG1403;cog=COG1403;pfam=PF14279
Ga0495479_0000001 GeneMark.hmm-2 v1.05 CDS 692 1357 88.17 - 0 ID=Ga0495479_0000001_692_1357;translation_table=11;start_type=ATG;product=phospholipase/carboxylesterase;product_source=KO:K06999;cath_funfam=3.40.50.1820;cog=COG0400;ko=KO:K06999;pfam=PF02230;superfamily=53474
Ga0495479_0000001 GeneMark.hmm-2 v1.05 CDS 1415 2068 95.20 - 0 ID=Ga0495479_0000001_1415_2068;translation_table=11;start_type=ATG;product=DNA-3-methyladenine glycosylase II;product_source=KO:K01247;cath_funfam=1.10.1670.10,1.10.340.30;cog=COG0122;ko=KO:K01247;ec_number=EC:3.2.2.21;pfam=PF00730;smart=SM00478;superfamily=48150
Ga0495479_0000001 GeneMark.hmm-2 v1.05 CDS 2223 3116 110.08 + 0 ID=Ga0495479_0000001_2223_3116;translation_table=11;start_type=ATG;product=glutamyl-Q tRNA(Asp) synthetase;product_source=KO:K01894;cath_funfam=3.40.50.620;cog=COG0008;ko=KO:K01894;ec_number=EC:6.1.1.-;pfam=PF00749;superfamily=52374
Ga0495479_0000001 GeneMark.hmm-2 v1.05 CDS 3293 4492 183.16 + 0 ID=Ga0495479_0000001_3293_4492;translation_table=11;start_type=ATG;product=CheY-like chemotaxis protein;product_source=COG0784;cath_funfam=1.10.287.130,3.30.565.10,3.40.50.2300;cog=COG0784;pfam=PF00072,PF00512,PF02518;smart=SM00387,SM00388,SM00448;superfamily=47384,55874
Ga0495479_0000001 GeneMark.hmm-2 v1.05 CDS 4632 6602 342.80 - 0 ID=Ga0495479_0000001_4632_6602;translation_table=11;start_type=ATG;product=(2R)-ethylmalonyl-CoA mutase;product_source=KO:K14447;cath_funfam=3.20.20.240,3.40.50.280;cog=COG1884,COG2185;ko=KO:K14447;pfam=PF01642,PF02310;superfamily=51703,52242;tigrfam=TIGR00640,TIGR00641
Ga0495479_0000001 GeneMark.hmm-2 v1.05 CDS 6630 6881 34.32 - 0 ID=Ga0495479_0000001_6630_6881;translation_table=11;start_type=ATG;product=uncharacterized membrane protein YeaQ/YmgE (transglycosylase-associated protein family);product_source=COG2261;cog=COG2261;pfam=PF04226
Ga0495479_0000001 GeneMark.hmm-2 v1.05 CDS 7044 7304 41.09 - 0 ID=Ga0495479_0000001_7044_7304;translation_table=11;start_type=GTG;product=uncharacterized membrane protein YeaQ/YmgE (transglycosylase-associated protein family);product_source=COG2261;cog=COG2261;pfam=PF04226
GFF doesn't have a particularly formal way of ensuring identifiers are unambiguous. In some flavours of GFF you will see bona fide CURIEs; in others the prefix is only implicit in the key (e.g. cog, pfam, ec_number, ...). See this preprint for recommendations on improving this situation.
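As a rough illustration of what handling the implicit case might look like, the sketch below parses a column 9 string and normalises selected keys to CURIEs. The key-to-prefix table and the function names are assumptions made up for this example (in particular, the choice of "COG" and "PFAM" as prefixes is not taken from any registry), and values that already contain a colon are passed through unchanged.

```python
# ASSUMPTION: hypothetical mapping from attribute key to the prefix it implies.
IMPLICIT_PREFIX = {"cog": "COG", "ko": "KO", "ec_number": "EC", "pfam": "PFAM"}


def parse_attributes(column9):
    """Split a GFF3 attributes column into a key -> raw value dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if not field:
            continue
        key, _, value = field.partition("=")
        attrs[key] = value
    return attrs


def to_curies(attrs):
    """Collect CURIEs from the known keys, adding a prefix where it is only implicit."""
    curies = []
    for key, prefix in IMPLICIT_PREFIX.items():
        # Multi-valued attributes are comma-separated, e.g. pfam=PF00072,PF00512
        for value in attrs.get(key, "").split(","):
            if not value:
                continue
            curies.append(value if ":" in value else f"{prefix}:{value}")
    return curies


# Attributes taken from the third feature in the example above (shortened).
column9 = (
    "ID=Ga0495479_0000001_1415_2068;product=DNA-3-methyladenine glycosylase II;"
    "cog=COG0122;ko=KO:K01247;ec_number=EC:3.2.2.21;pfam=PF00730"
)
curies = to_curies(parse_attributes(column9))
```

Only keys listed in the table are split on commas, so free-text attributes such as product are left alone.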
Now that I look at the prokka file again, I see that it's not even using the recommended Ontology_term attribute, so this is looking more like some kind of bespoke GFF tool that takes multiple idiosyncrasies into account, which is definitely outside sssom-py.