malariagen/malariagen-data-python

Different nomenclature in genome_features between Ag and Af

Closed this issue · 4 comments

Genome features for Ag come from VectorBase and those for Af come from VEuPathDB, and the two databases use slightly different nomenclatures for genome features. In particular, VectorBase has a feature type called 'gene' while VEuPathDB doesn't: it uses 'protein_coding_gene' instead. This is significant because (among other things) the function _gene_cnv (which is part of anopheles.py and thus shared between Ag and Af) looks for 'gene' features and can't find any in the Af genome feature data frame. Hence, it fails completely for Af.

Thanks Jon. Let me know if you'd like to have a go at fixing.

I'm going to give it a try.

Cool thanks.

Btw there is already an attribute available, ._gff_gene_type, which is set to the correct value for Ag3 ("gene") and Af1 ("protein_coding_gene").

So within the _gene_cnv() method in the AnophelesDataResource class, it should be possible to replace:

df_genes = df_genome_features.query("type == 'gene'")

...with something like:

df_genes = df_genome_features.query(f"type == '{self._gff_gene_type}'")
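To illustrate the difference, here is a minimal sketch using toy data frames in place of the real genome feature tables. The frames, their contents, and the select_genes helper are hypothetical and exist only for this example; only the 'type' column and the two feature type values come from the discussion above.

```python
import pandas as pd

# Toy stand-ins for the genome feature data frames (assumption: the real
# tables have many more columns; only "type" matters for this example).
df_ag = pd.DataFrame(
    {"ID": ["gene_a", "tx_a", "gene_b"], "type": ["gene", "mRNA", "gene"]}
)
df_af = pd.DataFrame(
    {
        "ID": ["gene_a", "tx_a", "gene_b"],
        "type": ["protein_coding_gene", "mRNA", "protein_coding_gene"],
    }
)


def select_genes(df_genome_features, gff_gene_type):
    """Hypothetical helper: keep only rows whose type matches gff_gene_type."""
    return df_genome_features.query(f"type == '{gff_gene_type}'")


# The hard-coded query finds nothing in the Af-style table.
hardcoded_af = df_af.query("type == 'gene'")

# Passing the species-specific feature type finds the genes in both tables.
genes_ag = select_genes(df_ag, "gene")
genes_af = select_genes(df_af, "protein_coding_gene")

print(len(hardcoded_af), len(genes_ag), len(genes_af))
```

In the library itself the second argument would presumably come from self._gff_gene_type rather than being passed in by hand.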

My long term solution was to create such an attribute. Glad it already exists!