draeger-lab/ModelPolisher

Annotating models with the genome identifier


@Midnighter requests at SBRG/bigg_models#368:

Many models in BiGG are currently annotated with a taxonomic identifier and a reference to the model itself, for example, as shown below.

    <annotation>
      <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:bqmodel="http://biomodels.net/model-qualifiers/" xmlns:bqbiol="http://biomodels.net/biology-qualifiers/">
        <rdf:Description rdf:about="#iML1515">
          <bqbiol:hasTaxon>
            <rdf:Bag>
              <rdf:li rdf:resource="http://identifiers.org/taxonomy/511145" />
            </rdf:Bag>
          </bqbiol:hasTaxon>
          <bqmodel:is>
            <rdf:Bag>
              <rdf:li rdf:resource="http://identifiers.org/bigg.model/iML1515" />
            </rdf:Bag>
          </bqmodel:is>
        </rdf:Description>
      </rdf:RDF>
    </annotation>

On the website, BiGG also provides a link to the genome sequence that was used to create the model, see, for example, http://bigg.ucsd.edu/models/iML1515.

Where possible, it would be great to also create MIRIAM compliant annotations of the genome on the model using the identifier from the genome database or RefSeq namespaces as defined at Identifiers.org.

Is this a task for ModelPolisher?

There is an ncbi_assembly id column in the genome table of the BiGG DB; however, it appears to be empty.

Additionally, there are accession_type and accession_value columns, where accession_type is currently one of ncbi_accession or ncbi_assembly.
BiGG resolves the Genome link to a list of models and chromosomes, where the ncbi_accessions can be directly resolved as ids appended to https://www.ncbi.nlm.nih.gov/nuccore/. The ncbi_assembly entries are resolved to a list of chromosomes; however, I don't know exactly how this is done.

From what I've gathered from BiGG, neither these accessions nor the taxon ids appear anywhere else, so retrieving RefSeq annotations would likely require fetching the corresponding entry from GenBank.

Had another look at the data BiGG provides and this is actually easy to do, albeit with some issues regarding the MIRIAM compliance. Must have been half asleep when looking at the issue last time...

All accessions starting with NC_ or NZ_ can be converted to MIRIAM-compliant URIs.
All GCF_ entries should fit the pattern of the genome assembly database; there just seems to be a problem with resolution.
If used as the id in https://identifiers.org/insdc.gca:{$id}, an entry is resolved to https://www.ebi.ac.uk/ena/data/view/{$id}, where nothing is available for that id.
Using the NCBI resource, however, the id resolves correctly, so for now we could create a non-MIRIAM annotation this way.
It might be worth inquiring about that issue: if the resources of one namespace have different resolution capabilities, that contradicts my understanding of how the resolution process works.
All other accessions could be added as non-MIRIAM annotations the way the BiGG Models website does it, i.e. https://www.ncbi.nlm.nih.gov/nuccore/{$id}.
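
The three cases above can be sketched as a small lookup. This is only an illustration in Python, not the ModelPolisher code: the function name `genome_uri`, the `https://identifiers.org/refseq:{id}` form for NC_/NZ_ accessions, and the NCBI assembly URL used for GCF_ ids are assumptions; only the nuccore URL is taken from the BiGG website.

```python
# Hypothetical URL templates; only NCBI_NUCCORE appears in the discussion above.
IDENTIFIERS_REFSEQ = "https://identifiers.org/refseq:{}"    # assumed MIRIAM form
NCBI_ASSEMBLY = "https://www.ncbi.nlm.nih.gov/assembly/{}"  # assumed NCBI resource
NCBI_NUCCORE = "https://www.ncbi.nlm.nih.gov/nuccore/{}"


def genome_uri(accession):
    """Map a BiGG accession_value to (uri, is_miriam_compliant)."""
    accession = accession.strip()
    if accession.startswith(("NC_", "NZ_")):
        # RefSeq accessions: resolvable through Identifiers.org.
        return IDENTIFIERS_REFSEQ.format(accession), True
    if accession.startswith("GCF_"):
        # Assembly ids: link straight to NCBI until the resolution
        # problem with insdc.gca is sorted out.
        return NCBI_ASSEMBLY.format(accession), False
    # Everything else: same link the BiGG Models website uses.
    return NCBI_NUCCORE.format(accession), False


if __name__ == "__main__":
    for acc in ("NC_000913.3", "GCF_000005845.2", "AE004091.2"):
        print(acc, *genome_uri(acc))
```

Returning the compliance flag alongside the URI leaves the decision of whether to keep non-MIRIAM links to the caller.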

Do we want to add just the MIRIAM compliant annotations or all of them?

Edit: Just realized we have an INCLUDE_ANY_URI flag we could use here.
What is the appropriate qualifier for these annotations, BQB_IS_VERSION_OF?
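
To make the qualifier question concrete, here is a rough Python sketch of how the annotation would serialize if BQB_IS_VERSION_OF (i.e. `bqbiol:isVersionOf`) were chosen. The helper `genome_annotation`, the example accession NC_000913.3, and the `identifiers.org/refseq` URI form are assumptions for illustration only; the qualifier is a parameter precisely because it is still undecided.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
BQBIOL = "http://biomodels.net/biology-qualifiers/"

# Keep the familiar prefixes when serializing.
ET.register_namespace("rdf", RDF)
ET.register_namespace("bqbiol", BQBIOL)


def genome_annotation(model_id, genome_uri, qualifier="isVersionOf"):
    """Build an rdf:Description linking the model to a genome URI.

    `qualifier` is the local name of a bqbiol qualifier; the default
    is only a placeholder for the open question in this thread.
    """
    description = ET.Element(
        f"{{{RDF}}}Description", {f"{{{RDF}}}about": f"#{model_id}"}
    )
    qualifier_el = ET.SubElement(description, f"{{{BQBIOL}}}{qualifier}")
    bag = ET.SubElement(qualifier_el, f"{{{RDF}}}Bag")
    ET.SubElement(bag, f"{{{RDF}}}li", {f"{{{RDF}}}resource": genome_uri})
    return description


if __name__ == "__main__":
    element = genome_annotation(
        "iML1515", "https://identifiers.org/refseq:NC_000913.3"
    )
    print(ET.tostring(element, encoding="unicode"))
```

Swapping the qualifier only changes the element wrapping the `rdf:Bag`, so the choice can be revisited without touching how the URIs are generated.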

Implemented as described above in the 2.1 branch.
Leaving open to discuss the correct qualifier.