airr-community/common-repo-wg

species ontology implementation in API

schristley opened this issue · 20 comments

Hey @bcorrie, @laserson

In the last CRWG, it was decided to use the NCBI taxonomy for the species; however, we left the technical API discussion to be done outside of the meeting. So I’m creating this issue to kick that off.

The main issue is which of the following we support:

  1. just a taxonomy id field in the API?
  2. just a species text field in the API (presumably with the id hidden)?
  3. both fields in the API?

I don't think (2) really makes sense, so it's between (1) and (3).

We should think about how UIs will handle this. With (1), I have these sorts of questions:

  • How does the UI get the list of ids? Does the API provide them?
  • Where does it get the names that go with the ids?
  • How does the UI allow searching by name?

With (3), some of the same questions arise as with (1), but we now also need to handle query combos:

  • The UI completely ignores the ids and uses just a plain text field.
  • If a query supplies both an id and a text field value, then do what?

My personal preference is currently for (1) but with the added twist that the data returned from the query has both the id and the text field species name.

Strongly in favor of (1). IMO dealing with matching it to a human-friendly name is definitely the responsibility of the person implementing the UI. I don't have much experience working with ontologies, but I'm assuming it's not that difficult to get the info.

In thinking of the implementation for this, I think we need to keep in mind what we will need to do for the "non-trivial" case where we don't have a single taxonomy that is definitive for a field like we do for species. Strain seems to be a good example: we have some good mouse strain taxonomies defined, but we know that researchers will need to extend them (they have a new strain that isn't in an existing taxonomy), and there may be more than one taxonomy (for different species). Also, my understanding is that by extending a taxonomy we mean more than requesting a formal extension to an existing taxonomy; we will need the flexibility to "easily" add strains at a user level (correct me if I am wrong).

Given the need to be able to extend things easily, I would probably lean more towards #2 or #3. For #2 maybe a string or an array of strings with a taxonomy URL and taxonomy ID encoded with the string. For #3, two or three fields, the string and a combo field that identifies the taxonomy (a URL where you can discover the taxonomy) and the taxonomy ID that the string represents???

#2
species: [value:'Homo sapiens', taxonomy:'https://www.ncbi.nlm.nih.gov/taxonomy', taxonomy_id:'9606']

#3
species: "Homo sapiens"
species_taxonomy: "https://www.ncbi.nlm.nih.gov/taxonomy"
species_taxonomy_id: "9606"

#2 might be nice because the same approach could be used for any field that we "taxonomize" 8-)

@laserson I agree that is the simplest and most rigorous, but I also think it is the least flexible and puts a pretty significant burden on the consumer of the data... For someone to use an AIRR TSV file for metadata they would have to perform a significant data transformation (at least I think it is significant - see below) before being able to use the data... Is that what we want?

It seems to me that if we just have a taxonomy ID, then we would need the taxonomy field for the ID, the URL to the taxonomy, and a guarantee that there is a mechanism to look up the taxonomy ID in the taxonomy to do the translation from ID to term (some sort of API for the taxonomy). These are the questions that Scott raised about #1. We would need such a translation for every taxonomy that we use. We should also note that no two taxonomies will have the same mechanism for looking up a taxonomy ID from the taxonomy to do this.

If you think of species and the NCBI taxonomy it is easy, but how do you handle cases where a field can take a value from multiple different ontologies, or from outside any of these ontologies (which, if I am not mistaken, we have to consider, for strain for example)?

I would lean towards something that is a hybrid such that consumers of the data will be able to eyeball the metadata and do something with it.

I just had a quick look at what CEDAR does, and it handles it like this:

"Study Type": {
  "@id": "http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C15197",
  "rdfs:label": "Case-Control Study"
},

This is what it has for species:

  "Organism": {
    "@type": "http://data.bioontology.org/ontologies/NCIT/classes/http%3A%2F%2Fncicb.nci.nih.gov%2Fxml%2Fowl%2FEVS%2FThesaurus.owl%23C70713",
    "@id": "http://purl.bioontology.org/ontology/NCBITAXON/9606",
    "rdfs:label": "Homo sapiens"
  },

This is something along the lines of what I was thinking about. The URL has the lookup for the ontology term and includes the ontology ID. I was also thinking that it would be worth having the ontology ID in a separate field - not sure that is necessary.

will need the flexibility to "easily" add strains at a user level

True, but the API is read-only so extension or adding terms is outside of its scope.

Your (2) is almost correct, it needs to be an object instead of an array though. I would tend to change the field names to be more generic

species: {
  name:'Homo sapiens',
  url:'https://www.ncbi.nlm.nih.gov/taxonomy',
  id:'9606'
}

I don't like your (3) because it breaks the cohesiveness of the "species object".

Queries could then be against species.id. The open question is: do we require queries against species.name? One argument against is that it essentially requires species.name to be stored in the database, and a repository might just prefer to store the id, then populate the name in the return data for a query. I'm not really opposed however, because I think species is going to be highly stable. I don't foresee having to rewrite the database because an id has become more specific and thus all the name fields need to be updated. I don't know how true this will be for other fields, especially ones where we know the ontologies are incomplete.
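To make the trade-off concrete, here is a minimal Python sketch of a repository that stores only the id and populates the name in the returned data. Everything in it is hypothetical (the in-memory `STORED` list, the `SPECIES_NAMES` table, and the `query_species` helper); it is not part of any AIRR API, only an illustration of querying against `species.id` versus `species.name`.

```python
# Hypothetical id -> name table; a real repository would load this from the ontology.
SPECIES_NAMES = {"9606": "Homo sapiens", "10090": "Mus musculus"}

# Hypothetical stored records: only the taxonomy id is persisted.
STORED = [
    {"subject_id": "S1", "species_id": "9606"},
    {"subject_id": "S2", "species_id": "10090"},
]


def to_response(record):
    """Build the returned record with a cohesive species object (id plus name)."""
    sid = record["species_id"]
    return {
        "subject_id": record["subject_id"],
        "species": {"id": sid, "name": SPECIES_NAMES.get(sid)},
    }


def query_species(field, value):
    """Filter on 'species.id' or 'species.name' (illustrative field names)."""
    key = field.split(".", 1)[1]  # "id" or "name"
    return [r for r in map(to_response, STORED) if r["species"][key] == value]
```

Because the name is filled in at response time, a query on `species.name` works here without the name ever being stored, at the cost of building every response before filtering.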

While the taxonomy url is correct in a generic sense, I wonder if we need it. Specifically because AIRR is picking those ontologies, they are going to already be pre-defined for all AIRR-compliant repositories. To avoid all the redundancy in the data, I would rather there be some mechanism that provides all the AIRR ontologies in some sort of singleton.

ontologies: {
  species: 'https://www.ncbi.nlm.nih.gov/taxonomy',
  strain: ['https://this', 'https://that']
}

We could, for example, put this ontology information into the AIRR Schema file. Or alternatively, add custom attributes into the schema for each field

        organism:
            type: string
            description: Binomial designation of subject's species
            x-ontology: https://www.ncbi.nlm.nih.gov/taxonomy
            x-miairr: true
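
If the ontology were attached to each field as a custom attribute, a tool could rebuild the singleton mapping from the schema itself. A minimal Python sketch, assuming the schema has already been parsed into a dict; the `SCHEMA` literal, the `subject_id` entry, and `collect_ontologies` are illustrative names, not part of the AIRR reference library:

```python
# Hypothetical parsed schema fragment mirroring the YAML above.
SCHEMA = {
    "organism": {
        "type": "string",
        "description": "Binomial designation of subject's species",
        "x-ontology": "https://www.ncbi.nlm.nih.gov/taxonomy",
        "x-miairr": True,
    },
    "subject_id": {"type": "string"},  # hypothetical field with no ontology
}


def collect_ontologies(schema):
    """Return {field_name: ontology_url} for fields that declare x-ontology."""
    return {
        name: spec["x-ontology"]
        for name, spec in schema.items()
        if "x-ontology" in spec
    }
```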

will need the flexibility to "easily" add strains at a user level
True, but the API is read-only so extension or adding terms is outside of its scope.

Yes, but the API will have to handle situations where someone is using a term outside of the ontology and there is no ontology ID to use. In this case the URL and ID would be empty and there would only be a text term???

This contributes to the can of worms around queries... As before, the species case is easy, there is always an ontology ID. So we can choose one or the other. What do we do about strain, where the researcher uses a "Brian's mouse strain" strain that isn't in an existing mouse strain ontology? 8-)

It feels to me like we should allow for both searching on, and returning both name and ontology ID.

  1. The service that is performing the query can convert a string to an ontology ID if they store IDs and convert an ontology ID to a string if they store strings. The service then searches on whichever makes sense for its repository implementation. If the service stores IDs only, and it receives a request for a string with no ontology, it can return nothing.

  2. The service can perform the inverse mapping as well, if the repository stores IDs they can generate the relevant string from the ID and if they store strings they can generate the ontology ID. If a service stores strings and can't find the string in the ontology, it returns a non-ontology term (no URL and ID).

  3. The consumer of the query response can choose to either use the ontology ID or the string depending on what is easiest. This minimizes the burden on the consumer.

  4. It is possible for both services and tools that are consuming the service responses to perform correctness testing if so desired to ensure that the string and the ontology ID match. In this case, if a service was strict, it could reject a query that did not have the ID and the string match (or it couldn't find the string provided in the ontology if only a string was provided), and the consumer could flag issues about the data from the service if they don't match...
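
The four points above can be sketched roughly as follows in Python. The lookup tables stand in for a locally available copy of the ontology, and `normalize_term` with its `strict` flag is a hypothetical helper, not an agreed API behavior:

```python
# Hypothetical slice of a locally available ontology (id -> label).
ID_TO_NAME = {"9606": "Homo sapiens", "10090": "Mus musculus"}
NAME_TO_ID = {name: tid for tid, name in ID_TO_NAME.items()}


def normalize_term(term_id=None, name=None, strict=False):
    """Fill in whichever of id/name is missing; optionally reject mismatches."""
    if term_id is not None and name is None:
        name = ID_TO_NAME.get(term_id)
    elif name is not None and term_id is None:
        term_id = NAME_TO_ID.get(name)  # stays None for non-ontology terms
    elif strict and ID_TO_NAME.get(term_id) != name:
        raise ValueError(f"id {term_id!r} does not match name {name!r}")
    return {"id": term_id, "name": name}
```

A term such as "Brian's mouse strain" comes back with a name and no id, which is the non-ontology case described in point 2.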

Yes, but the API will have to handle situations where someone is using a term outside of the ontology and there is no ontology ID to use. In this case the URL and ID would be empty and there would only be a text term???

Yes, sorry, I was too brief in my comment. I agree, and we are on the same page.

It feels to me like we should allow for both searching on, and returning both name and ontology ID.

I agree. Based on the discussion it sounds like this will offer the greatest flexibility for the consumer without greatly burdening the API implementation.

How does the UI get the list of ids? Does the API provide them?

To answer my own question. One mechanism the UI could use is the facets parameter on a query to return the distinct set of values stored in the repository. This isn't all possible values in the ontology, just the distinct subset in the actual repository data. I believe iReceptor actually does this itself internally.

Where does it get the names that go with the ids?

I think, though I'm not positive, that the facets parameter could operate on both the id and the name, thus returning both of them, and the UI can use both as it wishes.
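
As an illustration of that idea only (not of how the facets parameter is actually specified), the distinct id/name pairs present in a repository could be computed like this. The `RECORDS` list and `facet_species` are hypothetical, and the sketch assumes every record carries both a string id and a name:

```python
# Hypothetical repository contents.
RECORDS = [
    {"species": {"id": "9606", "name": "Homo sapiens"}},
    {"species": {"id": "9606", "name": "Homo sapiens"}},
    {"species": {"id": "10090", "name": "Mus musculus"}},
]


def facet_species(records):
    """Return the distinct (id, name) pairs present in the records, sorted."""
    pairs = {(r["species"]["id"], r["species"]["name"]) for r in records}
    return sorted(pairs)
```

The result covers only the values actually stored, not every term in the ontology, which is all a UI needs to populate a search list.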

If a query supplies both an id and a text field value, then do what?

I think we don't do anything special, we treat it like any other query and if the id/value are in conflict, that really is an error on the consumer's part.
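
A small Python sketch of "nothing special": both filters are applied as an ordinary conjunction, so a conflicting id/name pair simply matches no records. Combining filters with AND is an assumption here, and `RECORDS` and `run_query` are hypothetical names:

```python
# Hypothetical repository contents.
RECORDS = [
    {"species": {"id": "9606", "name": "Homo sapiens"}},
    {"species": {"id": "10090", "name": "Mus musculus"}},
]


def run_query(filters):
    """Plain AND over every supplied species filter; no id/name reconciliation."""
    return [
        r for r in RECORDS
        if all(r["species"].get(key) == value for key, value in filters.items())
    ]


consistent = run_query({"id": "9606", "name": "Homo sapiens"})   # one match
conflicting = run_query({"id": "9606", "name": "Mus musculus"})  # no matches
```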

@schristley BTW, no objections to trying to consolidate ontology URLs in a single section, but I do worry about being able to specify how one uses a given URL to look up and return a specific text string in an ontology for a given ID. For example, from a software tool perspective, how do you translate 9606 to "Homo sapiens"? How do you get some interpretable JSON from this lookup to extract "Homo sapiens"? How far do we need to go to specify that?

http://bioportal.bioontology.org/ontologies/NCBITAXON?p=classes&conceptid=9606

Is it possible to use just the ontology URL (e.g., http://purl.bioontology.org/ontology/NCBITAXON/9606) in the string for species? We could in theory allow people to use other types of text there if the ontology doesn't have what they want. Or is that a terrible idea?

Also, this seems like a place where the API and the file format have slightly different needs. The API itself could essentially encode just 9606 or some variant of it (like the URL above) without being "user-friendly". It's an API...it doesn't need to be end-user friendly. It needs to be programmer-friendly. It'll be up to the people that implement the user interfaces to make it user-friendly, including looking up the associated names of the objects etc.

Is it possible to use just the ontology URL (e.g., http://purl.bioontology.org/ontology/NCBITAXON/9606) in the string for species? We could in theory allow people to use other types of text there if the ontology doesn't have what they want. Or is that a terrible idea?

Do you mean, not have the other fields, just have the one field? That's possible but I think it complicates other stuff because the value would need to be interpreted in different contexts.

Also, this seems like a place where the API and the file format have slightly different needs. The API itself could essentially encode just 9606 or some variant of it (like the URL above) without being "user-friendly".

True, which is why my initial preference was for (1) as well. But it cannot handle custom values that aren't in the ontology yet. There is no way to query studies with custom values if the API only accepts an id. I don't expect custom values for species, but I do expect that to happen for other fields.

For example, I found an AIRR-seq study where they designed their own mouse model where they turned off the normal mouse IGH locus, and inserted a "humanized" IGH to be expressed. An outlier for sure, but the study is in NCBI.

BTW, no objections to trying to consolidate ontology URLs in a single section, but I do worry about being able to specify how one uses a given URL to look up and return a specific text string in an ontology for a given ID.

Using only purl would make the usage consistent. If that's too restrictive, we could define a small set of acceptable URL syntaxes that covered what we needed for AIRR.

In my mind, those URLs are for provenance, not for analysis and/or UI. If my analysis tool needed to perform a URL request for each record in order to get a value, that's a no go.

From a UI design perspective, you might be tempted to do URL requests as well, but that could really hurt responsiveness and scalability. You pretty much need to cache the ontology locally in some way.
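
One way to picture the caching point is memoizing the lookup so that each id is resolved at most once. In this Python sketch the `_REMOTE` dict stands in for a remote ontology service (no real endpoint is called), and `lookup_label` and `remote_calls` are hypothetical names:

```python
from functools import lru_cache

# Stand-in for a remote ontology service; a real tool would issue an HTTP request here.
_REMOTE = {"9606": "Homo sapiens"}
remote_calls = 0


@lru_cache(maxsize=None)
def lookup_label(taxon_id):
    """Resolve an id to its label, hitting the 'remote' source only once per id."""
    global remote_calls
    remote_calls += 1
    return _REMOTE.get(taxon_id)


# A thousand records with the same species trigger a single lookup.
labels = [lookup_label("9606") for _ in range(1000)]
```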

Do you mean, not have the other fields, just have the one field? That's possible but I think it complicates other stuff because the value would need to be interpreted in different contexts.

I'm not sure what you mean by it having to be interpreted in different contexts.

But it cannot handle custom values that aren't in the ontology yet.

Where will the values be enforced? In principle it's just a string field, which means that every implementation will have to enforce that it gets a valid purl. One gross way to be flexible is to allow users to put any string there, but ask that they use the correct purl if it applies. Maybe this would lead to too much bad behavior. My worry is that if there is both a url field and a free text field, people will just be lazy and use the free text only.

In my mind, those URLs are for provenance, not for analysis and/or UI. If my analysis tool needed to perform a URL request for each record in order to get a value, that's a no go.

I don't think this is a realistic concern. I would guess that any UI working on this data would pull all the relevant ontology values it needs and cache them.

I'm not sure what you mean by it having to be interpreted in different contexts.

I'm thinking if the field could have name in it, i.e. 'Homo sapiens', or the url,
http://purl.bioontology.org/ontology/NCBITAXON/9606, then any app has to interpret the field to figure out which is which. A UI wanting to display the human-readable name cannot just display the text field, it has to interpret and then potentially do a URL request to get the name. I'm not a fan of this dual-context usage, I'd prefer having them be separate fields. Makes it easier on the consumer.
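
To show what that interpretation step would look like for a consumer, here is a minimal Python sketch of a single field that may hold either a purl or free text. It assumes only the NCBITAXON purl prefix used in this thread; `interpret_species` is a hypothetical helper:

```python
PURL_PREFIX = "http://purl.bioontology.org/ontology/NCBITAXON/"


def interpret_species(value):
    """Classify a single-field species value as an ontology purl or free text."""
    if value.startswith(PURL_PREFIX):
        # The name is unknown at this point and needs a separate lookup.
        return {"kind": "purl", "id": value[len(PURL_PREFIX):], "name": None}
    return {"kind": "text", "id": None, "name": value}
```

Every consumer would have to repeat this branching, and the purl branch still leaves the display name unresolved, which is the burden that separate fields avoid.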

Where will the values be enforced?

That's somewhat outside the scope of the REST API, as we are defining a read-only API. However, enforcement likely needs to occur during data entry, e.g. using CEDAR to record the study metadata, and repositories themselves probably want to perform some enforcement when loading data into their databases.

In terms of AIRR compliance, we could define some validation tests to ensure that repositories are returning data that meets some criteria we define.

My worry is that if there is both a url field and a free text field, people will just be lazy and use the free text only.

I agree but I'm not sure if you are referring to the API or to data entry like in CEDAR? For the API, I'm fine with people being lazy and querying on the text field. We cannot apply any enforcement on data entry because we have no control over that. We can enforce what the API returns however, so we could say something like:

The last one is the hard one though. If some repository loads up data with no id and name 'Human', how do we flag that as incorrect versus 'Homo habilis', which is a valid proposed species but doesn't exist in the NCBI taxonomy?

FYI - Christian has made some changes in the MiAIRR side of the world to be more explicit in the use of ontologies and controlled vocabularies - see here:

airr-community/airr-standards@c47843c

An example:

1 / subject Organism string {"ontology": "NCBITAXON", "top node": "Gnathostomata", "draft": false} Species of subject (using binomial nomenclature)
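
For tools that want to read this annotation, a minimal Python sketch of pulling the JSON descriptor out of such a row. It assumes the row looks exactly like the example above and that the descriptor is a flat (non-nested) JSON object; `ROW` and `extract_descriptor` are illustrative names:

```python
import json

# The example row above, with column separators already flattened to spaces.
ROW = ('1 / subject Organism string '
       '{"ontology": "NCBITAXON", "top node": "Gnathostomata", "draft": false} '
       'Species of subject (using binomial nomenclature)')


def extract_descriptor(row):
    """Pull out and parse the first flat {...} JSON object embedded in a row."""
    start, end = row.index("{"), row.index("}") + 1
    return json.loads(row[start:end])


descriptor = extract_descriptor(ROW)
```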

Yeah, I was looking at this the other day, and thinking how that should be represented in the schema. Seems we need to resolve this sooner rather than later because the schema and TSV have to match for his PR to be merged.

@schristley: I have now also updated the ontology/CV JSON descriptions in specs/miairr.yaml. However, I also just realized that our test scripts currently do not include this file in the validity checks.

I do not want to forestall any decisions on ontologies with this PR, thus NCIT and CL usage are marked as draft. The use of NCBITAXON for organism should be non-controversial as the strings should be identical no matter which ontology we use (i.e. can be changed later on).

The current description is what is used by CAIRR, which treats ontologies as large controlled vocabularies (i.e. it uses the strings not the IDs).

Thanks @bussec, we do need to incorporate miairr.yaml more directly, e.g. that attribute data should be available through the reference library somehow, and yes we should add some validity checks. We can work on this over time.

This has been implemented in the repertoire metadata and ADC API