monarch-initiative/biolink-api

Add general api endpoint for mapping taxon labels to ids?

Opened this issue · 6 comments

In a similar fashion to #386, the associations endpoints (for example, this one) return taxon facets as labels, but accept taxon facets as ids.

Instead of adding a _taxon_map facet to all these different endpoints like in biolink/ontobio#618, perhaps it's better to create a generic endpoint that allows me to map an array of taxon labels to taxon ids in a single query. (I'm sure there's an existing endpoint to map a single label to id, but I'd need to hit that possibly dozens of times at once).

I suppose this would also allow us to undo the work in #386 for cleanliness, if that's what you want to do.

I suppose it'd be the opposite of the /ontol/labeler/ endpoint, which for a query like https://api.monarchinitiative.org/api/ontol/labeler/?id=NCBITaxon%3A10090&id=ECO%3A0000304&id=NCBITaxon%3A10090 returns this:

{
  "NCBITaxon:10090": "Mus musculus",
  "ECO:0000304": "author statement supported by traceable reference used in manual assertion"
}
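The query above passes one `id` parameter per identifier. A minimal Python sketch of building that URL with the standard library follows; the helper name `labeler_url` is hypothetical and not part of biolink-api.

```python
from urllib.parse import urlencode

BASE = "https://api.monarchinitiative.org/api/ontol/labeler/"


def labeler_url(ids):
    """Build a GET URL with one `id` parameter per identifier (hypothetical helper)."""
    # doseq=True expands the list into repeated id=... pairs and
    # percent-encodes the colon in each CURIE as %3A.
    return BASE + "?" + urlencode({"id": ids}, doseq=True)


url = labeler_url(["NCBITaxon:10090", "ECO:0000304", "NCBITaxon:10090"])
print(url)
```

Because the response is a JSON object keyed by ID, the repeated `NCBITaxon:10090` appears only once in the result shown above.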

That endpoint is currently GET-only, but I could also have a POST endpoint so that you can submit a larger amount of text to it. I assume the POST version would take a list of strings and return a label->ID dict. Maybe we could call it /ontol/identifier/ to suggest that it returns IDs for labels...?
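As a rough illustration of what a client call to the proposed POST endpoint might look like: the sketch below only constructs the request and does not send it. The `/ontol/identifier/` path is just the name floated in this comment, and the body shape (a bare JSON array of labels) is an assumption, not a settled design.

```python
import json
from urllib.request import Request

# Hypothetical URL: /ontol/identifier/ is only the name proposed in this thread.
IDENTIFIER_URL = "https://api.monarchinitiative.org/api/ontol/identifier/"


def build_identifier_request(labels):
    """Build (but do not send) a POST request carrying a JSON array of labels."""
    body = json.dumps(labels).encode("utf-8")
    return Request(
        IDENTIFIER_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_identifier_request(["Mus musculus"])
print(req.get_method(), req.full_url)
```

Putting the labels in the body rather than the query string sidesteps URL length limits when many labels are submitted at once.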

Also, sure, I can remove the _taxon_map subkey, but if it's something that you're going to query for anyway we pay a lower cost inlining it in the response than we would dealing with a second query.

Right, it seems like it would be the opposite of that endpoint. And yeah I'd probably need a POST endpoint, because the amount of text could be quite long. So far I've seen cases where I'd need to submit a dozen taxon labels, but I don't know if that's the true max.

I assume the POST version would take a list of strings and return a label->ID dict

The mapping could go the other way too, as long as all the info is there.

Maybe we could call it /ontol/identifier/ to suggest that it returns IDs for labels...?

I like the clarity of that name. Another option could be to just keep the same /ontol/labeler name, and have the POST version go in the opposite direction, with a defined labels: Array<string> input param in the body. Just throwing out another option here, not sure which one is better/cleaner/clearer.

I can remove the _taxon_map subkey

Regarding this, that's your call. But I'd say that if you implement this new endpoint, I'd use it everywhere. I wouldn't want to have extra code that uses _taxon_map in one place, and /ontol/labeler in another (for the same exact purpose).

I prefer to have a separate endpoint, even though I agree that they're pretty complementary functions, so we're not changing too much of the existing interface.

On a side note, while looking at the /ontol/labeler/ code, there are a few issues:

  1. the endpoint simply throws a 500 if you give it an identifier with no match
  2. even if point 1 were resolved, the code would still omit IDs with no matching label from the response, which IMHO is more confusing than returning null for non-matches

I ask because it raises a few questions for the new endpoint:

  1. Would you prefer to get null back for labels that have no match, or would you prefer them to just be absent?
  2. Do you want to do full matches or partial matches to the label field?
  3. Do you want to return multiple IDs that match the label or just the first one? I'm not familiar with the contents of the database, so I'm not sure if that's actually something that might happen, but I imagine it would if you opted for partial matches for point 2.
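To make the three choices concrete, here is a small in-memory sketch. The function name `lookup`, its flags, and the `EXAMPLE:*` entries are placeholders invented for illustration; the real endpoint would query the backend rather than a dict.

```python
from typing import Dict, List, Optional

# Placeholder data for illustration only; the "EXAMPLE:*" IDs and labels are made up.
INDEX: Dict[str, List[str]] = {
    "Mus musculus": ["NCBITaxon:10090"],
    "example taxon a": ["EXAMPLE:1"],
    "example taxon b": ["EXAMPLE:2"],
}


def lookup(
    labels: List[str],
    index: Dict[str, List[str]],
    partial: bool = False,
    keep_missing: bool = True,
) -> Dict[str, Optional[List[str]]]:
    """Map each label to all matching IDs.

    partial=False      -> the label must equal an indexed label exactly
    partial=True       -> the label may be a substring of an indexed label
    keep_missing=True  -> unmatched labels map to None
    keep_missing=False -> unmatched labels are left out of the result
    """
    result: Dict[str, Optional[List[str]]] = {}
    for label in labels:
        if partial:
            ids = [i for known, known_ids in index.items() if label in known for i in known_ids]
        else:
            ids = list(index.get(label, []))
        if ids:
            result[label] = ids
        elif keep_missing:
            result[label] = None
    return result


print(lookup(["Mus musculus", "no such label"], INDEX))
print(lookup(["example taxon"], INDEX, partial=True))
```

With partial matching, one label can pick up several IDs, which is why questions 2 and 3 are linked.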

FWIW, the _taxon_map code isn't a duplicate of this new endpoint -- it adds a pivot facet to the single query to the solr backend, whereas the /ontol/* endpoints issue one query to their backend per ID or label. While it's still more efficient for the backend to talk to the databases directly over a series of queries (vs. having each query be in an API request), it's even more efficient for the /search/entity/{term} endpoint to get the taxon ID/label map in a single database query. Anyway, long story short, if you're going to do the /ontol/identifier/ query anyway I can drop the _taxon_map subkey, but if you can use that subkey for that specific query instead of hitting /ontol/identifier/, that would be a better choice. (I imagine /search/entity/{term} will be heavily used, so perhaps it makes sense to make an exception in that specific case.)

Regarding your 3 questions: I don't think it matters to me whether it's null or absent. I believe what I'd need here is exact (full?) matches. I'm not sure whether there would be multiple matches in the case of taxa either; I don't know the database. Other people might want more flexibility though, so maybe it should return all of the IDs that match, and I can easily just pick the first one.
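If the endpoint did return every matching ID per label, reducing that to a single ID on the client is a one-liner. A sketch, assuming a response shaped as label -> list of IDs (or null); the function name `first_ids` is hypothetical:

```python
from typing import Dict, List, Optional


def first_ids(matches: Dict[str, Optional[List[str]]]) -> Dict[str, Optional[str]]:
    """Keep only the first ID per label; labels with no match map to None."""
    return {label: (ids[0] if ids else None) for label, ids in matches.items()}


picked = first_ids({"Mus musculus": ["NCBITaxon:10090"], "no such label": None})
print(picked)
```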

Maybe we need someone from TISLab to weigh in, but I'm pretty sure if you give me an exact (but perhaps case insensitive) match, the results will be as expected.

Regarding the _taxon_map: unless it's massively slow to use /ontol/identifier (not just relative to _taxon_map, but something that could lead to app slowdown or server overloading or something), I'd definitely want to use /ontol/identifier. The problem with _taxon_map is that it would add significant complexity to my code. I'd have to "carry around" that data until it's needed, either in a global accumulative map or in the node information, and have special logic for parsing it, etc. Also keep in mind that I don't actually need this information at all unless/until a user actually filters a search by taxon (a filter that won't always even appear). So I think it makes sense to do the request on demand.

Another thing to note: one of our stated/agreed-upon goals of this rewrite was to avoid hard-coded and one-off exceptions, because the 2.0 UI is full of them and they cause much difficulty.

Full matches it is. Also, I agree that returning the full set of ID results per label is more useful; if you want to match the behavior of the existing /ontol/labeler/ endpoint you'd just take the first element, as you mentioned. Regarding case-insensitivity, I was eventually able to get it working, but it's very slow (several seconds vs. milliseconds) in comparison to doing an exact (i.e., case-sensitive) match. I assume this is because the label field isn't indexed as case-insensitive, so it has to do a full map of the label column to lowercase, then a linear search for the query phrase.
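The suspected cost difference can be illustrated with an in-memory analogy. This is not how the backend is implemented; the row layout and function names below are assumptions made purely to contrast a per-query scan with a pre-normalized index.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (id, label) rows standing in for the label column; illustrative only.
ROWS: List[Tuple[str, str]] = [("NCBITaxon:10090", "Mus musculus")]


def scan_casefold(rows: List[Tuple[str, str]], label: str) -> List[str]:
    """No suitable index: normalize every stored label on every query (linear time)."""
    wanted = label.casefold()
    return [id_ for id_, stored in rows if stored.casefold() == wanted]


def build_casefold_index(rows: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Normalize once up front, so later lookups are a single dict access."""
    index: Dict[str, List[str]] = defaultdict(list)
    for id_, stored in rows:
        index[stored.casefold()].append(id_)
    return dict(index)


def indexed_casefold(index: Dict[str, List[str]], label: str) -> List[str]:
    """Case-insensitive lookup against the pre-normalized index."""
    return index.get(label.casefold(), [])


CASEFOLD_INDEX = build_casefold_index(ROWS)
print(scan_casefold(ROWS, "mus MUSCULUS"))
print(indexed_casefold(CASEFOLD_INDEX, "mus MUSCULUS"))
```

Both functions return the same IDs; the difference is that the scan redoes the normalization for every query, while the index pays that cost once.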

The problem with _taxon_map is that it would add significant complexity to my code. I'd have to "carry around" that data until it's needed, either in a global accumulative map or in the node information, and have special logic for parsing it, etc.

Ah, right, then yeah, I'd be fine dropping it in that case.

Hmm, that is really slow. I kinda threw out that "case-insensitively" very casually; I hope you didn't feel like you had to spend time on it.

Case sensitive should be fine, assuming there's no funny business, like the facets field returning modified case or some other endpoint needing a different casing.