airr-community/common-repo-wg

Repertoire API return rearrangement count?

Closed this issue · 6 comments

Hi @schristley

One of the things that the current iReceptor API returns for /samples is the count of the number of rearrangements for each sample. For those not in the know, /samples in the iReceptor API is basically the equivalent of /repertoire in the AIRR API.

Returning the rearrangement count for a repertoire is a convenience, but for iReceptor at least it is pretty fundamental to the purpose of the API call: we always want to know how many rearrangements are associated with a repertoire. It is one thing to know that a repertoire exists, but it seems to me that the next question one would likely ask is how many rearrangements that repertoire has. Certainly this is something that iReceptor always wants (we want to tell the user whether the repertoire they found has 10 or 10 million rearrangements associated with it).

My question is, does it make sense to provide this as part of the /repertoire API directly (as we do in the current iReceptor API) as a summary statistic? Given that this is the fundamental link between the two conceptual levels of the API (and the AIRR Data Representation) this makes some sense to me.

The "facets" capability of the query API allows one to aggregate and count on a feature of the repertoire (e.g. "facets":"subject.subject_id"). I don't think there is a mechanism within the API to count the number of rearrangements for each repertoire, as it isn't part of the metadata. It is a summary statistic of the repertoire's rearrangements and is an operation on the /rearrangements API. Thus, to reproduce what the iReceptor API does today in a single call (returning a set of repertoires and their rearrangement counts), we would need to make N + 1 API calls: one query to the /repertoire API to get the set of repertoires that meet the query criteria, and one query to the /rearrangements API for each of the N returned repertoires to ask for the count of rearrangements in that repertoire.
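To make the N + 1 pattern concrete, here is a rough Python sketch. It is only an illustration under stated assumptions: `post` is a hypothetical transport callable (not part of any real client library), and the shapes of the count request and of the "Repertoire" and "Facet" response keys are assumed, not something this thread has settled.

```python
from typing import Any, Callable, Dict

# Hypothetical transport: takes an endpoint path and a JSON-like query body
# and returns the decoded JSON response. Assumed for illustration only.
Post = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def counts_n_plus_one(post: Post, repertoire_query: Dict[str, Any]) -> Dict[str, int]:
    """Return {repertoire_id: rearrangement count} using N + 1 requests."""
    # Call 1: find the repertoires that match the query criteria.
    # The "Repertoire" response key is an assumption.
    repertoires = post("/repertoire", repertoire_query).get("Repertoire", [])

    counts: Dict[str, int] = {}
    for rep in repertoires:
        rep_id = rep["repertoire_id"]
        # Calls 2..N+1: one count request per repertoire.
        # The filter/facets body and the "Facet" response key are assumptions.
        body = {
            "filters": {
                "op": "=",
                "content": {"field": "repertoire_id", "value": rep_id},
            },
            "facets": "repertoire_id",
        }
        buckets = post("/rearrangements", body).get("Facet", [])
        # An empty facet list is treated as zero rearrangements.
        counts[rep_id] = buckets[0]["count"] if buckets else 0
    return counts
```

The cost is one round trip per repertoire, so a query that matches a few hundred repertoires turns into a few hundred requests before the gateway can show anything to the user.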

At some level, this is a question of how "clean" or "simple" we want the API to be in terms of "just" querying fields, versus the functionality that we want it to provide to meet the use cases that we have. This goes back to the use cases and takes the question to the next logical step: for the API calls that we are developing, what are users of the API going to do next, and is there anything that the API can do to facilitate those next steps? This is potentially one example of such a case.

Thoughts?

Brian

I think it would only require 2 API calls as the query to /rearrangements could pass a list of repertoire_ids and facets would give a count for each. However, it is still a good point because it's not clear how fast or efficient doing that facets operation would be, and whether we even want to allow facets on the /rearrangements entry point, or maybe we allow facets only for specific fields. I think we should do some performance tests because "count" can be very quick in Mongo 3.x with the right indexes.
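As a rough sketch of the two-call approach, assuming facets are allowed on the /rearrangements entry point and that an "in" filter accepts a list of repertoire_ids: `post` is a hypothetical transport callable, and the "Repertoire" and "Facet" response keys are assumptions made for illustration.

```python
from typing import Any, Callable, Dict

# Hypothetical transport: takes an endpoint path and a JSON-like query body
# and returns the decoded JSON response. Assumed for illustration only.
Post = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def counts_two_calls(post: Post, repertoire_query: Dict[str, Any]) -> Dict[str, int]:
    """Return {repertoire_id: rearrangement count} using two requests."""
    # Call 1: find the repertoires that match the query criteria.
    repertoires = post("/repertoire", repertoire_query).get("Repertoire", [])
    ids = [rep["repertoire_id"] for rep in repertoires]

    # Start every repertoire at zero so ones with no rearrangements still appear.
    counts: Dict[str, int] = {rep_id: 0 for rep_id in ids}
    if not ids:
        return counts  # nothing matched, so skip the second request

    # Call 2: a single facets query over all repertoire_ids at once.
    body = {
        "filters": {
            "op": "in",
            "content": {"field": "repertoire_id", "value": ids},
        },
        "facets": "repertoire_id",
    }
    for bucket in post("/rearrangements", body).get("Facet", []):
        counts[bucket["repertoire_id"]] = bucket["count"]
    return counts
```

This keeps the number of requests constant, but it moves the cost into the aggregation on the server, which is why the performance of facets on /rearrangements is the open question.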

@bcorrie It would be helpful for you to add some additional query scenarios that cover use cases specifically for iReceptor gateway. Right now iReceptor usage isn't covered in our current example queries.

@schristley will do... The way I was thinking of the use cases was whether it was possible for iReceptor to implement them in our current API - and whether they covered the basic cases of what our users were doing. This was pretty well covered, so I hadn't added anything.

I will add a section to the use cases from the perspective of what a gateway might want to do (which is similar to what we have) as well as what a gateway might want returned (which is maybe what we are missing at the moment).

It might be worthwhile to add use cases for what other software tools might do with the data as well.

@schristley do you mean to add to the Google Doc example query document? This does not exist in the GitHub world, does it?

@bcorrie I looked at the Google Doc, it's old and kind of a mess. I started writing a separate file meant to describe the queries, maybe add to that? My idea is that file would get integrated with airr-standards documentation at some point.

facets provides this counting, with the added benefit of being able to use filters to restrict the search. This issue started to diverge into questions about iReceptor use cases and query documentation. There is considerable ADC API documentation now, with numerous examples. Any problem that iReceptor runs into while moving to the ADC API should be filed as a new issue.