airr-community/common-repo-wg

Define what we mean by /repertoire and /rearrangement in decisions document

bcorrie opened this issue · 8 comments

Hi All,

Should we provide a description/definition in the decision document (at least as far as we currently have a description/definition) as to what we mean by repertoire and rearrangement? Perhaps a few sentences as to what is meant by these terms? If nothing else, perhaps point to the specs used in the API to implement them (and by default referring to MiAIRR) such that one can more easily see what is the intent of the two types of "gettable" objects. I am not the person to write that description or I would do it myself 8-)

Brian

I recently tried to take a stab at defining what a repertoire was, and I failed dismally... I don't think I failed because I couldn't define the immune repertoire. I think I failed in trying to describe the objects that the API endpoint /repertoire returns as a repertoire. I actually think that this API endpoint may be named incorrectly. My rationale follows:

My understanding is that a repertoire is the set of B-cells and T-cells that the body can produce. At a given point in time, an individual has a specific "repertoire" of these cells in their body, and the sequencing that is performed is essentially a sampling of an individual's specific immune repertoire. Feel free to correct my naive computer scientist definition 8-)

The idea behind an API entry point is typically that it returns an object (or a list of objects) with the API entry point name indicative of the type of object returned.

For example, a /user endpoint might return: [{username:"bcorrie", first_name:"Brian", last_name:"Corrie"}]

Our /repertoire endpoint returns a list of objects that contain the metadata that describes an object that is ultimately run through an annotation tool to get a set of rearrangements. The rearrangements acquired through sequencing and annotation are the sampling of the repertoire. The repertoire object in our case contains information about the study, subject, sample, diagnosis, cell processing, nucleic acid processing, and software processing that was performed to acquire the set of rearrangements. My question is, is the metadata returned by the API really describing a repertoire?
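To make the question concrete, here is a rough sketch of the shape of one such metadata object. The section names follow the list in the paragraph above; the key spellings and every value are made up for illustration and are not taken from the actual API.

```python
# Hypothetical shape of one object returned by /repertoire (illustration only).
# Section names mirror the paragraph above; all values are invented.
repertoire_metadata = {
    "study": {"study_title": "Example study"},
    "subject": {"subject_id": "subject-1"},
    "sample": {"sample_id": "sample-1"},
    "diagnosis": {"disease_diagnosis": "example diagnosis"},
    "cell_processing": {"cell_subset": "example subset"},
    "nucleic_acid_processing": {"template_class": "RNA"},
    "software_processing": {"software_versions": "example tool 1.0"},
}

# The object describes how the rearrangements were obtained,
# but it holds no rearrangements itself.
sections = sorted(repertoire_metadata)
contains_rearrangements = any("rearrangement" in name for name in sections)
```

Every section here is descriptive metadata, which is why it is hard to say that the object "is" a repertoire.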

My question comes from the fact that it is hard (at least for me) to write a concise description of what the /repertoire endpoint returns. I think the /rearrangement API entry point is quite straightforward to define. If /repertoire is not easy to define, maybe that implies something isn't quite right??? Or it might imply that I am a bit dense. 8-)

Of course, I don't have an alternative suggestion either.

I made a crude attempt at some example one- or two-liners for discussion tomorrow. I am sure they are not close to being what we want, but it's a start.

https://github.com/airr-community/common-repo-wg/blob/issue-17/minutes.md

We aren't defining repertoire as the absolute definition you gave. We are using the repertoire as defined by the study designer. The "definition" is the combination of the metadata and a set of observed/sampled sequences. One might study the "T-cell repertoire" or "B-cell repertoire", but one might get more specific, like "naive T-cell repertoire" or "IGHV4 B-cell repertoire", and so on. Other researchers may take the same data and "redefine" the repertoire, but they create new metadata to do that.

I imagine your disconnect is between the biological concept and the informatic concept. We want repertoire to be the biological concept, as that's how users of an AIRR repository will be thinking. But you are correct in that we need to define a repertoire schema object that encapsulates that biological concept and is provided through the /repertoire endpoint. What I currently have defined isn't quite correct and needs more refinement.

This is where things get a little complicated because software processing is not part of the definition for repertoire. The definition stops at nucleic acid processing where you have a set of FASTQ files containing the observed sequences. However, we want to provide that software processing info, and it happens to be more convenient to provide that through the /repertoire endpoint.

What we don't want to do is confuse users by saying that processing sequences with two annotation tools (say IgBlast and MIXCR) produces two repertoires. No, they are both the same repertoire, but they are two different sets of rearrangement annotations for those sequences. To handle this situation, we decided (Decision 8 in the minutes) that rearrangements would be returned for only one annotation tool, with some field to indicate when additional tools are available.
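A minimal sketch of how that decision could behave, assuming in-memory records. The record layout, the function `rearrangements_for`, and the field name `additional_tools` are all hypothetical; the minutes only say that "some field" indicates when other tools are available.

```python
# Hypothetical annotation records: the same two sequences from one repertoire,
# each annotated by two different tools.
annotations = [
    {"sequence_id": "seq1", "repertoire_id": "rep1", "tool": "IgBlast"},
    {"sequence_id": "seq2", "repertoire_id": "rep1", "tool": "IgBlast"},
    {"sequence_id": "seq1", "repertoire_id": "rep1", "tool": "MIXCR"},
    {"sequence_id": "seq2", "repertoire_id": "rep1", "tool": "MIXCR"},
]


def rearrangements_for(repertoire_id, default_tool):
    """Return rearrangements from one tool, plus the names of any other tools."""
    records = [a for a in annotations if a["repertoire_id"] == repertoire_id]
    tools = sorted({a["tool"] for a in records})
    return {
        "rearrangements": [a for a in records if a["tool"] == default_tool],
        # Hypothetical field name; flags that other annotation sets exist.
        "additional_tools": [t for t in tools if t != default_tool],
    }


result = rearrangements_for("rep1", default_tool="IgBlast")
```

The caller sees one repertoire and one set of annotations, and can tell from the extra field that a second set exists.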

So I think of /repertoire as returning repertoires with additional information about how the sequences were annotated.

Now if you really want a brain twister, think more deeply about a rearrangement. There is the biological concept of a rearrangement, but the annotations we have are computational inferences built upon more computational inferences built upon multiple layers of experimental/biological sampling, all with their own inherent biases, some known and some unknown.

@bcorrie, here is what I am thinking right now. I created a RearrangementSet object that encapsulates the processing of the raw sequences for a repertoire into a set of rearrangement annotations. Individual Rearrangement objects would have a rearrangement_set_id field as the foreign key.

# The composite schema for the repertoire object
#
# This represents a sample repertoire as defined by the study
# and experimentally observed by raw sequence data.
Repertoire:
    discriminator: AIRR
    type: object
    properties:
        repertoire_id:
            type: string
            description: Identifier for the repertoire object.
        study:
            $ref: '#/Study'
        subject:
            $ref: '#/Subject'
        sample:
            allOf:
                - $ref: '#/Sample'
                - $ref: '#/CellProcessing'
                - $ref: '#/NucleicAcidProcessing'
        sequence_data:
            $ref: '#/RawSequenceData'


# 1-to-n relationship between a repertoire and rearrangement sets
#
# Set of annotated rearrangement sequences produced by
# software processing upon the raw sequence data for a repertoire
RearrangementSet:
    discriminator: AIRR
    type: object
    properties:
        rearrangement_set_id:
            type: string
            description: Identifier for the rearrangement set object.
        repertoire_id:
            type: string
            description: Link to the repertoire that was processed to produce the rearrangement set.
        software:
            $ref: '#/SoftwareProcessing'
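
The 1-to-n relationship in the schema above can be sketched with plain Python dataclasses. This is only an illustration: the `software` string stands in for the SoftwareProcessing object, and the identifiers and the helper `sets_for` are made up.

```python
from dataclasses import dataclass


@dataclass
class Repertoire:
    repertoire_id: str


@dataclass
class RearrangementSet:
    rearrangement_set_id: str
    repertoire_id: str  # foreign key to Repertoire
    software: str       # stand-in for the SoftwareProcessing object


# One repertoire, annotated by two tools (hypothetical identifiers).
repertoire = Repertoire("rep1")
rearrangement_sets = [
    RearrangementSet("set1", "rep1", "IgBlast"),
    RearrangementSet("set2", "rep1", "MIXCR"),
]


def sets_for(rep, all_sets):
    """Collect every rearrangement set produced from one repertoire."""
    return [s for s in all_sets if s.repertoire_id == rep.repertoire_id]


set_ids = [s.rearrangement_set_id for s in sets_for(repertoire, rearrangement_sets)]
```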

@bcorrie, take a look at my recent update to the AIRR schema.

  • Changed RawSequenceData object to be SequencingRun, putting some fields from NucleicAcidProcessing into it. From discussions with @bussec.
  • Repertoire is more clearly defined to be all the study metadata up to the raw sequencing data files recorded in the SequencingRun.
  • RearrangementSet defines a SoftwareProcessing on a Repertoire. This allows multiple annotation tools to be run. The rearrangement_set_id is the actual identifier in individual rearrangement records.
  • Note that the Repertoire object is not exactly what is returned by the /repertoire endpoint. Instead the returned object will be a composition of Repertoire and RearrangementSet, yet to be defined.

There is an issue, though. Given some rearrangement annotations (AIRR TSV), you cannot use the rearrangement_set_id to directly look up the repertoire from the /repertoire/{repertoire_id} endpoint; you would need to perform a query first. One simple idea is to add repertoire_id to the rearrangement annotations.
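A small sketch of the difference, using made-up TSV rows and an in-memory list in place of a repository query. The column layout and helper names are assumptions for illustration.

```python
import csv
import io

# Hypothetical rearrangement-set records held by the repository.
rearrangement_sets = [
    {"rearrangement_set_id": "set1", "repertoire_id": "rep1"},
    {"rearrangement_set_id": "set2", "repertoire_id": "rep1"},
]

# Made-up AIRR TSV fragments, without and with a repertoire_id column.
tsv_without = "sequence_id\trearrangement_set_id\nseq1\tset2\n"
tsv_with = "sequence_id\trearrangement_set_id\trepertoire_id\nseq1\tset2\trep1\n"


def read_rows(text):
    """Parse tab-separated text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def repertoire_id_via_query(row):
    """Extra step: search the rearrangement sets to find the repertoire."""
    for rset in rearrangement_sets:
        if rset["rearrangement_set_id"] == row["rearrangement_set_id"]:
            return rset["repertoire_id"]
    return None


def repertoire_id_direct(row):
    """With repertoire_id in the row, the identifier is read directly."""
    return row["repertoire_id"]


row_a = read_rows(tsv_without)[0]
row_b = read_rows(tsv_with)[0]
```

Only the second row can be turned into a /repertoire/{repertoire_id} request without consulting anything else.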

We need to think a bit more precisely about how we want AIRR repositories to handle multiple software processing workflows, and specifically about what is provided as input when querying the /rearrangement endpoint. A list of repertoire identifiers isn't sufficient to specify which SoftwareProcessing you want; to be precise, you need to pass a list of rearrangement_set_ids.
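The ambiguity can be shown with a toy filter over in-memory records. The function names and data are hypothetical and say nothing about how the /rearrangement endpoint will actually accept its input.

```python
# Hypothetical data: one repertoire processed by two workflows.
set_to_repertoire = {"set1": "rep1", "set2": "rep1"}
rearrangements = [
    {"sequence_id": "seq1", "rearrangement_set_id": "set1"},
    {"sequence_id": "seq1", "rearrangement_set_id": "set2"},
]


def by_repertoire(repertoire_ids):
    """Select by repertoire: mixes annotations from every workflow."""
    wanted = set(repertoire_ids)
    return [
        r for r in rearrangements
        if set_to_repertoire[r["rearrangement_set_id"]] in wanted
    ]


def by_rearrangement_set(set_ids):
    """Select by rearrangement set: exactly one workflow per identifier."""
    wanted = set(set_ids)
    return [r for r in rearrangements if r["rearrangement_set_id"] in wanted]


ambiguous = by_repertoire(["rep1"])
precise = by_rearrangement_set(["set1"])
```

Querying by repertoire returns the same sequence twice, once per workflow, while querying by rearrangement set returns it once.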

We now have an ADC API manuscript and extensive documentation that discuss the repertoire metadata schema in detail, so I think we can leave the decisions document alone as a historical document.