airr-community/adc-api

facets returning incorrect result on array fields

Opened this issue · 5 comments

from @bcorrie in an iReceptor issue:

returning an array (and indeed arrays of arrays) for some fields, which I am pretty sure is not what we want…

$ curl --insecure --data '{"facets":"sample.pcr_target.pcr_target_locus"}' https://localhost/airr/v1/repertoire | jq

{
  "Info": {
    "title": "AIRR Data Commons API for VDJServer Community Data Portal",
    "description": "VDJServer ADC API response for repertoire query",
    "version": 1.3,
    "contact": {
      "name": "VDJServer",
      "url": "http://vdjserver.org/",
      "email": "vdjserver@utsouthwestern.edu"
    }
  },
  "Facet": [
    {
      "sample.pcr_target.pcr_target_locus": [
        [
          "TRB"
        ],
        [
          "TRB"
        ]
      ],
      "count": 10
    },
    {
      "sample.pcr_target.pcr_target_locus": [
        [
          "IGH"
        ],
        [
          "IGH"
        ]
      ],
      "count": 1
    },

[..snip..]

  ]
}

This looks to be the result of a simple count aggregation grouped on a field that sits inside an array. The relevant code snippet:

	if (query) agg.push({ $match: query });
	agg.push(
		{ $group: {
		    _id: '$' + facets,
		    count: { $sum: 1}
		}});

Can we do this and then "clean up" the response by "flattening" the counts?
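An alternative to cleaning up afterwards is to flatten before grouping. The sketch below adds one `$unwind` stage per array level, so the group key becomes a single value instead of an array. The helper name `buildFacetPipeline` and the `arrayPaths` parameter are hypothetical, and it assumes that `sample` and `sample.pcr_target` are the array-valued paths in a repertoire document. Note that this counts every instance of a value, not repertoires, and that `$unwind` by default drops documents where the array is missing, null, or empty.

```javascript
// Hypothetical helper: builds the aggregation pipeline for a facets query.
// `arrayPaths` lists the array-valued ancestors of the facet field, outermost
// first. It is an assumption here and would have to come from the schema.
function buildFacetPipeline(query, facets, arrayPaths) {
  const agg = [];
  if (query) agg.push({ $match: query });
  // One $unwind per array level turns each array element into its own document.
  for (const path of arrayPaths) {
    agg.push({ $unwind: '$' + path });
  }
  // After unwinding, the group key is a single value rather than an array.
  agg.push({ $group: { _id: '$' + facets, count: { $sum: 1 } } });
  return agg;
}

const pipeline = buildFacetPipeline(
  null,
  'sample.pcr_target.pcr_target_locus',
  ['sample', 'sample.pcr_target'] // assumed array paths
);
console.log(JSON.stringify(pipeline, null, 2));
```

The resulting array could be passed to the same aggregate call that the original snippet feeds.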

@bcorrie In the case of repertoires with multiple samples, what count should be returned? For example, say we have two repertoires, each with 3 sample processing records with TRB for pcr_target_locus. Do we return 6 because there are six samples with TRB loci, or do we return 2 because there are 2 repertoires with at least one sample that's TRB?

It seems intuitive to me that this should be giving you counts where the sum is a sum of each instance of a TRB - so that would be six. I think you have this problem on other sample level fields.

curl --insecure --data '{"facets":"sample.sample_id"}' https://vdj-staging.tacc.utexas.edu/airr/v1/repertoire

gives odd counts like:

        {
            "sample.sample_id": [
                "P1_6w"
            ],
            "count": 2
        },
        {
            "sample.sample_id": [
                "P1_6w",
                "P1_6w",
                "P1_6w"
            ],
            "count": 1
        }

This seems like it should be 5, given that the sample_id P1_6w occurs 5 times in total (two repertoires with one occurrence each, plus one repertoire with three).

and

        {
            "sample.sample_id": [
                "Donor3_naive_IGL",
                "Donor3_naive_IGH"
            ],
            "count": 1
        }

This seems like it should be two different counts, one per unique sample_id, with a count of 1 for each sample_id.
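The per-instance counts described above can be derived from the current response by post-processing. This is a minimal sketch, not the API's implementation: the function names `flattenValues` and `countInstances` are hypothetical, and it assumes each bucket's `count` is the number of repertoires sharing that exact array, so every occurrence of a value in a bucket contributes that bucket's count once.

```javascript
// Recursively flatten arbitrarily nested arrays into a flat list of values.
function flattenValues(value) {
  return Array.isArray(value) ? value.flatMap(flattenValues) : [value];
}

// Sum counts per distinct value: each occurrence of a value in a bucket
// contributes that bucket's count once.
function countInstances(buckets, field) {
  const totals = new Map();
  for (const bucket of buckets) {
    for (const value of flattenValues(bucket[field])) {
      totals.set(value, (totals.get(value) || 0) + bucket.count);
    }
  }
  return totals;
}

// The three sample_id buckets quoted above.
const buckets = [
  { 'sample.sample_id': ['P1_6w'], count: 2 },
  { 'sample.sample_id': ['P1_6w', 'P1_6w', 'P1_6w'], count: 1 },
  { 'sample.sample_id': ['Donor3_naive_IGL', 'Donor3_naive_IGH'], count: 1 },
];

const totals = countInstances(buckets, 'sample.sample_id');
console.log(Object.fromEntries(totals));
// { P1_6w: 5, Donor3_naive_IGL: 1, Donor3_naive_IGH: 1 }
```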

Hmm, these are "interesting"...

        {
            "sample.pcr_target.pcr_target_locus": [
                [
                    "IGH",
                    "IGL"
                ],
                [
                    "IGH"
                ]
            ],
            "count": 1
        },
        {
            "sample.pcr_target.pcr_target_locus": [
                [
                    "IGL"
                ],
                [
                    "IGH"
                ]
            ],
            "count": 5
        },
        {
            "sample.pcr_target.pcr_target_locus": [
                [
                    "IGH",
                    "IGK",
                    "IGL"
                ]
            ],
            "count": 30
        },
        {
            "sample.pcr_target.pcr_target_locus": [
                [
                    "IGH",
                    "IGL",
                    "IGK"
                ],
                [
                    "IGH",
                    "IGL",
                    "IGK"
                ]
            ],
            "count": 1
        }

I assume these are being caused by a combination of repertoires having multiple samples (the outer array) and some samples having multiple pcr_target_locus values (the inner array) in different combinations of the above?

> It seems intuitive to me that this should be giving you counts where the sum is a sum of each instance of a TRB - so that would be six. I think you have this problem on other sample level fields.

I understand that, but I'll offer a counterpoint. The question is: a count of "what"? On the repertoire endpoint, I'd think it would be useful if it were always a count of repertoires. Likewise for rearrangements. Then the client can always say with the facets data, "X repertoires have such and such". If it's "every instance of a TRB", the context of the count changes based upon the field: it's the count of samples, or the count of pcr_targets, or the count of something else, and sometimes the count of repertoires. This wouldn't be an issue if we didn't have arrays in Repertoire; then it would always be counts of repertoires.

So it comes down a bit to what we think is the most useful count of "what" to provide.
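For comparison, here is a rough sketch of how the "always a count of repertoires" option could be expressed as a pipeline. It is an illustration only: `buildRepertoireCountPipeline` and `arrayPaths` are hypothetical names, and it assumes each repertoire is a single document with a unique `_id` and that `sample` and `sample.pcr_target` are the array-valued paths. After unwinding, a first `$group` keeps one document per (repertoire, value) pair, and a second `$group` counts those pairs per value.

```javascript
// Hypothetical helper: counts distinct repertoires per facet value.
function buildRepertoireCountPipeline(query, facets, arrayPaths) {
  const agg = [];
  if (query) agg.push({ $match: query });
  // Flatten each array level so the facet field resolves to a single value.
  for (const path of arrayPaths) {
    agg.push({ $unwind: '$' + path });
  }
  // First group: one document per (repertoire, value) pair, which removes
  // repeated values within the same repertoire.
  agg.push({ $group: { _id: { doc: '$_id', value: '$' + facets } } });
  // Second group: count the remaining pairs, i.e. repertoires, per value.
  agg.push({ $group: { _id: '$_id.value', count: { $sum: 1 } } });
  return agg;
}

const repertoirePipeline = buildRepertoireCountPipeline(
  null,
  'sample.pcr_target.pcr_target_locus',
  ['sample', 'sample.pcr_target'] // assumed array paths
);
console.log(JSON.stringify(repertoirePipeline, null, 2));
```

With this shape the count always means "X repertoires have this value", regardless of which field is faceted.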

> I assume these are being caused by a combination of repertoires having multiple samples (the outer array) and some samples having multiple pcr_target_locus values (the inner array) in different combinations of the above?

Yes, exactly, so what should the IGH count be? 37, as the sum of all? Or 39 because IGH appears twice in two of them?

Unfortunately, I'm not sure I can extract the count of repertoires from this. Maybe it's 37 too.
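Both candidate numbers can be checked against the four pcr_target_locus buckets quoted above. The sketch below is only an illustration over those four buckets (the full response may contain more), with hypothetical function names. It assumes each bucket's `count` is the number of repertoires that share that exact nested array, which is what the `$group` with `$sum: 1` on the repertoire collection would produce.

```javascript
// Recursively flatten arbitrarily nested arrays into a flat list of values.
function flattenValues(value) {
  return Array.isArray(value) ? value.flatMap(flattenValues) : [value];
}

// For each value, compute two totals:
//   instances:   every occurrence in a bucket adds the bucket's count
//   repertoires: a bucket adds its count once, however often the value repeats
function summarizeBuckets(buckets, field) {
  const summary = {};
  for (const bucket of buckets) {
    const values = flattenValues(bucket[field]);
    for (const v of values) {
      summary[v] = summary[v] || { repertoires: 0, instances: 0 };
      summary[v].instances += bucket.count;
    }
    for (const v of new Set(values)) {
      summary[v].repertoires += bucket.count;
    }
  }
  return summary;
}

const field = 'sample.pcr_target.pcr_target_locus';
const locusBuckets = [
  { [field]: [['IGH', 'IGL'], ['IGH']], count: 1 },
  { [field]: [['IGL'], ['IGH']], count: 5 },
  { [field]: [['IGH', 'IGK', 'IGL']], count: 30 },
  { [field]: [['IGH', 'IGL', 'IGK'], ['IGH', 'IGL', 'IGK']], count: 1 },
];

const summary = summarizeBuckets(locusBuckets, field);
console.log(summary.IGH); // { repertoires: 37, instances: 39 }
```

Under that assumption, the repertoire count for IGH is recoverable from the current response as 37, while 39 is the per-instance count.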