airr-community/adc-api

enhance the test data set with more variability


We should add more variability to the test data set for repertoires so that we can test different scenarios, in particular for arrays (see the sketch after this list).

  • Arrays: some repertoires should have multiple sample entries.
  • Arrays of arrays: multiple pcr_target entries within multiple sample entries.
  • Arrays: multiple entries for data types of string, number, integer, and boolean.
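For illustration, a minimal sketch of what such a test repertoire might look like in the AIRR JSON Repertoire format, covering the first two cases. The nesting follows the AIRR Repertoire schema (a sample array whose entries each carry a pcr_target array); all values are placeholders, not data from any real study:

```json
{
  "Repertoire": [
    {
      "repertoire_id": "test_repertoire_1",
      "sample": [
        {
          "sample_id": "sample_1",
          "pcr_target": [
            { "pcr_target_locus": "IGH" },
            { "pcr_target_locus": "IGK" }
          ]
        },
        {
          "sample_id": "sample_2",
          "pcr_target": [
            { "pcr_target_locus": "TRB" }
          ]
        }
      ]
    }
  ]
}
```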

Should we add this variability by adding other "artificial" studies to the GOLD data set and keep Florian's data "pristine", in the sense that it remains an accurate representation of a real study? It seems to me that there is some value in having a "correct and complete" study as part of the GOLD standard. If we want to test API features, maybe we should create data specific to those tests rather than add features to Florian's metadata to enable them?

It's not pristine though. The data set has already been modified for test purposes.

I suppose I am suggesting that we go back to Florian's original data for the GOLD data set, and for any changes that were introduced to test API features, create new artificial data sets that exercise them. It doesn't seem like a good idea for us to publish a downloadable repertoire metadata file for a real study when that metadata isn't correct for that study.

If it's that important to you, go for it. I'm really not interested in giving myself more maintenance work by having a "completely correct study" in multiple places. If it bothers you that the data is too close to Florian's, it's easy enough to randomize values, or get rid of the Florian data and put in one of your curated data sets...

I think it is helpful to the community to have a real data set as an example. Given that we have Florian's data, already use it as an example data set for many things, and have tests that run against it, the easy path for me seems to be to leave that data alone as an example data set. It is a single repertoire JSON file and a bunch of TSV files that describe a real study, and it demonstrates the use of both the AIRR JSON Repertoire file format and the AIRR TSV Rearrangement file format. This is good...

Further, if you load that data into an AIRR compliant repository and query it with the ADC API, we have a set of expected results that you should get, plus a test suite and a mechanism to run such a test (the adc-api-test GitHub repository). As far as Florian's data goes, I don't think we need to do any more work than that...
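For reference, a query test of this kind amounts to POSTing a JSON filter body to the repository's /airr/v1/repertoire endpoint and comparing the response against the expected results. A minimal sketch of such a query body, using the ADC API filter structure (the study_id value shown is a placeholder, not Florian's actual study):

```json
{
  "filters": {
    "op": "=",
    "content": {
      "field": "study.study_id",
      "value": "PRJNA000000"
    }
  }
}
```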

The question is how we test queries that aren't applicable to Florian's data. I worry about the data provenance of using Florian's data set to further test the API by changing its repertoire metadata: we would then have a file that incorrectly describes Florian's study, which seems like a bad idea... I think that is a different type of test, and perhaps one that we shouldn't use Florian's data for.

This means we don't have to do any more work around Florian's data, but if we want to test more complicated queries, on data complicated enough to actually exercise them, then I am thinking that artificial data might be better. This will require more work...

Perhaps we don't need to do that at all; I am not sure that we need test queries that provide 100% coverage. We have extensive query tests that cover a real data set, and perhaps that is enough for us to provide from a community perspective... We might want something more for internal unit testing of VDJServer and iReceptor, but that doesn't need to be of the same quality, or made publicly available like the data set and test suite we already have around Florian's data...

haha, sounds like you don't want to do the work either ;-D

"...real data set as an example..."

That's not my purpose for this test suite. My primary interest is a test suite that tests the functionality of the API. I'm not interested in expanding the scope.