ga4gh/ga4gh-schemas

RNA expression data structure is inefficient

kozbo opened this issue · 7 comments

kozbo commented

From email chain with Rob:

Here’s a few stories based on [Treehouse’s](https://treehousegenomics.soe.ucsc.edu/) current Jupyter analysis [notebooks](https://github.com/UCSC-Treehouse/jupyterhub):

1) I have a mapping {feature -> threshold value} and a sample. I'd like to get a vector of (feature, expression value) for all features in that sample where expression value > feature[threshold].

2) I'd like to get the features in a sample sorted by expression value.

3) I'd like to generate a set of samples based on phenotype information (not just "disease = ALAL" but more complex queries, able to include or exclude samples by identifier)

4) I have a set of sample identifiers. For each feature in samples in that set, I'd like to get statistics on its expression values within that set (median, quartiles, interquartile range).


For #1 I tried to code ups something similar but that would work on 1kgenomes:

https://jupyter.medbook.io/user/rcurrie/notebooks/readonly/rcurrie/Treehouse%20GA4GH%20User%20Stories.ipynb

# All males from the GBR population
biosample_ids = [b.id for b in c.search_biosamples(dataset_id=dataset.id)
                 if b.info["Population"].values
                 and "GBR" in [v.string_value for v in b.info["Population"].values]
                 and "male" in c.get_individual(b.individual_id).sex.term]
print "Samples found:", len(biosample_ids)

Takes about 8 seconds to run. The same type of query for the RNASeq database but looking for “Thyroid” and “Femail” took over 3 minutes

But currently as far as I can tell it is impossible to get the expression levels for an individual or biosample due to curation so its kind of a dead end. My gut at this point is we can’t use the server for Treehouse due to the lack of server side query as for most of the stories we’ll end up basically walking through every individual, biosample, and expression level on the client in order to get what we need.

The Treehouse compendium is ~11k samples by 30k genes/features. Storing on disk in an hdf5 file take about 1GB and loads into a dataframe (R or Python) in < 600msec. Differential expression analysis typically requires getting a subset of the 'columns' - say 500 expression vectors after subsetting by diseases. This works out to ~500 * 30k features rows. I suspect the emerging single cell world will be much the same but with the addition of very sparse data lending itself to compression. The current GA4GH reference server database schema being single row per level doesn't get any of the optimizations of row nor column orientation, significantly expands the data (float into ~128 bytes) and can't be compressed. The protocol buffer schema has the same issue. Suggest this should be re-considered towards storing the levels in an array, either single row blob, or a column database, or external file with meta data in the existing relational database so that it can be usable for current clinical differential use cases as well as emerging single cell research cases.

kozbo commented

@saupchurch We were going to bring up this use case on the call today.

To bootstrap the effort of modeling the Genomics domain the way that we have, we've picked up a lot of assumptions of the underlying file types. Our variant representation is weighed down by VCF, read alignments by BAM, etc. The community moves slowly, and the schemas are our opportunity to decouple the database/file layer from the interaction layer.

As we evolve the API, in addition to adding the methods and fields that biologists find practical based on existing usage patterns, we should work to remove the legacy of the file representation and move more of that logic to standardized ETL pipelines, allowing biologists to spend more time reasoning about their domain.

An important concern Rob raises (thanks @rcurrie) is that we don't present useful methods for analysis. The same objection has been raised of the variants API. You can get everything back, and be able to reason about single documents well, but you always get more than you need.

I can imagine an interesting experiment where we provide an alternative way of querying expressions. Instead of splitting across ~5 requests to get an expression level you pass in a list of sample identifiers and feature names.

Add an endpoint called "expressionlevels/select" that takes a Select message containing a list of sample_id and feature_id (or name).

message RNASelectRequest {
  repeated string sample_ids = 1;
  repeated string feature_ids = 2;
}

It then returns a table with quantifications that match the requested sample_ids in the columns and the requested feature_ids as rows. Cells simply contain the expression level.

message ExpressionVector {
  string biosample_id = 1;
  string rna_quantification_id = 2;
  repeated float expression = 3;
}

message RNASelectResponse {
  // All expression vectors in the response are of the same length
  repeated ExpressionVector = 1;

  // A list of feature ids that matches the length of the expression   
  // vector.
  repeated string feature_ids = 2
}

I believe this is tractable based on the data model we currently present and would demonstrate a valuable analysis use case. It reduces the transfer required for the most common access pattern (show me these samples against these genes) down to a minimum. The returned response is essentially a table where each row is tagged with the sample it came from, and the experiment it came from.

Constructing a select request will require iterating over metadata to get the list of sample IDs one is interested in. With a list of genes and samples, one should be able to query arbitrarily large stores using different slicing techniques.

I would note that having the raw_read_count might also be useful as I believe that it is common to recalculate normalization when aggregating results. In that case we could provide a flag in the SelectRequest, or include both values in the vector.

This same pattern could be used with the variants API, I believe, where a list of sample IDs and variant IDs could be used to assemble a list of call vectors.

One of the problems with this approach is that it assumes that the data can be easily queried in the way the method presents. In practice, it requires that API implementors keep their data in structures that can be joined and merged. This assumption was presented for the Reads API, which has a method for assembling alignments from multiple BAMs in the ReadsSearchRequest. The problem is that there are few practical implementations that provide the ability to fetch across multiple BAMs. Because of this, the reference server has never implemented that feature fully.

Part of the value of the API we present is that it aims to be as low cost to implement over existing stores, tries to only present the methods required for interchange, and presents documents that can be reasoned about alone. We don't want everyone to need to have hadoop, or all their RNA in one table, or to look up in documentation/external metadata to see what a value means.

I believe a service could use the API as it presents itself to create the Select application described. Considering that, one can imagine optimizing the lower layer to cache API results that would satisfy various requests, eventually storing them in a database.

We should definitely work to make the API efficient, but I imagine you can think of this layer of the API like the FASTQ of read alignment. It is a low layer protocol that analysis applications will work on top of, and those analysis applications take the responsibility of filtering/removing keys, and making sure only the data needed to drive inquiry is transferred to you, the analyst.

The idea of a select query is a good one - this would essentially recreate an earlier proposed pattern where the ExpressionLevel objects were associated with a "featureSet" and retrievable by that set. Your idea @david4096 would do a similar thing with the added flexibility of making the "feature set" definable at the query side.

kozbo commented

+1 on this idea. It would be great to have @david4096 's approach implemented