HumanCellAtlas/data-consumer-vignettes

ElasticSearch fields are opaque to consumers

natanlao opened this issue · 1 comment

Many of the vignettes in this repository make POST /search requests to filter bundles based on their contents. For example, in the old Download SmartSeq Expression Matrix for Scanpy notebook, this ElasticSearch query was used to find recent bundles with .results files:

query = {
    "query": {
        "bool": {
            "must": [
                {
                    "match": {                                                           # It needs to
                        "files.file_json.files.content.file_core.file_format": "results" # have a
                    }                                                                    # results file...
                },
                {
                    "range": {
                        "manifest.version": {
                            "gte": "2018-07-12T100000.000000Z" # ...and preferably not be too old, either.
                        }
                    }
                }
            ]
        }
    }
}
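For context, a query like this is sent as the JSON body of a POST /search request. The following is only a minimal sketch of building such a request with the standard library. The base URL is a hypothetical placeholder, and both the `replica` query parameter and the `es_query` wrapper key are assumptions that are not stated in this issue and should be checked against the data-store API documentation.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical placeholder; substitute the real data-store base URL.
DSS_BASE_URL = "https://dss.example.org/v1"


def build_search_request(es_query, replica="aws", base_url=DSS_BASE_URL):
    """Build (but do not send) a POST /search request for an ElasticSearch query.

    Assumptions: the query is wrapped under an "es_query" key in the JSON
    body, and the replica is selected with a "replica" query parameter.
    """
    url = base_url + "/search?" + urllib.parse.urlencode({"replica": replica})
    body = json.dumps({"es_query": es_query}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# A reduced form of the query above: only the file-format condition.
example_query = {
    "query": {
        "match": {
            "files.file_json.files.content.file_core.file_format": "results"
        }
    }
}
request = build_search_request(example_query)
# To actually send it: urllib.request.urlopen(request)
```

Building the request separately from sending it makes it possible to inspect the URL and body before anything is run against prod.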

The form of these queries is inaccessible to the researchers these vignettes are written for. (In our case, two other DCP developers and I spent about thirty minutes trying to write a better-formed query that worked against prod, since the one above didn't. I imagine the process would be much more difficult for an unaffiliated researcher.)

There is no clear, accessible documentation that tells researchers which fields to query against in a POST /search request to filter data programmatically.
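Until such documentation exists, one workaround is to take a document that is known to be indexed and list the dotted field paths it contains. The following is a rough sketch of that idea, not an existing tool: `list_field_paths` is a hypothetical helper, and `sample_document` is a made-up, heavily trimmed document whose shape is only inferred from the field names in the query above.

```python
def list_field_paths(document, prefix=""):
    """Return the sorted dotted paths of every leaf value in a nested document."""
    paths = set()
    if isinstance(document, dict):
        for key, value in document.items():
            child = f"{prefix}.{key}" if prefix else key
            paths |= set(list_field_paths(value, child))
    elif isinstance(document, list):
        # ElasticSearch field paths do not include list positions,
        # so every item in a list shares the same prefix.
        for item in document:
            paths |= set(list_field_paths(item, prefix))
    else:
        paths.add(prefix)
    return sorted(paths)


# Hypothetical sample; real indexed documents contain many more fields.
sample_document = {
    "manifest": {"version": "2018-07-12T100000.000000Z"},
    "files": {
        "file_json": {
            "files": [
                {"content": {"file_core": {"file_format": "results"}}},
                {"content": {"file_core": {"file_format": "bam"}}},
            ]
        }
    },
}

field_paths = list_field_paths(sample_document)
for path in field_paths:
    print(path)
```

This only shows which fields exist in one document. It does not say how each field is mapped or analyzed, which is part of what proper documentation would need to cover.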

(To underscore this point, the example query on the HumanCellAtlas/data-store README returns no results against prod.)

This seems like something the query service component should be able to help with.