# Data Explorer

## Overview
Data Explorer lets you explore a dataset. The code (in this repo and the [data-explorer-indexers](https://github.com/DataBiosphere/data-explorer-indexers) repo) is dataset-agnostic. All dataset configuration happens in config files.
Examples:
- Data Explorer for the 1000 Genomes dataset. Config files here and here.
- Data Explorer for the Framingham Heart Study Teaching Dataset. This Data Explorer demonstrates time-series visualizations. Config files here and here.
## Quickstart
Run a local Data Explorer with the 1000 Genomes dataset:

- If `~/.config/gcloud/application_default_credentials.json` doesn't exist, create it by running `gcloud auth application-default login`.
- Run `docker-compose up --build`
- Navigate to `localhost:4400`
- If you want to use the Save in Terra feature, do the one-time setup described in [One-time setup for Save in Terra feature](#one-time-setup-for-save-in-terra-feature).
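The quickstart condensed into a single shell session (a sketch, assuming `gcloud` and `docker-compose` are installed and port 4400 is free):

```bash
# One-time: create application default credentials if the file is missing.
ls ~/.config/gcloud/application_default_credentials.json \
  || gcloud auth application-default login

# Build and start all services (Elasticsearch, API server, UI).
docker-compose up --build

# When the containers are up, open the UI in a browser:
#   http://localhost:4400
```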
## Run local Data Explorer with a custom dataset
- Index your dataset into Elasticsearch. Before you can run the servers in this repo to display a Data Explorer UI, your dataset must be indexed into Elasticsearch. Use an indexer from https://github.com/DataBiosphere/data-explorer-indexers.
- Create `dataset_config/<my dataset>`:
  - If you used https://github.com/DataBiosphere/data-explorer-indexers, copy the config directory from there.
    - Copy and fill out `ui.json`. (`ui.json` is not in the `data-explorer-indexers` repo.)
  - If you used your own indexer, copy the config files from here and here. All files except `gcs.json` must be filled out.
- If you want to use the Save in Terra feature, do the one-time setup described in [One-time setup for Save in Terra feature](#one-time-setup-for-save-in-terra-feature).
- If `~/.config/gcloud/application_default_credentials.json` doesn't exist, create it by running `gcloud auth application-default login`.
- Run `DATASET_CONFIG_DIR=dataset_config/<my dataset> docker-compose up --build -t 0` (see the consolidated sketch after this list).
  - The `-t 0` makes Kibana stop more quickly after `Ctrl-C`.
  - If you get an error like `ui_1 | Module not found: Can't resolve 'superagent' in '/ui/src/api/src'`, add `-V`: `DATASET_CONFIG_DIR=dataset_config/<my dataset> docker-compose up --build -t 0 -V`. `-V` is only needed for the next invocation of `docker-compose`, not all future invocations.
  - If ES crashes due to OOM, you can increase heap size: `ES_JAVA_OPTS="-Xms10g -Xmx10g" docker-compose up --build -t 0`
- Navigate to `localhost:4400`
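For reference, the steps above as one shell session (a sketch; `<my dataset>` is a placeholder for your config directory name, and the `curl` check assumes Elasticsearch's default port 9200):

```bash
# Start Elasticsearch alone and confirm the indexer populated it.
docker-compose up --build -d elasticsearch
curl localhost:9200/_cat/indices

# Create application default credentials if needed.
ls ~/.config/gcloud/application_default_credentials.json \
  || gcloud auth application-default login

# Start everything against your dataset's config directory.
DATASET_CONFIG_DIR=dataset_config/<my dataset> docker-compose up --build -t 0
```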
## Architecture overview
The basic flow:
- Index the dataset into Elasticsearch using an indexer from https://github.com/DataBiosphere/data-explorer-indexers
- Run the servers in this repo to display the Data Explorer UI
GCP deployment:
For local development, an nginx reverse proxy is used to get around CORS:
## Want to try out Data Explorer for your dataset?
Here's one possible flow:

- Run local Data Explorer with the public 1000 Genomes dataset. This makes sure docker and git are installed correctly. (A JSON cache of the 1000 Genomes indices is imported into Elasticsearch; no indexer is run.)
- Run the local BigQuery indexer with the 1000 Genomes dataset.
- Run locally with your dataset.
- Deploy on GCP for your dataset.
## Sample file support
If your dataset includes sample files (VCF, BAM, etc.), then Data Explorer will have:

- A Samples Overview facet, which gives an overview of your sample files.
- Sample file facets that display the number of sample files instead of the number of participants. For example, if your dataset has 100 participants and each participant has 5 files, and there is a facet for "Raw coverage", the number on the upper right of the facet can be 0-500 and represents how many sample files are in the current selection.
## Time series support

If your dataset has longitudinal data, then Data Explorer will show time-series visualizations:
## Development

### Updating the API using swagger-codegen
We use swagger-codegen to automatically implement the API, as defined in `api/api.yaml`, for the API server and the UI. Whenever the API is updated, follow these steps to update the server implementations:

- Clear out existing generated models:

  ```
  rm ui/src/api/src/model/*
  rm api/data_explorer/models/*
  ```

- Regenerate Javascript and Python definitions.

  - From the .jar (Linux):

    ```
    java -jar ~/swagger-codegen-cli.jar generate -i api/api.yaml -l python-flask -o api -DsupportPython2=true,packageName=data_explorer
    java -jar ~/swagger-codegen-cli.jar generate -i api/api.yaml -l javascript -o ui/src/api -DuseES6=true
    yapf -ir . --exclude ui/node_modules --exclude api/.tox
    ```

  - From the global script (macOS or other):

    ```
    swagger-codegen generate -i api/api.yaml -l python-flask -o api -DsupportPython2=true,packageName=data_explorer
    swagger-codegen generate -i api/api.yaml -l javascript -o ui/src/api -DuseES6=true
    yapf -ir . --exclude ui/node_modules
    ```

- Update the API and UI servers.
- Don't forget to fix JS warnings. (Otherwise CircleCI will fail.)
## One-time setup
- `docker-compose` should be at least 1.21.0. The data-explorer-indexers repo refers to the network created by `docker-compose` in this repo. Prior to 1.21.0, the network name was `dataexplorer_default`. Starting with 1.21.0, the network name is `data-explorer_default`.
- Install `swagger-codegen-cli.jar`. This is only needed if you modify `api.yaml`:

  ```
  # Linux
  wget https://repo1.maven.org/maven2/io/swagger/swagger-codegen-cli/2.3.1/swagger-codegen-cli-2.3.1.jar -O ~/swagger-codegen-cli.jar
  # macOS
  brew install swagger-codegen
  ```

- In `ui/`, run `npm install`. This installs tools used during git precommit, such as formatting tools.
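A quick way to check the `docker-compose` requirement (a sketch; the network only exists after `docker-compose up` has been run at least once in this repo):

```bash
# Must report version 1.21.0 or newer.
docker-compose --version

# After the containers have started, the network should be data-explorer_default.
docker network ls | grep data-explorer
```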
## One-time setup for Save in Terra feature
The Save in Terra feature temporarily stores data in a GCS bucket.

- If you haven't already, fill out `deploy.json` for your dataset.
  - Even if you don't plan on deploying Data Explorer to GCP, `deploy.json` still needs to be filled out. A temporary file will be written to a GCS bucket in the project in `deploy.json`, even for a local deployment of Data Explorer. Choose a project where you have at least Project Editor permissions.
- Create the export bucket. This only needs to be done once per deploy project. Run `deploy/create-export-url-bucket.sh DATASET` from the root of the repo, where `DATASET` is the name of the directory in `dataset_config`.
- The Save in Terra feature requires a service account private key. Follow these instructions to download a key. This needs to be done once per person per deploy project. If three people run Data Explorer with the same deploy project, then all three need to download a key for the deploy project.
  - Go to the Service Accounts page for your deploy project.
  - Click the three-dot Actions menu for the `App Engine default service account` -> Create Key -> CREATE.
  - Move the downloaded file to `dataset_config/DATASET/private-key.json`.
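The scriptable parts of this setup in one place (a sketch; `DATASET` stands in for your `dataset_config` directory name, and the downloaded key filename is hypothetical):

```bash
# Once per deploy project: create the export bucket.
deploy/create-export-url-bucket.sh DATASET

# Once per person per deploy project: move the key downloaded from the
# Service Accounts page into the dataset config directory.
mv ~/Downloads/my-project-abc123.json dataset_config/DATASET/private-key.json
```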
## Testing
Every commit on a remote branch kicks off all tests on CircleCI.

API server unit tests use pytest and tox. To run locally:

```
virtualenv ~/virtualenv/tox
source ~/virtualenv/tox/bin/activate
pip install tox
cd api && tox -e py35
```

End-to-end tests use Puppeteer and jest-puppeteer. To run locally:

```
# Optional: ensure the elasticsearch index is clean
docker-compose up --build -d elasticsearch
curl -XDELETE localhost:9200/_all
# Start the rest of the services
docker-compose up --build
cd ui && npm test
```

Troubleshooting tips for end-to-end tests:

- Uncomment `headless` to see the browser during the test run.
- Run a single test: `npm test -- -t Participant`
- More tips here
## Formatting

`ui/` is formatted with Prettier. husky is used to automatically format files upon commit. To fix formatting, run `npm run fix` in `ui/`.

Python files are formatted with YAPF.
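Both fixes as commands (a sketch; the yapf invocation mirrors the one used in the codegen steps above):

```bash
# Fix JS/CSS formatting with Prettier via the npm script.
cd ui && npm run fix

# Fix Python formatting with YAPF (run from the repo root).
yapf -ir . --exclude ui/node_modules --exclude api/.tox
```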