Scripts to create and load data into the scxa-* Solr indexes (for analytics and gene2experiment). Running the tasks here requires that the bin/ directory in the root of this repo is on the PATH, and that the following executables are available:
- awk
- jq (1.5)
- curl
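A quick way to confirm the prerequisites before running anything is sketched below. The helper name missing_tools is hypothetical and not part of this repository; it only checks that each tool is on the PATH and does not verify the jq version.

```shell
# Print the name of every command in the argument list that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Tools this README lists as prerequisites.
missing=$(missing_tools awk jq curl)
if [ -n "$missing" ]; then
  echo "Missing required tools:" $missing >&2
else
  echo "All required tools found."
fi
```

Running jq --version afterwards shows whether the installed jq matches the 1.5 release mentioned above.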
Version 0.2.0 was used for loading the August/September 2018 Single Cell Expression Atlas release.
The CI setup uses authentication with a default user and password. The calls assume these settings (solr:SolrRocks), but the user and password can be changed with:
export SOLR_USER=<new-user>
export SOLR_PASS=<new-pass>
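The fallback to the default credentials can be pictured with the sketch below. It assumes the scripts behave as described above; the helper solr_auth is hypothetical and is not copied from the repository.

```shell
# Print user:password, falling back to the defaults named in this README
# when SOLR_USER / SOLR_PASS are not set.
solr_auth() {
  printf '%s:%s' "${SOLR_USER:-solr}" "${SOLR_PASS:-SolrRocks}"
}

# Example use (needs a reachable Solr at $SOLR_HOST, so it is left commented):
#   curl -u "$(solr_auth)" "http://$SOLR_HOST/solr/admin/collections?action=LIST"
```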
To use default auth in a new SolrCloud instance, upload test/security.json to ZK as shown in the "Setup auth" part of run_tests_in_containers.sh. To set up users in a production setting, the script create-users.sh expects two sets of credentials:
ADMIN_USER=<admin-username>
ADMIN_U_PWD=<password>
QUERY_USER=<query-username>
QUERY_U_PWD=<password>
It creates both users, giving the first admin privileges and the second read-only privileges, deletes the default user, and configures the instance to accept only authenticated users.
To create the schema, set the environment variable SOLR_HOST to the appropriate server, and execute as shown:
export SOLR_HOST=192.168.99.100:32080
After doing this, copy the scatlas.owl file to all your running SolrCloud containers. Set the SCXA_ONTOLOGY environment variable to the path of the OWL file as mounted inside the container. Remember to prepend file:// to the value of the variable, e.g. file:///opt/solr/server/solr/scatlas.owl.
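As a small illustration of the file:// rule, the sketch below derives the variable from the in-container path. The helper to_file_uri is hypothetical; the path is the one from the example above and should be adjusted to wherever the OWL file is mounted.

```shell
# Turn an absolute path inside the container into the file:// URI
# expected in SCXA_ONTOLOGY; reject relative paths.
to_file_uri() {
  case "$1" in
    /*) printf 'file://%s\n' "$1" ;;
    *)  echo "expected an absolute path, got: $1" >&2; return 1 ;;
  esac
}

SCXA_ONTOLOGY=$(to_file_uri /opt/solr/server/solr/scatlas.owl)
export SCXA_ONTOLOGY
echo "$SCXA_ONTOLOGY"
```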
create-scxa-analytics-config-set.sh
create-scxa-analytics-collection.sh
The scxa-analytics-v8 collection makes use of the BioSolr plugin to perform ontology expansion on document indexing. In order to enable BioSolr, there are 3 options:
Place the BioSolr jar (which can be found in the repository's lib directory) under /server/solr/lib/ in your Solr installation directory. This is the oldest option and has some security issues, but should be fine for testing.
Newer versions of Solr introduced the package manager, a new approach for making third-party JARs and files available to Solr. This involves the following steps:
- Create a private/public key pair (you can run create-keys-for-tests.sh as shown in run_tests_in_containers.sh and keep those).
- Start SolrCloud with the -Denable.packages=true option, as done in the CI.
- Upload the public key to Solr through ZooKeeper (see how the SIGNING_* variables are used and the "Upload der to Solr" part, both in run_tests_in_containers.sh).
- Sign the JAR file with the private key and upload it to the Solr file store (in our case, BioSolr solr-ontology-update-processor-2.0.0.jar; this is done by upload-biosolr-lib.sh in analytics.bats, which runs inside the Solr container, so the private key is mounted inside that container on startup).
- Create the package biosolr in Solr, pointing to that signed JAR in the Solr file store (also done by upload-biosolr-lib.sh).
- Verify the package (also done by upload-biosolr-lib.sh).
- Deploy the package as part of the schema creation (done by create-scxa-analytics-schema.sh).
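The signing and registration steps above can be sketched roughly as follows. This follows the general pattern described in the Solr Reference Guide for the package manager, not the actual contents of upload-biosolr-lib.sh, which are not shown here. The helper names, the key file name biosolr.pem, the package version string and the file-store path layout are all assumptions made for illustration.

```shell
# Sign a JAR with a private key and print a base64 signature with '+'
# percent-encoded so it can be used in a query string.
# Assumes openssl is installed. Arguments: <private-key.pem> <jar>
sign_jar() {
  openssl dgst -sha1 -sign "$1" "$2" | openssl enc -base64 | tr -d '\n' | sed 's/+/%2B/g'
}

# JSON body that registers a package version pointing at a file-store path.
# Arguments: <package> <version> <file-store-path>
package_payload() {
  printf '{"add": {"package": "%s", "version": "%s", "files": ["%s"]}}\n' "$1" "$2" "$3"
}

JAR=solr-ontology-update-processor-2.0.0.jar
STORE_PATH="/biosolr/2.0.0/$JAR"   # hypothetical layout in the file store

package_payload biosolr 2.0.0 "$STORE_PATH"

# Against a SolrCloud node started with -Denable.packages=true, the calls
# might look like this (left commented because they need a live server):
#   SIG=$(sign_jar biosolr.pem "lib/$JAR")
#   curl -u "$SOLR_USER:$SOLR_PASS" --data-binary @"lib/$JAR" -X PUT \
#     "http://$SOLR_HOST/api/cluster/files$STORE_PATH?sig=$SIG"
#   curl -u "$SOLR_USER:$SOLR_PASS" -H 'Content-type:application/json' \
#     -d "$(package_payload biosolr 2.0.0 "$STORE_PATH")" \
#     "http://$SOLR_HOST/api/cluster/package"
```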
In the CI, all these steps are done, in some cases through the API and in others through direct bin/solr calls. The latter might require a container with the same Solr version plus the URI of the desired Solr server (or running them inside the Solr server itself).
Please note that changing the Solr version will most likely require changes to the BioSolr plugin, at the very least to point to the newer Solr version, and hence a new JAR will need to be added here. Version 2.0.0 was built against Solr 8.7 (as used in the CI).
create-scxa-analytics-schema.sh
You can override the default target Solr collection name by setting SOLR_COLLECTION, but remember to include the additional v<schema-version-number> at the end, or the loader might refuse to load this.
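A pre-flight check on the collection name could look like the sketch below. The exact rule the loader applies is not shown in this README, so the pattern used here (a trailing -v followed by digits, as in scxa-analytics-v8) is an assumption, and has_schema_version_suffix is a hypothetical helper.

```shell
# Succeed when the name ends in -v<digits>, e.g. scxa-analytics-v8.
has_schema_version_suffix() {
  printf '%s\n' "$1" | grep -Eq -- '-v[0-9]+$'
}

SOLR_COLLECTION=${SOLR_COLLECTION:-scxa-analytics-v8}
if has_schema_version_suffix "$SOLR_COLLECTION"; then
  echo "Collection name looks valid: $SOLR_COLLECTION"
else
  echo "Collection name lacks a -v<schema-version-number> suffix: $SOLR_COLLECTION" >&2
fi
```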
For the Single Cell Expression Atlas, run the script:
create-scxa-analytics-suggesters.sh
We use multiple dictionaries (dictionaryImpl) in a single SuggestComponent to fetch various suggestions:
- ontologyAnnotationSuggester
- ontologyAnnotationAncestorSuggester
- ontologyAnnotationParentSuggester
- ontologyAnnotationSynonymSuggester
- ontologyAnnotationChildSuggester
For the SCXA, to build the suggesters with multiple dictionaries in Solr, run this script:
build-scxa-analytics-suggestions.sh
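Once built, several dictionaries can be queried in one request by repeating Solr's suggest.dictionary parameter. The sketch below only assembles the query string; the helper suggest_query is hypothetical, the search term is not URL-encoded, and the /suggest handler path in the commented call is an assumption about how the config set registers the component.

```shell
# Build a query string asking one SuggestComponent for several dictionaries.
# Arguments: <term> <dictionary>...
suggest_query() {
  term=$1; shift
  qs="suggest=true&suggest.q=$term"
  for dict in "$@"; do
    qs="$qs&suggest.dictionary=$dict"
  done
  printf '%s\n' "$qs"
}

suggest_query skin ontologyAnnotationSuggester ontologyAnnotationSynonymSuggester

# Possible use against a live server (handler path assumed):
#   curl -u "$SOLR_USER:$SOLR_PASS" \
#     "http://$SOLR_HOST/solr/scxa-analytics-v8/suggest?$(suggest_query skin ontologyAnnotationSuggester)"
```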
This module loads data from a condensed SDRF in an SCXA experiment to the scxa-analytics-v8 collection in Solr. Temporary files are created as part of this process; by default they are written to $PWD, but this can be overridden by exporting the $WORKDIR variable. You should make sure that the running user has write permissions to either the current working directory, or $WORKDIR if it has been set.
export SOLR_HOST=192.168.99.100:32080
export CONDENSED_SDRF_TSV=../scxa-test-experiments/magetab/E-GEOD-106540/E-GEOD-106540.condensed-sdrf.tsv
load-scxa-analytics.sh
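The temporary-directory rule described above can be checked before loading with a sketch like this one. The helper resolve_workdir is hypothetical; it mirrors the described behaviour ($WORKDIR if set, otherwise $PWD) rather than the loader's actual code.

```shell
# Print the directory that would hold temporary files and fail when it
# is missing or not writable by the current user.
resolve_workdir() {
  dir=${WORKDIR:-$PWD}
  if [ -d "$dir" ] && [ -w "$dir" ]; then
    printf '%s\n' "$dir"
  else
    echo "Cannot write temporary files to: $dir" >&2
    return 1
  fi
}

workdir=$(resolve_workdir) && echo "Temporary files will go to $workdir"
```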
In order to delete a particular experiment's analytics Solr documents based on its accession from a live index, do:
export EXP_ID=desired-exp-identifier
export SOLR_HOST=192.168.99.100:32080
delete_scxa_analytics_index.sh
Tests are located in the tests directory and require Docker to run. To run them, execute run_tests_in_containers.sh. The tests folder includes example data in TSV (a condensed SDRF) and in JSON (as it should be produced by the first step that translates the condensed SDRF to JSON).
To create the schema, set the environment variable SOLR_HOST to the appropriate server, and execute as shown:
export SOLR_HOST=192.168.99.100:32080
create-scxa-gene2experiment-config-set.sh
create-scxa-gene2experiment-collection.sh
create-scxa-gene2experiment-schema.sh
You can override the default target Solr collection name by setting SOLR_COLLECTION, but remember to include the additional v<schema-version-number> at the end, or the loader might refuse to load this.
This module loads data from a Matrix Market rows file (set in the environment variable MATRIX_MARKT_ROWS_GENES_FILE) containing gene identifiers in the rows for an SCXA experiment to the scxa-gene2experiment-v1 collection in Solr. The experiment accession needs to be set in the environment variable EXP_ID.
These routines expect the collection to be created already, and work as an update to the content of the collection (deduplicating experiment_accession,gene_id tuples). Temporary files are created as part of this process; by default they are written to $PWD, but this can be overridden by exporting the $WORKDIR variable. You should make sure that the running user has write permissions to either the current working directory, or $WORKDIR if it has been set.
export SOLR_HOST=192.168.99.100:32080
export EXP_ID=E-GEOD-106540
export MATRIX_MARKT_ROWS_GENES_FILE=../path/to/E-GEOD-106540.aggregated_counts.mtx_rows
load-scxa-gene2experiment.sh
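To give a feel for the deduplicated experiment_accession,gene_id tuples, the sketch below derives them from a rows file with awk. It assumes the gene identifier is the first whitespace-separated field on each line; the helper gene2experiment_pairs and the sample rows are made up for illustration and are not the loader's implementation.

```shell
# Emit unique "experiment_accession<TAB>gene_id" pairs from a rows file.
# Arguments: <experiment-accession> <rows-file>
gene2experiment_pairs() {
  awk -v exp_id="$1" 'NF > 0 && !seen[$1]++ { print exp_id "\t" $1 }' "$2"
}

# Sample input with one repeated gene identifier.
rows=$(mktemp)
printf 'ENSG00000000003\tENSG00000000003\n'  > "$rows"
printf 'ENSG00000000005\tENSG00000000005\n' >> "$rows"
printf 'ENSG00000000003\tENSG00000000003\n' >> "$rows"

gene2experiment_pairs E-GEOD-106540 "$rows"
rm -f "$rows"
```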
In order to delete a particular experiment's gene2experiment Solr documents based on its accession from a live index, do:
export EXP_ID=desired-exp-identifier
export SOLR_HOST=192.168.99.100:32080
delete-scxa-gene2experiment-exp-entries.sh
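Deleting by accession corresponds to a Solr delete-by-query request. The sketch below builds such a request body using the experiment_accession field named above; the helper delete_payload is hypothetical, and the commented curl call is an assumption about how the script talks to Solr, not a copy of it.

```shell
# Build a Solr delete-by-query JSON body for one experiment accession.
# The accession is wrapped in escaped quotes so hyphens are matched literally.
delete_payload() {
  printf '{"delete": {"query": "experiment_accession:\\"%s\\""}}\n' "$1"
}

delete_payload E-GEOD-106540

# Possible use against a live server:
#   curl -u "$SOLR_USER:$SOLR_PASS" -H 'Content-Type: application/json' \
#     -d "$(delete_payload "$EXP_ID")" \
#     "http://$SOLR_HOST/solr/scxa-gene2experiment-v1/update?commit=true"
```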
Tests are located in the tests directory and require Docker to run. To run them, execute run_tests_in_containers.sh. The tests folder includes example data in Matrix Market format.
The container is available at quay.io/ebigxa/index-scxa-module, under the latest tag or any tag after 0.2.0, and can be used like this:
docker run -v /local_data:/data \
-e EXP_ID=<the-accession-of-experiment> \
-e SOLR_HOST=<solr-host:solr-port> \
-e MATRIX_MARKT_ROWS_GENES_FILE=<path-inside-container-for-matrixMarkt-file> \
--entrypoint load_scxa_gene2experiment_index.sh \
quay.io/ebigxa/index-scxa-module:latest
Please note that MATRIX_MARKT_ROWS_GENES_FILE must be consistent with how you mount data inside the container. You can change the entrypoint and the environment variables to use the other scripts mentioned above.