transmart-data

Introduction

This repository is a set of make files and scripts for:

creating the tranSMART database and all its objects;
populating the database with the bare minimum data necessary to start tranSMART and use the ETL scripts;
generating the SQL scripts and dependency metadata necessary for the two points above, from a clean database;
fixing permissions and tablespace assignments;
importing some example data - these are intended for development, the targets are not robust enough to be generally applicable;
running the Solr cores for Faceted Search and the Sample Explorer;
generating configuration files for tranSMART.

The current schema is the one necessary to support the master branch of transmart (release 1.2) for Oracle and Postgres

This goal is to have this project displace transmart-DB by providing a better way to manage the tranSMART database.

This project does not handle database upgrades and is therefore more adequate for development. Using Liquibase here or some other more ad hoc solution for that problem is being studied.

A script is available that can compare dumps from two databases, intended for an installed copy to be compared to the latest transmart-data.

Requirements

The following are required:

GNU make
PostgreSQL client utilities (psql, psql_dump, etc.)
curl
php (>= 5.4)
tar with support for the -J switch (GNU tar only?)
An up-to-date checkout of the tranSMART-ETL repository. Revision e712fcd7 is necessary for Faceted Search support (ETL only)
Groovy (>= 2.1). Can be installed with make -C env groovy and updating the PATH (Oracle and some secondary functionality only)
Kettle (ETL only)
rsync (Solr only)
Java environment (JDK 7)
build-essentials tools and gcc-fortran (building R from source only)

If you are using Ubuntu and you intend to use PostgreSQL, you should be able to install all these dependencies by running

sudo make -C env ubuntu_deps_root
make -C env ubuntu_deps_regular

which will also prepare some directories for the tablespaces and assign them the correct ownership .

Usage

Start with copying the vars.sample file, editing it and sourcing it in:

cp vars.sample vars
vim vars
# edit file and save...
. ./vars

If you ran make -C env ubuntu_deps_regular, you will have a vars file created for you. You can skip the previous step and do only:

. ./vars

The several options are fairly self-explanatory.

PostgreSQL-specific notes

The configured PostgreSQL user must be a database superuser. You can connect to PostgreSQL with UNIX sockets by specifying the parent directory of the socket in PGHOST. In that case, localhost will be used in the situation where UNIX sockets are not supported, such as for JDBC connections.

The variable $TABLESPACES is the parent directory for where the tablespaces will be created in the PostgreSQL server's filesystem.

The database creation target assumes the PostgreSQL server runs on the same machine. The target does not attempt to create the tablespaces if they already exist, but it will nevertheless attempt to create the directories where it thinks the tablespaces should be according to $TABLESPACES.

Note that if the tablespaces directories do not already exist or are not assigned the correct owner (i.e., the user PostgreSQL runs as), then the install target will run into problems when attempting to create the tablespaces. If the user PostgreSQL runs as and the user running the targets are not the same AND the tablespace directories do not already exist, then manual intervention is necessary for creating all the tablespace directories and assigning them the correct owner.

Oracle-specific notes

This install script assumes SYSDBA privileges in order to create the users that the application will use.

You will need to create the tablespaces TRANSMART and INDX (sic) manually before creating the database. The install script can try to do this automatically if you set ORACLE_MANAGE_TABLESPACES=1, but in that case you will need to set ORACLE_TABLESPACES_DIR to a directory name (on the database server) where Oracle can automatically create the files to store the new tablespaces and other database objects.

Transmart requires the 'Partitioning' feature of the database, so make sure that feature has not been disabled in the Oracle database. 'Partitioning' is (as of Oracle Database 12) only supported in Oracle Database Enterprise Edition, so Transmart will not run against a lower level edition of Oracle.

Drop the database

make postgres_drop

make oracle_drop

Create the database and load everything

make -j4 postgres

make -j4 oracle

For PostgreSQL, you can skip the tablespace assignments, which are not really important for development, by setting the environment variable skip_fix_tablespaces to any non-empty value:

skip_fix_tablespaces=1 make -j4 postgres

The Oracle version, on the other hand, does not manage tablespaces by default. For forcing, use ORACLE_MANAGE_TABLESPACES:

ORACLE_MANAGE_TABLESPACES=1 make -j4 oracle

For PostgreSQL, there's a simple script in data/postgres/set_password.sh for changing users' passwords. If you're using Oracle, you can still use part of it to generate the hashes.

Only fix permissions, owners or tablespace assignments (PostgreSQL only)

These can be done with the targets fix_permissions, fix_owners and fix_tablespaces, under ddl/postgres/META. Example:

make -C ddl/postgres/META fix_permissions

Running ETL targets

This project can be used to download public or privately available datasets and load these into the database. Typically, but not always, this loading involves using Kettle jobs (with kitchen) available in an existing checkout of (tranSMART-ETL)[ts-etl] .

Data fetching is done through a system of feeds that publish several data sets. A data set is a combination of meta data and, for most data types, properly formatted data files. The most important meta data, the meta data that uniquely identifies the data set, are the study name and the tranSMART data type (clinical data, mRNA data, etc.). A feed is an object that provides a list of triplets consisting of study name, data type and location of a tarball containing the data and full metadata for the dataset.

The tarballs must follow the following rules:

they must be compressed with xz;
their filename must be in the form <study name>_<data type>.tar.xz;
they must have a file named <data type>.params in their root;
most data types require, under the root, a directory named <data type>, under which the data files should be located.

For instance, a clinical data tarball could contain the following tree:

.
|-- clinical
|   |-- E-GEOD-8581.sdrf-rewrite.txt
|   |-- E-GEOD-8581_columns.txt
|   `-- E-GEOD-8581_wordmap.txt
`-- clinical.params

Feeds are listed in samples/studies/public-feeds and, optionally, in a git-ignored private-feeds, in the root of the project. Each file contains a list of feeds in the following format:

<feed type> <type-specific feed location data>

The two supported feed types right now are http-index and ftp-flat. Examples:

http-index http://studies.thehyve.net/datasets_index
ftp-flat ftp://studies.thehyve.net/

The initial public-feeds download uses a TranSMART Foundation host. Changing the URL in samples/studies/public-feeds will use an alternative source for all data downloads.

The type http-index points to a plain text file that lists the available datasets in the format <study> <type> <tarball relative path>. Example:

Cell-line acgh Cell-line/Cell-line_acgh.tar.xz
Cell-line clinical Cell-line/Cell-line_clinical.tar.xz

The type ftp-flat points to an FTP directory that should store, directly underneath it, all the dataset tarballs.

The list of datasets is automatically downloaded when needed, but it is not automatically updated. To update it, run at the root:

make update_datasets

If a data set (study/type combination) is repeated, transmart-data will assume it is the same data and will download randomly from any source. If any download fails, another source will be tried until there is none left.

When developing the data sets, the best course of action is to place the files directly under samples/studies/<study name>. The contents of this directory should include the contents of the tarball. When finished, the tarball can be created with (running from the root of the project):

tar -C samples/studies/<study name> \
    -cJf <study name>_<data type>.tar.xz <data type>.params <data type>

Data sets can be loaded by running a target named load_<data type>_<study name> in either samples/oracle or samples/postgresql, as appropriate.

For instance:

make -C samples/{oracle,postgres} load_clinical_GSE8581
make -C samples/{oracle,postgres} load_ref_annotation_GSE8581
make -C samples/{oracle,postgres} load_expression_GSE8581

Do not forget to update your Solr index, if your setup requires it to be triggered manually.

For MacOSX

The loading scripts use the -e option from the function readlink. This is a function that is not in the readlink that is installed on Mac OSX, to bypass this problem you are required to install greadlink (stands for GNU readlink). After installing greadlink edit ~/transmart-data/samples/postgres/process_params.inc and on change readlink to greadlink (line 20). Save the changes and the upload should work.

Starting Solr

To start a Solr instance with one core for Faceted Search and another for the sample explorer:

make -C solr start

Once it is running, you can run full imports with:

make -C solr browse_full_import rwg_full_import sample_full_import

The Faceted Search core also supports delta imports:

make -C solr rwg_delta_import

Due to different functionality in tranSMART versions targeting each RDBMS, there's a separate Solr core for Oracle:

ORACLE=1 make -C solr start
ORACLE=1 make -C solr sanofi_full_import

Document (e.g. PDFs) reindexing has a special procedure.

Running Rserve

You can install a recent version of R from your distro and run the R scripts inside the R directory to install the required packages. You can also build R from source:

make -C R -j8 root/bin/R
make -C R install_packages

Beware that if you run install_packages with concurrency some packages will fail building and you may have to run the command again.

There is an init script and a systemd unit available to run Rserve. It has to be run as the same user as the tranSMART web application:

TRANSMART_USER=<tomcat user> make -C R install_rserve_init
update-rc.d rserve defaults 85 # to enable the service

This is applicable to the Debian-family distributions. See also the install_rserve_unit target for systemd based distros.

Changing ownership or permission information

The default schema permissions are set in ddl/postgres/META/default_permissions.php. Tables where biomart_user should be able to write are defined in ddl/postgres/META/biomart_user_write.tsv. Extra miscellaneous permissions are defined in ddl/postgres/META/misc_permissions.tsv.

Ownership information can only be added by editing an array in ddl/postgres/META/assign_owners.sql.

Writing the configuration files

For development, the default configuration should be sufficient. For production, the configuration may have to be changed, but even if it doesn't, some directories described in the configuration will need to be created.

If you need to change configuration parameters, you can change the files in ~/.grails/transmartConfig that are created by the install target, but they will be overwritten (after being backed up) the next time you install the configuration again using the same target. Therefore, it is preferable to copy config/Config-extra.php.sample into config/Config-extra.php and edit the new file. In this file, you can edit two blocks of text which will be inserted into two different points in the configuration template, allowing you override any option that the configuration template sets.

To install the configuration files into $TSUSER_HOME/.grails/transmartConfig, ($TSUSER_HOME is one of the environment variables defined in the vars file) run:

make -C config install

Generating new import files from model database (PostgreSQL)

This part still needs some work, but it goes more or less like this:

Start with a clean database. Preferably load if from scratch with make postgres and make only the necessary changes on top of this.
Go to ddl/postgres.
Run make clean.
Manually delete the subdirectories with the names of the schemas that you changed.
Run make <schema name>_files for every schema whose directory you deleted.
(If you want to regenerate all the schemas, you can replace the last three steps with make clean_all dump files_all).
If you have data changes, go to data/postgres. If you need to dump data from a table no data was being dumped from before, add that table to one of the <schema name>_list files.
Run make dump. You can run make clean_dumps to delete all the dumps, which might be useful e.g. if you deleted tables from the lists of tables that should be dumped.
If you changed global objects like roles (unlikely), look at the makefile on ddl/postgresql/GLOBAL for more information.
Make sure you test the changes by recreating the database.

Note that pg_dump does not output the data in any specific order; after a table is modified, it will dump the old data out of order. This results in a larger changeset than necessary and obfuscates the actual changes made to the table's data. A target is included that attempts to remedy this problem by changing the modified files so that the old rows are kept at the top of the modified file in the same order (but with the removed rows left out):

make -C data/common minimize_diffs

Use this target after modifying the tsv files but before checking them into the git index with git add. The procedure compares the version in the working directory with the version in the git index.

To update the plugin module params from the database (a doubtful course of action; better to edit the files in data/common/searchapp/plugin_modules_params/), then do:

cd data/postgres
make searchapp/Makefile
make -C searchapp dump_plugin_module
make -C ../common/searchapp/plugin_modules_params process_dump

Generating new import files from model database (Oracle)

For Oracle, only schema dumping and loading is supported:

make -C ddl/oracle dump
make -C ddl/oracle load

A separate branch exists that supports loading data from the TSV files dumped by PostgreSQL.

anthony-o/transmart-data