Lochness: Sync data from all over the cloud to a local directory

Lochness is a data management tool designed to periodically poll and download data from various data archives into a local directory. This is often referred to as building a "data lake" (hence the name).

Out of the box there is support for pulling data down from Beiwe, XNAT, REDCap, Dropbox, external hard drives, and more. Extending Lochness to support new services is also a fairly simple process.

Installation
Quick setup from a template
Documentation

Installation

Just use pip

pip install lochness

For the most recent DPACC version of lochness, install from GitHub

pip install git+https://github.com/PREDICT-DPACC/lochness

For debugging, clone the repository and add it to your paths

cd ~
git clone https://github.com/PREDICT-DPACC/lochness
pip install -r ~/lochness/requirements.txt

export PATH=${PATH}:~/lochness/scripts  # add to ~/.bashrc
export PYTHONPATH=${PYTHONPATH}:~/lochness  # add to ~/.bashrc

Running test

Copy the token template, and add the information for each module.

cd lochness/tests
cp token_template_for_test_template.csv token_template_for_test.csv

Run the tests

bash run_test.sh

Setup from a template

Creating the template

Setting up lochness from scratch can be confusing at first. Use lochness_create_template.py to create a starting point.

The commands below create example templates that lay out the lochness directory structure

# ProNET
lochness_create_template.py \
    --outdir /data/lochness_root \
    --studies PronetLA PronetSL PronetWU \
    --sources redcap xnat box mindlamp \
    --email kevincho@bwh.harvard.edu \
    --poll_interval 43200 \
    --ssh_host erisone.partners.org \
    --ssh_user kc244 \
    --lochness_sync_send \
    --s3

# PRESCIENT
lochness_create_template.py \
    --outdir /data/lochness_root \
    --studies PrescientAD PrescientME PrescientPE \
    --sources RPMS mediaflux mindlamp \
    --email kevincho@bwh.harvard.edu \
    --poll_interval 43200 \
    --ssh_host erisone.partners.org \
    --ssh_user kc244 \
    --lochness_sync_send \
    --s3

# For more options: lochness_create_template.py -h

Making edits to the template

Running one of the commands above creates the structure below (shown here for the ProNET example)

/data/lochness_root/
├── 1_encrypt_command.sh
├── 2_sync_command.sh
├── PHOENIX
│   ├── GENERAL
│   │   ├── PronetLA
│   │   │   └── PronetLA_metadata.csv
│   │   ├── PronetSL
│   │   │   └── PronetSL_metadata.csv
│   │   └── PronetWU
│   │       └── PronetWU_metadata.csv
│   └── PROTECTED
│       ├── PronetLA
│       ├── PronetSL
│       └── PronetWU
├── config.yml
├── lochness.json
└── pii_convert.csv
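
As a quick sanity check after creating the template, the layout above can be verified with a few lines of Python. This is only a sketch: `missing_template_paths` is a hypothetical helper, not part of lochness, and it checks only the paths shown in the tree above.

```python
from pathlib import Path


def missing_template_paths(root, studies):
    """Return the expected template paths that do not exist under root."""
    root = Path(root)
    # Top-level files shown in the tree above.
    expected = [
        root / "1_encrypt_command.sh",
        root / "2_sync_command.sh",
        root / "config.yml",
        root / "lochness.json",
        root / "pii_convert.csv",
    ]
    # One metadata file under GENERAL and one directory under PROTECTED per study.
    for study in studies:
        expected.append(
            root / "PHOENIX" / "GENERAL" / study / f"{study}_metadata.csv")
        expected.append(root / "PHOENIX" / "PROTECTED" / study)
    return [path for path in expected if not path.exists()]
```

For example, `missing_template_paths("/data/lochness_root", ["PronetLA", "PronetSL", "PronetWU"])` returns an empty list when every path is in place. Once lochness.json has been removed after encryption, it will be reported as missing, which is expected.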

Edit config.yml and lochness.json as needed.
Either update the PHOENIX/GENERAL/*/*_metadata.csv files manually, or make sure the field names in your REDCap / RPMS sources match the names listed below so that lochness can update the metadata files automatically.

Currently, lochness initializes the metadata using the following field names in REDCap and RPMS.

chric_subject_id: the record ID field name
- this field name must be in the REDCap or RPMS repository for the metadata to be updated by lochness.
chric_consent_date: the field name of the consent date
- this field name must be in the REDCap or RPMS repository for the metadata to be updated by lochness.
beiwe_id: the field name of the Beiwe ID.
xnat_id: the field name of the XNAT ID.
dropbox_id: the field name of the Dropbox ID.
box_id: the field name of the Box ID.
mediaflux_id: the field name of the Mediaflux ID.
mindlamp_id: the field name of the Mindlamp ID.
daris_id: the field name of the DaRIS ID.
rpms_id: the field name of the RPMS ID.
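
The field list above can be turned into a small pre-flight check on the field names exported from a REDCap or RPMS source. This is a sketch, not lochness code: `check_metadata_fields` is a hypothetical helper, and treating the source ID fields as optional is an assumption based on only the first two fields being described as required.

```python
# Fields that must exist for lochness to update the metadata (per the list above).
REQUIRED_FIELDS = ("chric_subject_id", "chric_consent_date")

# Source ID fields from the list above; assumed optional in this sketch.
SOURCE_ID_FIELDS = (
    "beiwe_id", "xnat_id", "dropbox_id", "box_id",
    "mediaflux_id", "mindlamp_id", "daris_id", "rpms_id",
)


def check_metadata_fields(field_names):
    """Return (missing required fields, source ID fields that are present)."""
    names = set(field_names)
    missing_required = [f for f in REQUIRED_FIELDS if f not in names]
    present_ids = [f for f in SOURCE_ID_FIELDS if f in names]
    return missing_required, present_ids
```

If the first list in the result is not empty, the metadata files would need to be maintained by hand until the missing fields are added to the source.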

Encrypt the lochness.json by running

cd /data/lochness_root
bash 1_encrypt_command.sh

This encryption step writes an encrypted copy of the keyring to /data/lochness_root/.lochness.enc. To protect the sensitive keyring information, remove lochness.json after running the encryption.

You can still view the keyring structure, without the sensitive values, by running

lochness_check_config.py -ke /data/lochness_root/.lochness.enc

Set up the REDCap Data Entry Trigger if using REDCap. See the "REDCap Data Entry Trigger capture" section below.
Edit the personally identifiable information (PII) mapping table. See the "Personally identifiable information removal from REDCap and RPMS data" section below.

/data/lochness_root/pii_convert.csv

Run sync.py directly, or use the example command in 2_sync_command.sh

bash 2_sync_command.sh

To use `lochness_to_lochness` transfer through `aws s3`

Set up s3 bucket
Install aws CLI
Configure CLI with your s3 bucket information

$ aws configure

Add your AWS information to config.yml

AWS_BUCKET_NAME: ampscz-dev
AWS_BUCKET_ROOT: TEST_PHOENIX_ROOT

REDCap Data Entry Trigger capture

If your sources include REDCap and you would like to configure lochness to only pull new REDCap data, "Data Entry Trigger" needs to be set up in REDCap.

In REDCap,

"Project Setup"
"Enable optional modules and customizations"
"Additional customizations"
Check "Data Entry Trigger" and enter the address of the server where lochness runs, including the port number, e.g. http://pnl-t55-7.partners.org:9999

To use this functionality, the server where lochness is installed must be able to receive HTTP POST requests from the REDCap server. This means that either

the lochness server is inside the same firewall as the REDCap server, or
the lochness server has an open port that can listen for the REDCap POST requests.

After enabling the "Data Entry Trigger" in the REDCap settings, run the command below to update /data/data_entry_trigger_db.csv in real time

# please specify the same port defined in the REDCap settings
listen_to_redcap.py --database_csv /data/data_entry_trigger_db.csv \
                    --port 9999

It is useful to run listen_to_redcap.py in the background, for example inside a GNU screen session, so that it keeps running without interruption.
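
To illustrate what a Data Entry Trigger listener does, here is a minimal sketch built on the Python standard library. It is not the implementation of listen_to_redcap.py: the CSV columns, the file name, and the helper names are assumptions, and it relies on REDCap sending a form-encoded POST body with fields such as `project_id`, `record`, `instrument`, and `redcap_event_name`.

```python
import csv
from datetime import datetime, timezone
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs

DATABASE_CSV = "data_entry_trigger_db.csv"  # hypothetical path for this sketch
COLUMNS = ["timestamp", "project_id", "record", "instrument", "redcap_event_name"]


def parse_trigger(body):
    """Decode a form-encoded POST body into a flat dict (first value per key)."""
    fields = parse_qs(body.decode("utf-8"))
    return {key: values[0] for key, values in fields.items()}


def append_trigger(csv_path, trigger):
    """Append one trigger event to the CSV, writing a header if the file is empty."""
    row = {name: trigger.get(name, "") for name in COLUMNS}
    row["timestamp"] = datetime.now(timezone.utc).isoformat()
    with open(csv_path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        if f.tell() == 0:
            writer.writeheader()
        writer.writerow(row)


class TriggerHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the number of bytes announced by the client.
        length = int(self.headers.get("Content-Length", 0))
        append_trigger(DATABASE_CSV, parse_trigger(self.rfile.read(length)))
        self.send_response(200)
        self.end_headers()


def serve(port=9999):
    """Listen on all interfaces; the port must match the REDCap setting."""
    HTTPServer(("", port), TriggerHandler).serve_forever()
```

Calling `serve(9999)` would start the listener. A sync job can then read the CSV to decide which records changed since the last pull, instead of downloading every record again.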

Personally identifiable information removal from REDCap and RPMS data

You can provide the path to a CSV file that specifies how each PII field should be processed.

For example

/data/personally_identifiable_process_mappings.csv

pii_label_string | process
-----------------|---------------
address          | remove
date             | change_date
phone_number     | random_number
patient_name     | random_string
subject_name     | replace_with_subject_id

For any field whose name matches a pii_label_string row, the process listed in that row is applied to the raw value to remove or replace the PII.
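
The mapping table can be read as a small dispatch table. The sketch below shows one way the five processes might behave; it is not the lochness implementation. The matching rule (label is a substring of the field name), the date format, the fixed day offset, and the helper names are all assumptions.

```python
import random
import string
from datetime import datetime, timedelta


def process_value(value, process, subject_id, date_offset_days=0):
    """Apply one PII process to a raw value (behaviors are assumptions)."""
    if value == "" or process == "remove":
        return ""
    if process == "change_date":
        # Assumes ISO dates; shifts by a fixed number of days.
        shifted = datetime.strptime(value, "%Y-%m-%d") + timedelta(days=date_offset_days)
        return shifted.strftime("%Y-%m-%d")
    if process == "random_number":
        # Replace each digit, keep separators such as "-".
        return "".join(random.choice(string.digits) if c.isdigit() else c
                       for c in value)
    if process == "random_string":
        return "".join(random.choice(string.ascii_letters) for _ in value)
    if process == "replace_with_subject_id":
        return subject_id
    raise ValueError(f"unknown process: {process}")


def scrub_record(record, mappings, subject_id, date_offset_days=0):
    """Return a copy of record with every matching field processed."""
    cleaned = {}
    for field, value in record.items():
        # First mapping whose label appears in the field name wins.
        process = next((p for label, p in mappings.items() if label in field), None)
        cleaned[field] = (process_value(value, process, subject_id, date_offset_days)
                          if process else value)
    return cleaned
```

With the example table above loaded into a dict such as `{"address": "remove", "date": "change_date", ...}`, a field named `home_address` would be emptied while a field with no matching label would pass through unchanged.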

Documentation

You can find all the documentation you will ever need here
