
Download your data to a data lake.


Lochness: Sync data from all over the cloud to a local directory

Lochness is a data management tool designed to periodically poll and download data from various data archives into a local directory. This is often referred to as building a "data lake" (hence the name).

Out of the box there is support for pulling data down from Beiwe, XNAT, REDCap, Dropbox, external hard drives, and more. Extending Lochness to support new services is also a fairly simple process.

Table of contents

  1. Installation
  2. Quick setup from a template
  3. Documentation

Installation

Just use pip

pip install lochness

For the most recent DPACC version of lochness

pip install git+https://github.com/PREDICT-DPACC/lochness

For debugging

cd ~
git clone https://github.com/PREDICT-DPACC/lochness
pip install -r ~/lochness/requirements.txt

export PATH=${PATH}:~/lochness/scripts  # add to ~/.bashrc
export PYTHONPATH=${PYTHONPATH}:~/lochness  # add to ~/.bashrc

Running the tests

  • Copy the token template, and add the information for each module.
cd lochness/tests
cp token_template_for_test_template.csv token_template_for_test.csv
  • Run the tests.
bash run_test.sh

Setup from a template

Creating the template

Setting up lochness from scratch can be confusing at first. Use lochness_create_template.py to create a starting point.

The commands below create an example template that lays out the lochness directory structure:

# ProNET
lochness_create_template.py \
    --outdir /data/lochness_root \
    --studies PronetLA PronetSL PronetWU \
    --sources redcap xnat box mindlamp \
    --email kevincho@bwh.harvard.edu \
    --poll_interval 43200 \
    --ssh_host erisone.partners.org \
    --ssh_user kc244 \
    --lochness_sync_send \
    --s3

# PRESCIENT
lochness_create_template.py \
    --outdir /data/lochness_root \
    --studies PrescientAD PrescientME PrescientPE \
    --sources RPMS mediaflux mindlamp \
    --email kevincho@bwh.harvard.edu \
    --poll_interval 43200 \
    --ssh_host erisone.partners.org \
    --ssh_user kc244 \
    --lochness_sync_send \
    --s3

# For more options: lochness_create_template.py -h

Making edits to the template

Running one of the commands above will create the structure below

/data/lochness_root/
├── 1_encrypt_command.sh
├── 2_sync_command.sh
├── PHOENIX
│   ├── GENERAL
│   │   ├── PronetLA
│   │   │   └── PronetLA_metadata.csv
│   │   ├── PronetSL
│   │   │   └── PronetSL_metadata.csv
│   │   └── PronetWU
│   │       └── PronetWU_metadata.csv
│   └── PROTECTED
│       ├── PronetLA
│       ├── PronetSL
│       └── PronetWU
├── config.yml
├── lochness.json
└── pii_convert.csv
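
Before editing the template, it can help to confirm that every study has its metadata file in place. The sketch below is a minimal check written for illustration, not part of lochness; the helper name `missing_metadata` is hypothetical, and it assumes only the directory layout shown above.

```python
from pathlib import Path


def missing_metadata(root, studies):
    """Return the studies that lack PHOENIX/GENERAL/<study>/<study>_metadata.csv."""
    general = Path(root) / "PHOENIX" / "GENERAL"
    return [
        study
        for study in studies
        if not (general / study / f"{study}_metadata.csv").is_file()
    ]
```

Calling `missing_metadata("/data/lochness_root", ["PronetLA", "PronetSL", "PronetWU"])` returns an empty list when all three metadata files exist.
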
  1. Edit config.yml and lochness.json as needed.

  2. Either update PHOENIX/GENERAL/*/*_metadata.csv manually, or name the fields in the REDCap / RPMS sources as listed below so that lochness can update the metadata files automatically.

    Currently, lochness initializes the metadata using the following field names in REDCap and RPMS.

  • chric_subject_id: the record ID field name
    • this field name must be in the REDCap or RPMS repository for the metadata to be updated by lochness.
  • chric_consent_date: the field name of the consent date
    • this field name must be in the REDCap or RPMS repository for the metadata to be updated by lochness.
  • beiwe_id: the field name of the Beiwe ID.
  • xnat_id: the field name of the XNAT ID.
  • dropbox_id: the field name of the Dropbox ID.
  • box_id: the field name of the Box ID.
  • mediaflux_id: the field name of the Mediaflux ID.
  • mindlamp_id: the field name of the Mindlamp ID.
  • daris_id: the field name of the DaRIS ID.
  • rpms_id: the field name of the RPMS ID.
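
The field names above can be checked against an exported record before relying on the automatic metadata update. This is a minimal sketch, not lochness code: `check_record_fields` is a hypothetical helper, and the record is assumed to be a plain dict of field name to value standing in for one REDCap or RPMS row.

```python
# Field names taken from the list above.
REQUIRED_FIELDS = ("chric_subject_id", "chric_consent_date")
SOURCE_ID_FIELDS = (
    "beiwe_id", "xnat_id", "dropbox_id", "box_id",
    "mediaflux_id", "mindlamp_id", "daris_id", "rpms_id",
)


def check_record_fields(record):
    """Report missing required fields and the source ID fields that are filled in.

    `record` is a dict of field name -> value (an assumption made for
    illustration; the real export format may differ).
    """
    missing = [name for name in REQUIRED_FIELDS if not record.get(name)]
    present = [name for name in SOURCE_ID_FIELDS if record.get(name)]
    return missing, present
```

A record with a non-empty `missing` list would not be enough for the metadata update described above, since both required fields must be present.
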
  3. Encrypt lochness.json by running
cd /data/lochness_root
bash 1_encrypt_command.sh

This encryption step writes an encrypted copy of the keyring to /data/lochness_root/.lochness.enc. To protect the sensitive keyring information, remove lochness.json after running the encryption.

You can still extract the keyring structure, without the sensitive information, by running
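
One way to avoid deleting the plain-text keyring too early is to remove it only once the encrypted copy exists. The sketch below is an illustration under that assumption; `remove_plain_keyring` is a hypothetical helper, not a lochness command, and it does not verify that the encrypted file can actually be decrypted.

```python
from pathlib import Path


def remove_plain_keyring(root):
    """Delete lochness.json only if a non-empty .lochness.enc sits next to it.

    Returns True when the plain-text file was removed, False otherwise.
    """
    root = Path(root)
    encrypted = root / ".lochness.enc"
    plain = root / "lochness.json"
    if not plain.is_file():
        return False
    # Refuse to delete when the encrypted copy is absent or empty.
    if not encrypted.is_file() or encrypted.stat().st_size == 0:
        return False
    plain.unlink()
    return True
```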

lochness_check_config.py -ke /data/lochness_root/.lochness.enc
  4. Set up the REDCap Data Entry Trigger if using REDCap. Please see the "REDCap Data Entry Trigger capture" section below.

  5. Edit the personally identifiable information (PII) mapping table. Please see the "Personally identifiable information removal from REDCap and RPMS data" section below.

/data/lochness_root/pii_convert.csv

  6. Run sync.py, or use the example command in 2_sync_command.sh

bash 2_sync_command.sh

To use lochness_to_lochness transfer through AWS S3

  1. Set up an S3 bucket
  2. Install the AWS CLI
  3. Configure the CLI with your S3 bucket information

aws configure

  4. Add your AWS information to config.yml
AWS_BUCKET_NAME: ampscz-dev
AWS_BUCKET_ROOT: TEST_PHOENIX_ROOT
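
As a rough illustration of how these two keys might combine, the sketch below builds an s3:// destination from them. That lochness composes the bucket name and root in exactly this way is an assumption, and `s3_destination` is a hypothetical helper, not part of lochness.

```python
def s3_destination(config):
    """Build an s3:// URI from the two config.yml keys shown above.

    `config` is a dict holding the parsed config.yml values. The
    bucket/root layout of the URI is an assumption for illustration.
    """
    keys = ("AWS_BUCKET_NAME", "AWS_BUCKET_ROOT")
    missing = [key for key in keys if not config.get(key)]
    if missing:
        raise KeyError(f"missing config keys: {', '.join(missing)}")
    bucket = config["AWS_BUCKET_NAME"].strip("/")
    prefix = config["AWS_BUCKET_ROOT"].strip("/")
    return f"s3://{bucket}/{prefix}"
```

Failing early on a missing key makes a misconfigured config.yml visible before any transfer is attempted.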

REDCap Data Entry Trigger capture

If your sources include REDCap and you would like lochness to pull only new REDCap data, the "Data Entry Trigger" needs to be set up in REDCap.

In REDCap,

  • "Project Setup"
  • "Enable optional modules and customizations"
  • "Additional customizations"
  • Check "Data Entry Trigger" and enter the address of the server running lochness, including the port number, e.g. http://pnl-t55-7.partners.org:9999

To use this functionality, the server where lochness is installed must be able to receive HTTP POST requests from the REDCap server. This means that either

  • the lochness server is inside the same firewall as the REDCap server, or
  • the lochness server has an open port that can listen for the REDCap POST requests.
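
A quick way to test the second condition is to try a TCP connection to the listening port. The sketch below is a generic check using the Python standard library; `port_is_reachable` is a hypothetical helper, and it only reports reachability from the machine it runs on, so it should be run from a host whose network position resembles that of the REDCap server.

```python
import socket


def port_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```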

After enabling the "Data Entry Trigger" in the REDCap settings, run the command below to update /data/data_entry_trigger_db.csv in real time

# please specify the same port defined in the REDCap settings
listen_to_redcap.py --database_csv /data/data_entry_trigger_db.csv \
                    --port 9999

It is useful to run listen_to_redcap.py in the background, for example inside a GNU screen session, so that it keeps running without interruption.
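
To make the listener's role concrete, here is a minimal sketch of a process that receives Data Entry Trigger POSTs and appends them to a CSV file. It is not the code of listen_to_redcap.py: the column layout and all function names are assumptions for illustration. It relies only on REDCap sending a form-encoded body with fields such as project_id, record, instrument, and redcap_event_name.

```python
import csv
import time
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path
from urllib.parse import parse_qs

# Columns kept in the CSV; this layout is an assumption for illustration.
COLUMNS = ["timestamp", "project_id", "record", "instrument", "redcap_event_name"]


def parse_trigger(body):
    """Turn a form-encoded Data Entry Trigger POST body into one CSV row."""
    fields = parse_qs(body.decode("utf-8"))
    row = {name: fields.get(name, [""])[0] for name in COLUMNS[1:]}
    row["timestamp"] = str(int(time.time()))
    return row


def append_row(csv_path, row):
    """Append a row, writing the header first if the file is new or empty."""
    path = Path(csv_path)
    is_new = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)


def make_handler(csv_path):
    """Build a request handler class bound to one CSV file."""

    class TriggerHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            append_row(csv_path, parse_trigger(self.rfile.read(length)))
            self.send_response(200)
            self.end_headers()

    return TriggerHandler


def serve(csv_path, port):
    """Block forever, recording each trigger that arrives on the given port."""
    HTTPServer(("", port), make_handler(csv_path)).serve_forever()
```

Calling `serve("/data/data_entry_trigger_db.csv", 9999)` would start this sketch on the port used in the example above; it blocks until interrupted, which is why running it inside a screen session is convenient.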

Personally identifiable information removal from REDCap and RPMS data

The path of a CSV file can be provided that describes how to process each PII field.

For example

/data/personally_identifiable_process_mappings.csv

pii_label_string | process
-----------------|---------------
address          | remove
date             | change_date
phone_number     | random_number
patient_name     | random_string
subject_name     | replace_with_subject_id

For any field whose name matches a pii_label_string row, the process given in that row is applied to the raw values to remove or replace the PII.
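
The mapping table above can be read as a small dispatch from label to process. The sketch below illustrates that idea and is not lochness's implementation: substring matching of field names, the fixed 30-day date shift, the `%Y-%m-%d` date format, and the function names are all assumptions made for the example.

```python
import random
import string
from datetime import datetime, timedelta


def process_value(value, process, subject_id, rng):
    """Apply one PII process to a raw value and return the replacement."""
    if process == "remove":
        return ""
    if process == "change_date":
        # Shift by a fixed offset; the offset and date format are assumptions.
        shifted = datetime.strptime(value, "%Y-%m-%d") - timedelta(days=30)
        return shifted.strftime("%Y-%m-%d")
    if process == "random_number":
        # Replace each digit, keep separators such as "-" in place.
        return "".join(rng.choice(string.digits) if c.isdigit() else c for c in value)
    if process == "random_string":
        return "".join(rng.choice(string.ascii_lowercase) for _ in value)
    if process == "replace_with_subject_id":
        return subject_id
    raise ValueError(f"unknown process: {process}")


def scrub_record(record, mappings, subject_id, seed=0):
    """Return a copy of record with PII fields processed.

    A field counts as PII when its name contains a pii_label_string;
    substring matching is an assumption about how names are compared.
    """
    rng = random.Random(seed)
    cleaned = dict(record)
    for field, value in record.items():
        for label, process in mappings.items():
            if label in field and value:
                cleaned[field] = process_value(value, process, subject_id, rng)
                break  # first matching label wins
    return cleaned
```

With the example table loaded as a dict such as `{"address": "remove", "subject_name": "replace_with_subject_id"}`, a field named `home_address` would be emptied and `subject_name` would be replaced by the subject ID, while fields that match no label are left untouched.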

Documentation

You can find all the documentation you will ever need here