This repository showcases a simple containerized workflow to semantize synthetic patient data using the SPHN framework. It is designed to run on biomedical systems with the following restrictions:

- No internet access besides a private container registry
- Only `podman` and `nextflow` available
- No root access
- Data provenance information kept separate from the git repository
The project is structured to run individual tasks in their own podman container. Nextflow is used as the workflow manager and event-based workflow executor.
The workflow processes simulated patient data from Synthea in JSON format and generates an RDF graph describing patient healthcare appointments (patient, dates, and institution). It then validates the resulting graph. In addition to Nextflow's native logging capabilities, the workflow produces an interoperable semantic log in JSON-LD format for traceability.
The data is semantized using the SPHN ontology. Mapping rules are defined in the human-readable YARRRML format (see `data/mappings.yml`). The triples are materialized using containerized tools from rml.io. Graph validation is done using pySHACL with the SPHN SHACL shapes. An interoperable log is defined in `conf/log.conf` and is obtained by formatting Nextflow's log in JSON-LD.
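To give a feel for YARRRML, the following is a minimal, hypothetical mapping sketch. The prefixes, source file, JSONPath expression, and predicates below are illustrative only and are not taken from `data/mappings.yml`:

```yaml
# Hypothetical YARRRML sketch -- see data/mappings.yml for the actual rules.
prefixes:
  ex: "https://example.org/"
mappings:
  appointment:
    sources:
      - ["patients.json~jsonpath", "$.entry[*]"]   # iterate over JSON records
    s: ex:appointment/$(id)                        # subject IRI per record
    po:
      - [ex:hasDate, $(date)]                      # appointment date
      - [ex:atInstitution, $(institution)]         # healthcare institution
```

An RML mapper from rml.io consumes rules like these (after conversion from YARRRML to RML) and emits the corresponding triples.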
The workflow definition can be found in `main.nf` and its configuration in `nextflow.config`.
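As an illustration of the per-task container pattern, a Nextflow process can pin its own image, which Nextflow then runs via podman. The process below is a hedged sketch: the image name and script are hypothetical, not the repository's actual `main.nf`:

```groovy
// Hypothetical sketch: each process declares its own container,
// executed with podman when podman.enabled = true in the config.
process validate_shacl {
    container 'registry.example.org/pyshacl:latest' // hypothetical image

    input:
    path graph

    output:
    path 'report.ttl'

    script:
    """
    pyshacl -s shapes.ttl -f turtle ${graph} > report.ttl || true
    """
}
```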
```mermaid
flowchart TD
    p0(( ))
    p1[cat_patients_json]
    p2(( ))
    p3[convert_mappings]
    p4[generate_triples]
    p5(( ))
    p6(( ))
    p7[validate_shacl]
    p8(( ))
    p9[gzip_triples]
    p10(( ))
    p0 -->|json_dir| p1
    p1 --> p4
    p2 -->|yml_mappings| p3
    p3 --> p4
    p4 -->|graph| p7
    p5 -->|ontology| p7
    p6 -->|shapes| p7
    p7 --> p8
    p4 -->|graph| p9
    p9 --> p10
```
First, clone the repository and move into the folder:

```shell
git clone https://github.com/SDSC-ORD/demo_biomedit_workflow.git && cd demo_biomedit_workflow
```
To interact with the workflow for development or production, we use different Nextflow profiles:

- `nextflow run -profile standard main.nf`: Run the workflow using the workflow file in the current directory and publicly available containers defined in `conf/containers.yaml`. This is the default profile, and the `-profile` option can therefore be omitted.
- `nextflow run -profile prod main.nf`: Run the containerized workflow using the latest commit on the repository remote and containers from the private registry.
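Profiles like these are typically declared in `nextflow.config`. The following is a hedged sketch, not the repository's actual configuration; the registry URL is hypothetical:

```groovy
// Hypothetical sketch of profile declarations in nextflow.config.
profiles {
    standard {
        podman.enabled = true
    }
    prod {
        podman.enabled  = true
        podman.registry = 'registry.example.org'  // hypothetical private registry
    }
}
```

Nextflow's `podman.registry` setting prepends the given registry to unqualified image names, which is what lets the same workflow pull from a private registry in the `prod` profile.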
> **Tip**
>
> A helper script (`scripts/migrate_images.sh`) is provided to automatically migrate images to a custom registry/namespace. It can read container declarations in a Nextflow config file and handle the pull/tag/push of images.
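The pull/tag/push pattern such a script automates can be sketched as below. This is a minimal illustration: the source image and target registry are hypothetical, and the podman commands are shown commented out since they require registry access:

```shell
# Hypothetical sketch of migrating one image to a private registry/namespace.
src="docker.io/rmlio/yarrrml-parser:latest"      # hypothetical source image
target_registry="registry.example.org/myproject" # hypothetical target namespace
# Keep the image name and tag, swap the registry/namespace prefix:
dst="${target_registry}/${src##*/}"
echo "$dst"
# In a real migration the script would then run:
#   podman pull "$src"
#   podman tag "$src" "$dst"
#   podman push "$dst"
```

Looping this over every `container` declaration found in the config file yields the full migration.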
By default, the workflow is executed on each zip file present in the input directory. When the option `--listen=true` is provided, the workflow manager will instead listen continuously for filesystem events and trigger execution whenever a new zip file appears in the input directory. In this mode, a log is only generated when the Nextflow execution is manually interrupted, meaning that the exit code in the log will always be 1.
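Conceptually, such a listen mode maps onto Nextflow's `watchPath` channel factory, which emits items on filesystem events instead of reading the directory once. A hedged sketch (the parameter name and glob are illustrative, not the repository's actual code):

```groovy
// Hypothetical sketch: in listen mode, react to new zip files as they appear;
// otherwise, process the zip files already present in the input directory.
zips = params.listen
    ? Channel.watchPath('data/raw/*.zip', 'create')
    : Channel.fromPath('data/raw/*.zip')
```

Because `watchPath` never closes its channel on its own, a workflow built this way runs until interrupted, which is consistent with the exit code 1 noted above.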
This repository includes a small set of example data in `data/raw/test_patients.zip`. If you'd like to use a larger dataset as input, we include `scripts/download_data.sh` to retrieve a synthetic dataset from Synthea and place it directly in the input folder `data/raw/`.
The `data/out` directory is structured as follows:

```
data/out
├── logs
│   └── 2024-04-24T11:32:52.980140+02:00_infallible_kilby_logs.json
├── reports
│   ├── test_patients_1_report.ttl
│   └── test_patients_2_report.ttl
└── triples
    ├── test_patients_1.nt.gz
    └── test_patients_2.nt.gz
```
Here, `triples` contains the semantized data for each input archive, and `reports` contains the SHACL validation report, indicating any violations of the schema constraints. For each workflow run, a time-stamped log file with a unique name is also saved in `logs`.
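The compressed N-Triples outputs can be inspected as a stream without unpacking them to disk. The snippet below uses a tiny synthetic file as a stand-in for an archive under `data/out/triples/` (the triple content is made up for illustration):

```shell
# Create a one-triple stand-in for an output archive (illustrative content):
printf '<urn:patient1> <urn:hasDate> "2024-04-24" .\n' > sample.nt
gzip -f sample.nt                  # produces sample.nt.gz
# Peek at the statements and count the triples directly from the archive:
gunzip -c sample.nt.gz | head -n 5
gunzip -c sample.nt.gz | wc -l
```

Since N-Triples is line-oriented, `wc -l` on the decompressed stream gives the triple count.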
The code in this repository is licensed under GPLv3. The SPHN ontology and shapes files included in this repository are redistributed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license. The SPHN ontology can be explored on the BioMedIT website, and the shapes and ontology files were retrieved from the SHACLer repository.