Thank you for visiting the public repository for the Patient Outcome Prediction (POP) Project. This version of the application depends on an outdated dependency. We are archiving this repository while we update our dependencies and add features; an updated version will be re-uploaded at a future date.
Thank you for your understanding and patience.
Welcome to POP, the Patient Outcome Prediction Python project.
Healthcare data can be challenging to work with, and AWS customers have been looking for ways to address business challenges using data and machine learning (ML) techniques. We published an AWS Machine Learning Blog post that shows AWS customers how to build patient outcome prediction applications using Amazon HealthLake and Amazon SageMaker. This repository contains the open-source project described in that post.
In this repo, we use a synthetic dataset from Synthea due to data license considerations; however, you may use your own FHIR datasets following our modeling structure. Synthea is an open-source synthetic patient generator that produces FHIR R4-compliant resource bundles, which HealthLake can preload so that users can test models without using actual patient data.
This project has two major components: a backend (assets/model_backend/) and a frontend.
The two components are decoupled and run asynchronously: the backend models update the prediction results on demand and save them as files on S3, while the frontend periodically fetches new files and renders them accordingly. No dependencies are assumed between the two components.
The backend of this solution uses Amazon HealthLake (generally available since 07/15/2021), which currently requires setup through the AWS CLI or the Python API. CDK setup steps will be added once the service is officially supported by the CDK.
In the AWS CLI, run the following command to create a data store and preload a dataset from Synthea. HealthLake takes roughly 30 minutes to create the data store.
aws healthlake create-fhir-datastore --region us-east-1 --datastore-type-version R4 --preload-data-config PreloadDataType="SYNTHEA" --datastore-name pop-synthea
After the command executes, keep a record of the data store ID from the response. We will refer to this ID as `<your-data-store-id>`.
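The create call returns a JSON response containing the data store ID. A minimal sketch of pulling the ID out of that response in Python (the field names follow the HealthLake `CreateFHIRDatastore` API; the sample values below are illustrative, not real identifiers):

```python
import json

# Illustrative response from `aws healthlake create-fhir-datastore`;
# real responses contain your account's actual identifiers.
response_text = """
{
  "DatastoreId": "1f2f459836ac6c513ce11aad8abb",
  "DatastoreArn": "arn:aws:healthlake:us-east-1:123456789012:datastore/fhir/1f2f459836ac6c513ce11aad8abb",
  "DatastoreStatus": "CREATING",
  "DatastoreEndpoint": "https://healthlake.us-east-1.amazonaws.com/datastore/1f2f459836ac6c513ce11aad8abb/r4/"
}
"""

response = json.loads(response_text)
datastore_id = response["DatastoreId"]  # record this for later steps
print(datastore_id)
```

The status starts as `CREATING`; the data store is ready to query once it reaches `ACTIVE`.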
This step is optional. You can run a sample query in the console or via the CLI to get a sense of the data loaded into your data store.
GET /datastore/<your-data-store-id>/r4/Procedure?code=371908008 HTTP/1.1
Host: healthlake.us-east-1.amazonaws.com
Content-Type: application/json
Authorization: AWS4-HMAC-SHA256 Credential=<your-credential>
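The search URL in the request above follows a fixed shape: host, data store path, FHIR resource type, and query parameters. A small sketch of composing it in Python (`fhir_search_url` is a hypothetical helper, not part of this project; real requests must also carry a SigV4 `Authorization` header as shown above):

```python
from urllib.parse import urlencode

def fhir_search_url(endpoint_host, datastore_id, resource_type, **params):
    """Build a HealthLake FHIR R4 search URL matching the GET request above."""
    base = f"https://{endpoint_host}/datastore/{datastore_id}/r4/{resource_type}"
    return f"{base}?{urlencode(params)}" if params else base

url = fhir_search_url(
    "healthlake.us-east-1.amazonaws.com",
    "<your-data-store-id>",  # replace with the ID recorded earlier
    "Procedure",
    code="371908008",
)
print(url)
```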
Before exporting, create an S3 bucket to receive the data. As a HIPAA-eligible service, HealthLake requires data encryption at rest and in transit, so enable server-side encryption with Amazon S3-managed keys (SSE-S3) when creating this bucket.
aws healthlake start-fhir-export-job \
--output-data-config S3Uri="s3://<your-data-export-bucket-name>" \
--datastore-id <your-data-store-id> \
--data-access-role-arn arn:aws:iam::<your-account-id>:role/<your-role>
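The export job writes NDJSON files (one FHIR resource per line) to the S3 bucket. Once you download a file, you can inspect it with a few lines of Python; `count_resource_types` is a hypothetical helper and the two-line sample below stands in for a real export file:

```python
import json
from collections import Counter

# A tiny NDJSON sample standing in for a downloaded export file.
ndjson_sample = (
    '{"resourceType": "Patient", "id": "p1"}\n'
    '{"resourceType": "Procedure", "id": "pr1", "code": {"coding": [{"code": "371908008"}]}}\n'
)

def count_resource_types(ndjson_text):
    """Count FHIR resources by resourceType in an NDJSON export."""
    counts = Counter()
    for line in ndjson_text.splitlines():
        if line.strip():
            counts[json.loads(line)["resourceType"]] += 1
    return counts

print(count_resource_types(ndjson_sample))
```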
First, manually create a virtualenv (on macOS and Linux). Once the virtualenv is activated, install the required dependencies.
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
At this point you can synthesize the CloudFormation template.
cdk synth
You can now begin exploring the source code, contained in the assets directory.
There is also a trivial test included that checks whether all modules are included in the resulting CloudFormation template:
pytest
Now, you can deploy the stacks.
cdk deploy
You will need to create an AWS Glue crawler so that the NDJSON files exported from HealthLake can be queried easily.
Follow the SQL instructions we provided in the sql.SQL file in the notebook folder, and run them in Athena.
After the SQL queries have executed, you should see a table named pop_main.
Useful CDK commands:

* `cdk ls`: list all stacks in the app
* `cdk synth`: emit the synthesized CloudFormation template
* `cdk deploy`: deploy this stack to your default AWS account/region
* `cdk diff`: compare the deployed stack with the current state
* `cdk docs`: open the CDK documentation
Model training and inference code is in the assets/modeling/ folder.
The structure is as follows:
├── README.md <- The summary file of this project
├── data <- A temporary location to hold data files, not in CodeCommit
├── docs <- The Sphinx documentation folder
├── ml <- Source code location that contains all the machine learning modules
│ ├── config.py <- This file defines constant variables
│ ├── data.py <- Helper functions for reading from S3, loading the embedding matrix, converting DataFrames to tensors, etc.
│ ├── evaluation.py <- Evaluation functions used by model training
│ ├── evaluation_tf.py <- Evaluation functions specifically for TensorFlow framework
│ ├── requirements.txt <- Requirements for SageMaker hosted container
│ └── visualize.py <- Visualization function to analyze the data
├── models <- A temporary location to store the embedding model and trained ml models, not in CodeCommit
├── notebooks <- A folder containing notebook experiments
└── requirements.txt <- The project dependencies. Run `pip install -r requirements.txt` before you start
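As an illustration of the kind of DataFrame-to-tensor preprocessing the `ml/data.py` helpers perform, here is a minimal sketch of padding variable-length sequences of medical codes to a fixed length before they are fed to an embedding layer. `pad_sequences` is a hypothetical example written for this README, not the project's actual code:

```python
def pad_sequences(sequences, maxlen, pad_value=0):
    """Right-pad (or truncate) integer code sequences to a fixed length.

    A hypothetical sketch of the usual preprocessing step before feeding
    sequences of medical codes into an embedding layer; see ml/data.py
    for the project's actual helpers.
    """
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]          # truncate overly long visits
        padded.append(seq + [pad_value] * (maxlen - len(seq)))
    return padded

visits = [[5, 12, 7], [3], [9, 1, 4, 8, 2]]
print(pad_sequences(visits, maxlen=4))
# [[5, 12, 7, 0], [3, 0, 0, 0], [9, 1, 4, 8]]
```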