- Overview
- Prerequisites
- Deployment Steps
- Deployment Validation
- Next Steps
- Cleanup
- FAQ, known issues, additional considerations, and limitations
- Revisions
- Notices
- Authors
This Guidance uses the AWS Entity Resolution service to perform entity resolution for patient resources stored in an AWS HealthLake (AHL) data store. Entity resolution within AHL is a foundational element for ensuring data integrity across the data store. An "entity" in this context can denote a single patient, provider, organization, or healthcare facility. Entity resolution is the process of determining whether multiple records within AHL pertain to the same real-world object, such as a patient or provider. For example, our healthcare customers have told us that they are challenged by matching patients across data sources that originate from multiple internal systems or even multiple organizations (e.g., patient records coming from HL7 ADT data, C-CDA documents, and PDMP pharmacy data).
Using the AWS Entity Resolution service, this Guidance addresses the challenge by employing a machine learning-based matching algorithm to identify and link disparate patient records. This enhances AHL's ability to establish comprehensive patient profiles with confidence scores, and supports accurate and cohesive healthcare data management. This process is one of the required steps in the broader processes known as Master Data Management (MDM) or Enterprise Master Patient Index (EMPI).
The following diagram describes the architecture of this patient entity resolution solution, which leverages AWS native services that align with the AWS Well-Architected Framework, ensuring robust architecture across key dimensions, such as security, reliability, performance efficiency, and cost optimization.
This solution includes the following high-level steps and AWS native services:
- Fetch patient identifier information out of the AHL data store using Amazon Athena SQL query.
The Amazon Athena query runs against the AWS Lake Formation resource link database, which is automatically created inside a HealthLake service account. The query result dataset is saved in an S3 bucket as a CSV file. The identifier attributes of the patient resources used in the query can include name, address, phone number, date of birth, and gender.
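As a rough illustration of this step, the sketch below builds a patient identifier query and submits it with the Athena `start_query_execution` API. The column names and nesting (`name[1].family`, `telecom[1].value`, and so on) are assumptions about how HealthLake exposes the `patient` table, not the query shipped with this Guidance, so check them against your own data store. The `athena` argument is expected to be a `boto3.client("athena")`.

```python
import re

# Assumed column layout of the HealthLake "patient" table in Athena.
# Athena arrays are 1-indexed; a subscript on an empty array raises an error,
# so real data may need extra guards.
PATIENT_QUERY_TEMPLATE = """
SELECT
    id,
    name[1].family        AS last_name,
    name[1].given[1]      AS first_name,
    birthdate,
    gender,
    address[1].line[1]    AS street,
    address[1].city       AS city,
    address[1].state      AS state,
    address[1].postalcode AS postal_code,
    telecom[1].value      AS phone
FROM "{database}"."patient"
"""

_SAFE_NAME = re.compile(r"^[A-Za-z0-9_\-]+$")


def build_patient_query(database: str) -> str:
    """Return the SQL text for the given resource link database name."""
    if not _SAFE_NAME.match(database):
        raise ValueError(f"unexpected database name: {database!r}")
    return PATIENT_QUERY_TEMPLATE.format(database=database).strip()


def start_patient_query(athena, database: str, output_location: str) -> str:
    """Submit the query; Athena writes the CSV result under output_location."""
    response = athena.start_query_execution(
        QueryString=build_patient_query(database),
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


# Example call (hypothetical names):
#   import boto3
#   start_patient_query(boto3.client("athena"), "my_resource_link_db",
#                       "s3://my-results-bucket/patients/")
```

Validating the database name before formatting it into the SQL text keeps the helper from being used to inject arbitrary SQL.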
- Present the patient dataset to AWS Entity Resolution.
Once the patient dataset has been created in the previous step, we will use an AWS Glue Crawler to crawl the dataset and populate an AWS Glue Data Catalog table. This table will then be ready for ingestion into the AWS Entity Resolution service.
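The crawler in this Guidance is presumably created by the CloudFormation template. For readers who want to see what the step amounts to, here is a minimal sketch of the equivalent AWS Glue calls (`create_crawler` and `start_crawler`). The crawler name, role ARN, database, and S3 path are placeholders, and `glue` is expected to be a `boto3.client("glue")`.

```python
def crawler_request(name: str, role_arn: str, database: str, s3_path: str) -> dict:
    """Build the keyword arguments for glue.create_crawler."""
    if not s3_path.startswith("s3://"):
        raise ValueError("s3_path must be an s3:// URI")
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def crawl_patient_dataset(glue, request: dict) -> None:
    """Create the crawler, then start it so it populates the Data Catalog table."""
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# Example call (hypothetical names):
#   import boto3
#   req = crawler_request("patient-crawler",
#                         "arn:aws:iam::123456789012:role/GlueCrawlerRole",
#                         "patient_er_db", "s3://my-results-bucket/patients/")
#   crawl_patient_dataset(boto3.client("glue"), req)
```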
- Generate ML-driven matches with AWS Entity Resolution.
An Entity Resolution schema mapping and a matching workflow have been created in this solution to define how to match the input patient data and where to write the match results. By default, this solution uses the pre-configured machine learning-based matching technique to find matches across the input patient dataset. A Lambda function triggers a job of the matching workflow and writes the results, with Entity Resolution match ID and confidence level, to another S3 bucket. You can also use the rule-based matching technique in the matching workflow to define your own matching rules and find exact matches that meet your entity resolution requirements.
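To make the schema mapping and job trigger more concrete, the sketch below shows one possible set of `mappedInputFields` for the `create_schema_mapping` API and a helper that starts a matching job with `start_matching_job`. The field names mirror a hypothetical patient CSV, and the grouping and match keys are illustrative assumptions rather than the mapping deployed by this Guidance. The `entityresolution` argument is expected to be a `boto3.client("entityresolution")`.

```python
def patient_schema_fields() -> list:
    """Illustrative mappedInputFields for a patient CSV (field names are assumed)."""
    return [
        {"fieldName": "id", "type": "UNIQUE_ID"},
        {"fieldName": "first_name", "type": "NAME_FIRST",
         "groupName": "name", "matchKey": "name"},
        {"fieldName": "last_name", "type": "NAME_LAST",
         "groupName": "name", "matchKey": "name"},
        {"fieldName": "street", "type": "ADDRESS_STREET1",
         "groupName": "address", "matchKey": "address"},
        {"fieldName": "city", "type": "ADDRESS_CITY",
         "groupName": "address", "matchKey": "address"},
        {"fieldName": "phone", "type": "PHONE_NUMBER", "matchKey": "phone"},
        {"fieldName": "birthdate", "type": "DATE", "matchKey": "birthdate"},
    ]


def start_matching(entityresolution, workflow_name: str) -> str:
    """Start one run of an existing matching workflow and return its job ID."""
    response = entityresolution.start_matching_job(workflowName=workflow_name)
    return response["jobId"]


# Example call (hypothetical workflow name):
#   import boto3
#   start_matching(boto3.client("entityresolution"), "patient-matching-workflow")
```

Each schema mapping needs exactly one `UNIQUE_ID` field; here it is the patient resource ID so that match results can be traced back to the source record.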
- Insert Entity Resolution match-IDs into the AHL patient resources.
Once AWS Entity Resolution has identified matching patient records, the solution will use a Lambda function to read and parse the Entity Resolution results, and insert the match-IDs with a pre-defined high confidence level back into the patient resources as new identifier attributes. This will allow you to easily identify and link matching patient records across your AHL data store.
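The write-back logic can be sketched roughly as follows: read the match output, keep rows at or above a confidence threshold, and add the match ID to the FHIR Patient resource as an identifier. This is not the Lambda code from this repo. The CSV column names (`RecordId`, `MatchID`, `ConfidenceLevel`), the `0.9` threshold, and the identifier `system` URI are all assumptions for illustration.

```python
import copy
import csv
import io

# Hypothetical identifier system for match IDs; choose your own URI.
MATCH_ID_SYSTEM = "urn:example:entity-resolution:match-id"


def high_confidence_matches(csv_text: str, threshold: float = 0.9) -> dict:
    """Map patient record ID to match ID for rows at or above the threshold."""
    matches = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            confidence = float(row.get("ConfidenceLevel") or "")
        except ValueError:
            continue  # skip rows without a usable confidence value
        if confidence >= threshold:
            matches[row["RecordId"]] = row["MatchID"]
    return matches


def with_match_identifier(patient: dict, match_id: str) -> dict:
    """Return a copy of the Patient resource carrying the match ID identifier."""
    updated = copy.deepcopy(patient)
    # Drop any earlier match-ID identifier so repeated runs stay idempotent.
    others = [i for i in updated.get("identifier", [])
              if i.get("system") != MATCH_ID_SYSTEM]
    updated["identifier"] = others + [{"system": MATCH_ID_SYSTEM, "value": match_id}]
    return updated
```

Replacing an existing identifier with the same `system`, instead of appending a new one on every run, keeps a patient from accumulating duplicate match-ID identifiers.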
You are responsible for the cost of the AWS services used while running this Guidance. As of May 2024, the cost for running this Guidance with the default settings in the US East (N. Virginia) Region is approximately $6.73 per month for processing 2,000 records.
We recommend creating a Budget through AWS Cost Explorer to help manage costs. Prices are subject to change. For full details, refer to the pricing webpage for each AWS service used in this Guidance.
The following table provides a sample cost breakdown for deploying this Guidance with the default parameters in the US East (N. Virginia) Region for one month.
AWS service | Dimensions | Cost [USD] |
---|---|---|
AWS Step Functions | 4,000 state transitions per month | $ 0.00 |
AWS Lambda | One million free requests per month and 400,000 GB-seconds of compute time per month | $ 0.00 |
Amazon Athena | $5.00 per TB of data scanned | $ 7.35 |
AWS Lake Formation | $1.00 per 100,000 S3 objects, per month | $ 0.00 |
AWS Glue | Free for the first million objects stored | $ 0.07 |
AWS Entity Resolution | $0.25 per 1,000 records processed | $ 0.50 |
Amazon S3 | $0.023 per GB for the first 50TB, per month | $ 0.23 |
To implement this Guidance, you can deploy this AWS CloudFormation template. Before deploying this template, you will need the following information to use as input parameters to the template:
- The data store ID of the HealthLake data store that you want to use for patient entity resolution.
- The database name and the shared resource owner ID (or catalog ID) of the AWS Lake Formation database that is linked to the AWS HealthLake data store.
This Guidance is available in the US East (N. Virginia), US West (Oregon), US East (Ohio), and Asia Pacific (Mumbai) Regions.
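If you prefer to deploy from a script instead of the console, the input parameters described above can be passed to the CloudFormation `create_stack` API. The sketch below is only an outline: the parameter keys (`HealthLakeDatastoreId`, `LakeFormationDatabaseName`, `LakeFormationCatalogId`) are hypothetical, so replace them with the keys declared in the template. It also assumes the catalog ID is a 12-digit AWS account ID. The `cloudformation` argument is expected to be a `boto3.client("cloudformation")`.

```python
def stack_parameters(datastore_id: str, database_name: str, catalog_id: str) -> list:
    """Build the Parameters list; the keys are hypothetical placeholders."""
    if not (catalog_id.isdigit() and len(catalog_id) == 12):
        raise ValueError("catalog_id is assumed to be a 12-digit AWS account ID")
    return [
        {"ParameterKey": "HealthLakeDatastoreId", "ParameterValue": datastore_id},
        {"ParameterKey": "LakeFormationDatabaseName", "ParameterValue": database_name},
        {"ParameterKey": "LakeFormationCatalogId", "ParameterValue": catalog_id},
    ]


def deploy(cloudformation, stack_name: str, template_url: str, parameters: list) -> str:
    """Create the stack and return its ID."""
    response = cloudformation.create_stack(
        StackName=stack_name,
        TemplateURL=template_url,
        Parameters=parameters,
        # Needed if the template creates named IAM resources.
        Capabilities=["CAPABILITY_NAMED_IAM"],
    )
    return response["StackId"]
```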
The output of this template includes an AWS Step Functions state machine, such as `ahl-entity-resolution-state-machine`. You can execute this state machine on demand to run the solution and perform patient entity resolution for your AHL data store. This template also creates an Amazon EventBridge Scheduler schedule to automatically trigger the state machine regularly, such as every night at 10 o'clock. You can modify the schedule to run the solution based on your business needs.
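Running the state machine on demand can also be scripted. The following is a minimal sketch that builds the state machine ARN and calls the Step Functions `start_execution` API. The empty input payload is an assumption, since the input expected by the deployed state machine is not described here, and the cron expression is one way to express a nightly 10 PM schedule in EventBridge Scheduler (UTC unless a time zone is set on the schedule). The `stepfunctions` argument is expected to be a `boto3.client("stepfunctions")`.

```python
import json
from typing import Optional

# EventBridge cron fields: minutes hours day-of-month month day-of-week year.
NIGHTLY_AT_22 = "cron(0 22 * * ? *)"


def state_machine_arn(region: str, account_id: str,
                      name: str = "ahl-entity-resolution-state-machine") -> str:
    """Build the ARN of a Step Functions state machine."""
    return f"arn:aws:states:{region}:{account_id}:stateMachine:{name}"


def run_once(stepfunctions, arn: str, payload: Optional[dict] = None) -> str:
    """Start one execution and return the execution ARN."""
    response = stepfunctions.start_execution(
        stateMachineArn=arn,
        input=json.dumps(payload or {}),  # assumed: no specific input required
    )
    return response["executionArn"]


# Example call (hypothetical account):
#   import boto3
#   arn = state_machine_arn("us-east-1", "123456789012")
#   run_once(boto3.client("stepfunctions"), arn)
```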
We also include the Python source code for all the Lambda functions orchestrated by the state machine in the lambda folder of this repo.
To validate the matched patient records identified by this solution, you can either:
- Go to the Amazon CloudWatch log group linked to the Step Functions state machine. The log group contains detailed information about each execution, including the input and output of each step.
- Go to the Execution page of the state machine and check the Output of the last step. The last step of the state machine generates the match results, which include the matched patient resource IDs (as `source_id`) and the `match_id` returned by the Entity Resolution service.
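When reviewing the output, it can help to group the records by `match_id` so that each group lists the patient resources that were resolved to the same entity. The sketch below assumes the output has already been parsed into a list of dictionaries with `source_id` and `match_id` keys. The exact JSON shape of the state machine output is not shown here, so adapt the parsing accordingly.

```python
from collections import defaultdict


def group_by_match_id(records: list) -> dict:
    """Return match_id -> sorted source_ids, only for groups with 2+ patients."""
    groups = defaultdict(list)
    for record in records:
        groups[record["match_id"]].append(record["source_id"])
    return {match_id: sorted(ids) for match_id, ids in groups.items() if len(ids) > 1}


# Example with made-up IDs:
example = [
    {"source_id": "patient-b", "match_id": "m-001"},
    {"source_id": "patient-a", "match_id": "m-001"},
    {"source_id": "patient-c", "match_id": "m-002"},
]
print(group_by_match_id(example))
```

In this example only `m-001` is reported, because `m-002` has a single patient record and therefore no duplicate to link.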
Once you have identified the patient resource IDs from the AWS Entity Resolution matching output, you can query those patient resources in the AHL data store by resource ID. Each matched patient will have a new identifier attribute created from AWS Entity Resolution, with the `match_id` as the identifier value.
The match ID returned from AWS Entity Resolution will remain the same for a source patient record across multiple workflow runs, unless you change the matching workflow configuration or the patient record is significantly updated.
This Guidance provides a baseline for your patient entity resolution solution in HealthLake. It is a flexible and extensible framework that you can use to build your own applications and workloads on top of it. You can enhance or modify the solution to meet your specific healthcare entity resolution requirements.
Please ensure that you clean up resources so that your account is not charged after the workshop. If you are using an AWS Workshop Studio account, the account will be terminated and cleanup will occur automatically. If you are building this in your own AWS account, please follow these instructions to terminate and clean up resources:
- S3 bucket - Empty and delete the S3 bucket that you created to upload the sample patient file.
- HealthLake Data Store - Delete the HealthLake data store that you created for this workshop. The steps to delete data store can be found here
- CloudFormation stack - Delete the CloudFormation stack created for this workshop. The steps to delete the stack can be found here
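The cleanup steps above can be sketched with boto3 as follows. The function takes its AWS handles as arguments: `bucket` is expected to be a `boto3.resource("s3").Bucket(...)`, `healthlake` a `boto3.client("healthlake")`, and `cloudformation` a `boto3.client("cloudformation")`. The bucket name, data store ID, and stack name are placeholders you supply. These calls are destructive, so treat this as a sketch to adapt, not a script to run unchanged.

```python
def clean_up(bucket, healthlake, cloudformation,
             datastore_id: str, stack_name: str) -> None:
    """Delete the sample bucket, the HealthLake data store, and the stack."""
    # Remove every object (and every version, if versioning is enabled),
    # because a bucket must be empty before it can be deleted.
    bucket.object_versions.delete()
    bucket.delete()

    # Deletion of the data store is asynchronous and may take a while.
    healthlake.delete_fhir_datastore(DatastoreId=datastore_id)

    # Stack deletion is also asynchronous.
    cloudformation.delete_stack(StackName=stack_name)


# Example call (hypothetical names):
#   import boto3
#   clean_up(boto3.resource("s3").Bucket("my-sample-patient-bucket"),
#            boto3.client("healthlake"), boto3.client("cloudformation"),
#            "my-datastore-id", "my-entity-resolution-stack")
```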
The match IDs returned from AWS Entity Resolution are for correlating internal patient records within HealthLake data stores and should not be used as identifiers outside HealthLake in downstream or external systems.
Contributions are more than welcome. Please read the code of conduct and the contributing guidelines.