- Ability to load Bulk FHIR resources for analysis.
- Value set creator that works with SNOMED subsumption with inclusion, exclusion and attribute constraints.
- Data analysis tool which finds cohorts of patients with given conditions, procedures and/or prescriptions.
- Statistical testing to measure how a procedure or drug affects the chances of a subsequent disorder.
- Demo data generator which creates a large population of patients including disease associations and conditions.
SNOMED CT’s distinguishing feature is its logical framework. Unlike ICD, which is not based on Description Logic, SNOMED CT supports logical subsumption searches to find cohorts of patients, medications, or conditions.
Every kind of quality measure, outcomes measurement or retrospective research involving cohorts of patients depends on defining lists of clinical terms sometimes called "value sets".
When these types of data analysis are done with ICD or other non-SNOMED terminologies, the creation of value sets requires human experts and is prone to errors. The humans creating the value sets must remember all the possible ways to refer to a condition, and ICD does not allow more than one parent, so Viral Pneumonia must be either a respiratory disease or an infectious disease but can’t be both.
Because of its logical structure, SNOMED lets you search hierarchies and run logical role-based searches. This allows you to create value sets without having to know in advance which synonyms are used to describe a term.
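The difference between a single-parent hierarchy and SNOMED's multi-parent hierarchy can be sketched with a toy example. This is a minimal illustration, not the tool's implementation: the concept labels and the `IS_A` table are made up for the example and are not real SNOMED CT identifiers, and `descendants_or_self` and `value_set` are hypothetical helper names.

```python
from collections import defaultdict

# Toy is-a relationships (child -> parents). Labels are illustrative
# stand-ins, not real SNOMED CT identifiers.
IS_A = {
    "Viral pneumonia": ["Infectious disease", "Respiratory disease"],
    "Bacterial pneumonia": ["Infectious disease", "Respiratory disease"],
    "Asthma": ["Respiratory disease"],
    "Influenza": ["Infectious disease"],
    "Infectious disease": ["Disease"],
    "Respiratory disease": ["Disease"],
}


def descendants_or_self(concept, is_a=IS_A):
    """Return the concept plus every concept it subsumes."""
    # Invert the child -> parents table into parent -> children.
    children = defaultdict(set)
    for child, parents in is_a.items():
        for parent in parents:
            children[parent].add(child)
    # Walk down the hierarchy from the starting concept.
    found, stack = {concept}, [concept]
    while stack:
        for child in children[stack.pop()]:
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def value_set(include, exclude=()):
    """Union of the included subtrees minus the excluded subtrees."""
    members = set()
    for concept in include:
        members |= descendants_or_self(concept)
    for concept in exclude:
        members -= descendants_or_self(concept)
    return members
```

Because "Viral pneumonia" has two parents, it is found by a search under either "Respiratory disease" or "Infectious disease", and an exclusion such as `value_set(["Respiratory disease"], exclude=["Infectious disease"])` removes it without anyone having to list it by hand.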
These data analysis tasks require three steps that we can define as a High Value Implementation of SNOMED.
First, an organization must be able to create value sets by using SNOMED subsumption searches. This task is independent of any patient data. If the patient data is linked to SNOMED directly, then the cohort of patients can be found without any further processing. But if the EMR uses a non-SNOMED interface terminology, then there must be an intermediate step mapping the SNOMED value set to the interface terms. For the purposes of our demo tool, we will use SNOMED terms directly linked to patient data.
These principles are difficult to explain to clinicians. To demonstrate these advantages to them, it is necessary to have a tool that is populated with SNOMED-coded patient data. With such a tool we can highlight the advantages of SNOMED compared to ICD (or any other terminology).
Real patient data (a large population linked to SNOMED) is practically impossible to get for a variety of reasons. Real data would contain expected associations between diseases, events, medications and conditions. For example, real data would show that fractured bones are more common after a motor vehicle accident, and that infections are more common after an immunosuppressive medication has been given. Real data would also show other known links between conditions: patients with COPD would have pneumonia and bronchitis more often than patients in general, and diabetics would have more foot ulcers and more peripheral neuropathy.
This demo tool is a real tool. If we were to get some real patient data and load it into the tool then we could explore for real associations.
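To make the idea of "expected associations" concrete, the sketch below generates a toy population in which pneumonia is more likely for patients with COPD, then measures that association as a relative risk. It is only an illustration under stated assumptions: the probabilities are invented, the function names `generate_population` and `relative_risk` are hypothetical, and this is not how the project's own generator or statistical tests are implemented.

```python
import random


def generate_population(size, seed=0, p_copd=0.10,
                        p_pneumonia_base=0.05, p_pneumonia_copd=0.30):
    """Return a list of condition sets, one per synthetic patient.

    All probabilities are made-up illustration values.
    """
    rng = random.Random(seed)  # Seeded so runs are repeatable.
    population = []
    for _ in range(size):
        conditions = set()
        if rng.random() < p_copd:
            conditions.add("COPD")
        # The association: COPD raises the chance of pneumonia.
        p = p_pneumonia_copd if "COPD" in conditions else p_pneumonia_base
        if rng.random() < p:
            conditions.add("Pneumonia")
        population.append(conditions)
    return population


def relative_risk(population, exposure, outcome):
    """Risk of the outcome with the exposure divided by the risk without it."""
    exposed = [c for c in population if exposure in c]
    unexposed = [c for c in population if exposure not in c]
    risk_exposed = sum(outcome in c for c in exposed) / len(exposed)
    risk_unexposed = sum(outcome in c for c in unexposed) / len(unexposed)
    return risk_exposed / risk_unexposed
```

With the default values above, `relative_risk(generate_population(20000), "COPD", "Pneumonia")` should come out well above 1, which is the kind of signal a demonstration data set needs to contain for the analysis to be convincing.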
Patient data in Bulk FHIR resource (NDJSON) format can be loaded. The following resource types are supported:
- Patient
- Condition
- Procedure
- MedicationRequest
These are all mapped to the simple internal data model, taking the start date and first SNOMED CT code to create a Clinical Event. Only confirmed, active resources are loaded. For example, if a Condition has a verificationStatus using the http://terminology.hl7.org/CodeSystem/condition-ver-status system and the value is not confirmed, then it will be ignored.
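The mapping rule for Condition resources can be sketched as follows. This is a simplified illustration rather than the server's actual loader (which is written in Java): the helper names are hypothetical, it assumes the start date is carried in `onsetDateTime`, it assumes a Condition without any verificationStatus is kept, and it does not show the "active" check.

```python
import json

VER_STATUS_SYSTEM = "http://terminology.hl7.org/CodeSystem/condition-ver-status"
SNOMED_SYSTEM = "http://snomed.info/sct"


def is_confirmed(condition):
    """False only when a status from the known system is not 'confirmed'."""
    status = condition.get("verificationStatus")
    if not status:
        return True  # Assumption: resources without a status are kept.
    for coding in status.get("coding", []):
        if coding.get("system") == VER_STATUS_SYSTEM:
            return coding.get("code") == "confirmed"
    return True


def first_snomed_code(resource):
    """Return the first SNOMED CT code in resource.code.coding, if any."""
    for coding in resource.get("code", {}).get("coding", []):
        if coding.get("system") == SNOMED_SYSTEM:
            return coding.get("code")
    return None


def to_clinical_events(ndjson_lines):
    """Turn NDJSON Condition lines into simple {date, conceptId} events."""
    events = []
    for line in ndjson_lines:
        if not line.strip():
            continue  # Skip blank lines between records.
        resource = json.loads(line)
        if resource.get("resourceType") != "Condition":
            continue
        if not is_confirmed(resource):
            continue
        concept_id = first_snomed_code(resource)
        start = resource.get("onsetDateTime")  # Assumed start-date field.
        if concept_id and start:
            events.append({"date": start, "conceptId": concept_id})
    return events
```

Each line of an NDJSON file is one complete JSON resource, so the function can be fed an open file object directly, for example `to_clinical_events(open("Condition.ndjson"))`, where the file name is a placeholder.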
A frontend web application for this API is available: health-data-analytics-ui.
The data model is very simple:
- Patient
  - roleId
  - dob
  - dobYear (for optimisation)
  - gender (MALE / FEMALE)
  - event count
  - events (many Clinical Events)
- Clinical Event
  - date
  - conceptId (SNOMED CT Concept Identifier)
A clinical event could represent an observation, finding, drug prescription or procedure depending on the SNOMED CT concept used.
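As a rough illustration of this model, the sketch below expresses the two entities as Python dataclasses and shows how a cohort could be selected with a value set of concept identifiers. The field names follow the list above, but the class layout, the `find_cohort` helper and the choice to derive `dob_year` and `event_count` on the fly are assumptions for the example, not the server's actual Java classes or Elasticsearch mapping.

```python
import datetime
from dataclasses import dataclass, field
from typing import Iterable, List, Set


@dataclass
class ClinicalEvent:
    date: datetime.date
    concept_id: str  # SNOMED CT Concept Identifier


@dataclass
class Patient:
    role_id: str
    dob: datetime.date
    gender: str  # "MALE" or "FEMALE"
    events: List[ClinicalEvent] = field(default_factory=list)

    @property
    def dob_year(self) -> int:
        # Stored separately in the real model for optimisation;
        # derived here to keep the sketch short.
        return self.dob.year

    @property
    def event_count(self) -> int:
        return len(self.events)


def find_cohort(patients: Iterable[Patient], value_set: Set[str]) -> List[Patient]:
    """Patients with at least one event whose concept is in the value set."""
    return [
        p for p in patients
        if any(e.concept_id in value_set for e in p.events)
    ]
```

Keeping the event as just a date and a concept identifier means the same structure serves findings, prescriptions and procedures; the meaning comes entirely from the SNOMED CT concept.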
- Java 8 or later
- 4 GB of memory
- SNOMED CT release RF2 archive
- Patient data (can be generated)
Build the project using Maven.
mvn clean install
Extract the SNOMED CT archive to a directory named release in the root of the project. Only the Snapshot files will be used, so any others can be removed if needed.
If you would like to use synthetic data, there are two ways to generate a patient population to load into the analytics API: use the Synthea Patient Generator to generate FHIR Bulk resources, or use the built-in Patient Generator.
The Synthea project can be used to generate healthcare data in Bulk FHIR format which can be loaded into this API. Synthea has more than 50 modules which contain rules derived from real data to help to generate realistic healthcare data.
Synthea has an online Module Builder which can be used to edit or create new modules to simulate specific scenarios. SNOMED International donated an enhancement to the Module Builder to allow SNOMED CT Value Sets to be used in module states. Using this feature creates richer patient data because a concept is picked from within the value set rather than the same concept being used every time. The bulk data and terminology service options should be enabled when generating data.
A few Synthea modules have been created to support SNOMED CT demonstration scenarios.
Alternatively, a synthetic patient population can be generated using the built-in generator module. This option supports only a small number of scenarios but is much faster.
From the root of the project the following command will generate ~1,200,000 patients, which takes ~1 minute.
java -Xms3g -jar generator/target/generator*.jar
Command options:
--population-size
Optional. Defaults to 1,248,322.
The generated population will be written in native bulk NDJSON format to a directory named patient-data-for-import.
The Data Analytics API is a Java application using Spring Boot with Swagger API documentation.
The server requires a standalone Elasticsearch deployment. Elasticsearch can be run locally. There are also hosted solutions available from AWS and Elastic.co. The Elasticsearch server must be version 6.x; version 7.x will not work. We recommend the latest 6.8.x patch release. https://www.elastic.co/downloads/past-releases#elasticsearch
Once Elasticsearch is running, patient data can be imported into the server from either FHIR or native format.
- Import FHIR Bulk resources from a directory containing the .ndjson files:
java -Xms3g -jar server/target/server*.jar --import-population-fhir='my-documents/fhir-resources'
Or
- Import native bulk resources:
java -Xms3g -jar server/target/server*.jar --import-population='patient-data-for-import'
Or
- Import single file FHIR resources:
java -Xms3g -jar server/target/server*.jar --import-population-fhir-single='patient-data-for-import'
The optional parameter "--import-fhir-version" specifies the FHIR version. Currently, version 3 ("dstu3") and version 4 ("r4") are supported; the default is "r4". Each file must contain a "Bundle" or "Collection" resource. Importing 10,000 patients (roughly 25 GB of data) from https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/QDXLWR takes about 20 minutes on an i7 notebook.
The program will exit when all patient data has been consumed.
Running the import again will add to the existing data. To delete existing patient data, make an HTTP DELETE request to the Elasticsearch patient index before starting the import:
curl -XDELETE localhost:9200/patient
If CPT codes are loaded, cost information can be provided for the procedures within the records of a selected cohort of patients.
Within the application directory, create a directory named cpt-codes containing the files cpt-codes.txt and snomed-cpt-map.txt.
These files will be loaded when the application starts.
For examples of these files see dummy-cpt-codes.
Notice: No real CPT codes are included within this application. A few fictitious codes are used for unit testing purposes.
Once the patient data has been loaded, run the server without the import argument:
java -Xms3g -jar server/target/server*.jar
Once the server is started the API and documentation will be available here: http://localhost:8080/health-analytics-api/
The following data stores will be created when the server starts.
The snomed-index directory contains a Lucene index of the SNOMED CT release, which provides semantic information for the server.
The server should be stopped before removing this data store.
Patient data can be loaded in real time by implementing the HealthDataIngestionSource interface. For example, a class could be added which receives patient record updates over JMS or polls a directory.