Medical records contain confidential and highly sensitive patient information, which must be removed before the records can be used for secondary purposes beyond direct patient care.
This project aims to remove and mask personal data within patient records whilst retaining the high-level meaning of the redacted information.
For example:
"John suffers from Epilepsy. He was born on 01/01/1900"
"[Patient first name] suffers from Epilepsy. He was born on [Date of birth]"
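
The masking step in the example above can be sketched as a span replacement. This is a minimal illustration, not the project's implementation: the `redact` function and the `(start, end, label)` span format are hypothetical, and the spans are located by plain string search here, whereas the project intends them to come from a trained model.

```python
from typing import List, Tuple

# Hypothetical span format: (start, end, label), with `end` exclusive.
Span = Tuple[int, int, str]


def redact(text: str, spans: List[Span]) -> str:
    """Replace each labelled span with its label in square brackets."""
    # Work from right to left so earlier offsets stay valid after each replacement.
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text


record = "John suffers from Epilepsy. He was born on 01/01/1900"
name, dob = "John", "01/01/1900"

# Spans are found by string search purely for illustration.
spans = [
    (record.index(name), record.index(name) + len(name), "Patient first name"),
    (record.index(dob), record.index(dob) + len(dob), "Date of birth"),
]

redacted = redact(record, spans)
print(redacted)
```

Replacing from the last span to the first avoids having to recompute offsets when a label is longer or shorter than the text it replaces.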
This repository is currently incomplete and is not for production use.
The table below gives a brief overview of the categories of personal information that are redacted:
Value | Description |
---|---|
Names | First and last names and any abbreviations |
Dates | All dates |
Contact details | Addresses and unique identifiers |
Health care identifiers | Numbers which may identify an individual |
The structure of the created terminology:
Each annotation will be combined with "Meta-annotations", i.e. annotations of an annotation. These hold the contextual information of the concept.
Meta-annotation | Values | Description |
---|---|---|
Subject | "Patient", "Relative", "Health care Provider", "Other/ N/A" | Who is the subject of the identified concept? |
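
One way to picture an annotation carrying a meta-annotation is sketched below. The `Annotation` class, its field names, and the way the placeholder is composed are assumptions made for illustration, inferred from the "[Patient first name]" example; they are not the terminology's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict

# Allowed values for the "Subject" meta-annotation, taken from the table above.
SUBJECT_VALUES = {"Patient", "Relative", "Health care Provider", "Other/ N/A"}


@dataclass
class Annotation:
    """Hypothetical container for one annotated span and its meta-annotations."""

    text: str                 # the identified text, e.g. "John"
    concept: str              # the concept label, e.g. "first name"
    meta: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        subject = self.meta.get("Subject")
        if subject is not None and subject not in SUBJECT_VALUES:
            raise ValueError(f"Unknown Subject value: {subject!r}")

    def placeholder(self) -> str:
        """Build the replacement text (assumed format: "[<Subject> <concept>]")."""
        subject = self.meta.get("Subject")
        if subject is None or subject == "Other/ N/A":
            return f"[{self.concept}]"
        return f"[{subject} {self.concept}]"


annotation = Annotation("John", "first name", {"Subject": "Patient"})
print(annotation.placeholder())
```

Keeping the subject separate from the concept means the same concept can be reported for a patient, a relative, or a health care provider without defining a new concept for each.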
This project uses the CogStack/MedCAT packages to create a training dataset.
Further information on the MedCAT tool is available here.
There are two essential components of the MedCAT model required for this project.
- Vocab
- Concept Database (CDB)
A training dataset is created using the MedCATtrainer platform. This step labels all identifiable information in the medical records.
Currently in progress