
Tools for EHR patient de-duplication (aka entity resolution)


OneFL Deduper


Intro

Welcome to the OneFlorida "De-Duper" tool.

This tool generates "Unique Identifiers" (UIDs) used for patient de-duplication (aka "Entity Resolution" or "Record Linkage").

The current implementation consists of two separate scripts, each taking a CSV file as input: hasher.py reads PHI_DATA.csv and writes HASHES.csv, which linker.py then reads, as shown in the diagram below.

Note: The hashing process ensures that the "OneFlorida Domain" DOES NOT RECEIVE any data containing PHI.

    +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    |   Partner Domain
    |
    |    (CSV file with PHI)                                (CSV file with no PHI)
    |   +--------------------------+                       +--------------------------+
    |   |   PHI_DATA.csv           | ----> hasher.py ----> |    HASHES.csv            |
    |   | patid, first, last,      |                       | patid, F_L_D_S, F_L_D_R  |
    |   | dob, sex, race           |                       |                          |
    |   +--------------------------+                       +--------------------------+
    |                                                            ||
    +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - || - - - - - -
    |   OneFlorida Domain                                        \/
    |                                                       +--------------------------+
    |                                                       | OneFlorida SFTP Server   |
    |                                                       +--------------------------+
    |                                                            ||
    |                                                            ||
    |                                                            \/
    |                                                       +--------------------------+
    |                                                       |   HASHES.csv             |
    |                                                       | patid, F_L_D_S, F_L_D_R  |
    |                                                       +--------------------------+
    |                                                            |
    |      ____________                                          |
    |    /              \                                        |
    |   |               /|                                      /
    |   |\_____________/ |                                     /
    |   |              | |  <------------- linker.py <--------
    |   |  UF Database | |
    |   |              |/
    |    \_____________/
    |
    |       (Links between hashes -> UUID's)
    |                                                             _____   O
    |       patid, partner_code, linkage_uuid, linkage_hash      / /     -+-
    |         123,          UFH,       abc...,       def...   <-- /       |
    |         456,          FLM,       abc...,       def...   <--        / \
    |         789,          FLM,       987...,       012...
    |
    |    (generate UID's from hashes)
    |
    + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
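
The linking step in the lower half of the diagram can be sketched as follows. This is a minimal illustration only, not the actual linker.py: the `Link` class, the `link_hashes` function, and the in-memory `known` dictionary (standing in for the database) are hypothetical, and the rule "any shared hash means the same patient" is an assumption read off the example rows in the diagram.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    # Column names taken from the diagram above.
    patid: str
    partner_code: str
    linkage_uuid: str
    linkage_hash: str


def link_hashes(rows, partner_code, known=None):
    """Assign a linkage UUID to every hash in `rows`.

    rows:  iterable of dicts with keys patid, F_L_D_S, F_L_D_R
    known: dict mapping hash -> UUID; a hypothetical stand-in for the database
    """
    known = {} if known is None else known
    links = []
    for row in rows:
        hashes = [row["F_L_D_S"], row["F_L_D_R"]]
        # Assumption: reuse the UUID of the first hash already seen.
        # Conflicts (two hashes pointing at different UUIDs) are not resolved here.
        existing = next((known[h] for h in hashes if h in known), None)
        uid = existing or uuid.uuid4().hex
        for h in hashes:
            known.setdefault(h, uid)
            links.append(Link(row["patid"], partner_code, uid, h))
    return links
```

With this sketch, two partners submitting the same `F_L_D_S` hash for different `patid` values would receive the same `linkage_uuid`, which mirrors rows 123 and 456 in the diagram.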

Note on PHI: The hasher.py script uses the Python implementation of the SHA-256 algorithm to scramble the PHI so that patients cannot be re-identified from the hashes. SHA-256 is standardized by the National Institute of Standards and Technology (NIST) in FIPS 180-4.
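
To make the hashing step concrete, here is a minimal sketch using Python's standard `hashlib` module. It is not the actual hasher.py: the function names, the trim-and-lowercase normalization, the `|` separator, and the reading of `F_L_D_S` as first + last + dob + sex (and `F_L_D_R` as first + last + dob + race) are all assumptions inferred from the column names in the diagram.

```python
import csv
import hashlib
import io


def normalize(value: str) -> str:
    # Assumed normalization: trim whitespace and lowercase.
    return value.strip().lower()


def sha256_hex(*parts: str) -> str:
    # Join the normalized fields with an assumed separator and hash the result.
    joined = "|".join(normalize(p) for p in parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def hash_row(row: dict) -> dict:
    # Only patid and the two hashes leave this function; no PHI column is copied.
    return {
        "patid": row["patid"],
        "F_L_D_S": sha256_hex(row["first"], row["last"], row["dob"], row["sex"]),
        "F_L_D_R": sha256_hex(row["first"], row["last"], row["dob"], row["race"]),
    }


def hash_csv(phi_text: str) -> str:
    # Read PHI rows from CSV text and return CSV text containing hashes only.
    reader = csv.DictReader(io.StringIO(phi_text))
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=["patid", "F_L_D_S", "F_L_D_R"], lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(hash_row(row))
    return out.getvalue()
```

Normalizing before hashing matters because SHA-256 is exact: "Doe" and "doe " would otherwise produce unrelated hashes and the same patient would never link.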

Installation

The two components of the application (hasher, linker) need proper configuration in order to function. For more details, please refer to docs/installation.md and docs/installation-linker.md.

The format for the input file for the hasher component is described in the input-specs.md document.

References