
Artificial Data Generator

Pipelines for generating large volumes of anonymous artificial data that share some of the characteristics of real NHS data.

This material is maintained by the NHS Digital Data Science team.

See our other work here: NHS Digital Analytical Services.

To contact us, raise an issue on GitHub or get in touch via email and we will respond promptly.

Overview

What is artificial data?

Artificial data is an anonymous representation of real data

  • Artificial data provides an anonymous representation of some of the properties of real datasets.
  • Artificial data preserves the formatting and structure of the original dataset, but may otherwise be unrealistic.
  • Artificial data reproduces some of the statistical properties and content complexity of fields in the real data, while excluding cross-dependencies between fields to prevent risks of reidentification.
  • Artificial data is completely isolated from any record-level data.
  • It is not possible to use artificial data to reidentify individuals, gain insights, or build statistical models that would transfer onto real data.

How is artificial data generated?

There are three stages involved in generating the artificial data:

  1. The Metadata Scraper extracts anonymised, high-level aggregates from real data at a national level.
    • At this stage key identifiers (such as patient ID) are removed and small number suppression is applied in order to prevent reidentification at a later stage.
  2. The Data Generator samples from the aggregates generated by the Metadata Scraper on a field-by-field basis and puts the sampled values together to create artificial records.
  3. Postprocessing tweaks the output of the Data Generator to make the data appear more realistic (such as swapping randomly generated birth and death dates to ensure sensible ordering). This also includes adding in randomly generated identifying fields (such as 'patient' ID) which were removed at the Metadata Scraper stage. A minimal sketch of this flow is shown below.
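
The following is a minimal, illustrative Python sketch of this three-stage flow, not the actual pipeline code; the field names, distributions and identifier format are made up:

import random
import datetime

# Stage 1 output (Metadata Scraper): per-field value frequencies only, with
# identifiers removed and small counts suppressed. Field names and
# distributions here are made up.
metadata = {
    "SEX": {"1": 0.49, "2": 0.51},
    "ADMISSION_METHOD": {"11": 0.6, "21": 0.3, "22": 0.1},
}

def random_date(start_year=1930, end_year=2021):
    start = datetime.date(start_year, 1, 1)
    end = datetime.date(end_year, 1, 1)
    return start + datetime.timedelta(days=random.randrange((end - start).days))

def generate_record(metadata):
    # Stage 2 (Data Generator): sample each field independently, so no
    # cross-field dependencies from the real data are reproduced
    record = {
        field: random.choices(list(freqs), weights=list(freqs.values()))[0]
        for field, freqs in metadata.items()
    }
    # Dates are generated at random, so their ordering may not make sense yet
    record["DOB"] = random_date()
    record["DOD"] = random_date()
    return record

def postprocess(record):
    # Stage 3 (Postprocessing): swap birth and death dates where needed so the
    # ordering is sensible, and add back a randomly generated identifier
    if record["DOD"] < record["DOB"]:
        record["DOB"], record["DOD"] = record["DOD"], record["DOB"]
    record["PATIENT_ID"] = "ART{:08d}".format(random.randrange(10**8))
    return record

artificial_records = [postprocess(generate_record(metadata)) for _ in range(5)]

Each artificial record is assembled only from field-level aggregates, never from real records.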

How do we ensure artificial data is anonymous?

Between steps 1 and 2 above, a manual review is performed to check for any accidental disclosure of personally identifiable information (PII) in the metadata. The review is carried out against a checklist that has been approved by the Statistical Disclosure Control Panel at NHS Digital, chaired by the Chief Statistician. The outcomes of the review are signed-off by a senior manager.

The data generator in step 2 uses only reviewed / signed off metadata and is completely isolated from any record-level data in the original dataset.

Dependencies & environment

The code is designed to be run within the Databricks environment.

Warning

Python files represent Databricks notebooks, not Python scripts / modules!

This means things like imports don't necessarily work the way you might expect!
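
For example, shared code is normally brought into a notebook's scope by running the notebook that defines it, rather than by importing it; the notebook path and name below are illustrative:

# In a Databricks notebook, shared code is pulled in by running another
# notebook in a cell of its own (path is illustrative):
#
#     %run ./notebooks/common/table_helpers
#
# whereas a plain `import table_helpers` would fail, because these .py files
# are notebook sources rather than modules on sys.path.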

The codebase was developed and tested on Databricks Runtime 6.6. We have packaged up the dependencies with the code, so the code should run on Databricks Runtime 6.6 without installing additional packages.

Note

We have plans to pull out the core logic into a Python package to make it reusable by others (outside of Databricks), but we're not there yet!

Look out for future updates, or feel free to reach out to us via email and we'd be happy to talk.

Repo structure

Top-level structure

The repo has the following structure when viewed from the top level:

root
|-- projects                                # Code Promotion projects (see full description below)
|   |-- artificial_hes                      # For generating artificial HES data
|   |
|   |-- artificial_hes_meta                 # For scraping HES metadata
|   |
|   |-- iuod_artificial_data_generator      # Reusable code library & dataset-specific pipelines
|   |
|   |-- iuod_artificial_data_admin          # For managing reviewed metadata
|
|-- docs                                    # Extended documentation for users
|
|-- notebooks                               # Helper notebooks
|   |-- admin                               # Admin helpers 
|   |-- user                                # User helpers
|   
|-- utils                                   # Databricks API helper scripts

Reusable logic

The common logic shared across different pipelines is stored within projects/iuod_artificial_data_generator/notebooks.

  • The entry-points for scraping metadata are the driver.py files within scraper_pipelines.
  • The entry-points for generating artificial data are the driver.py files within generator_pipelines.
  • The remaining notebooks / folders in this directory store reusable code.

Note: in the NHS Digital Databricks environment, the driver notebooks are not triggered directly - rather they are executed as ephemeral notebook jobs by the run_notebooks.py notebooks in one of the projects (for example, artificial_hes_meta executes the driver notebook within scraper_pipelines/hes). See below for more details on Code Promotion.
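
For illustration only (the real wiring lives in each project's run_notebooks.py), a driver can be launched as an ephemeral notebook job from another notebook with dbutils.notebook.run; the path and arguments below are assumptions:

# Hypothetical sketch: execute a driver notebook as an ephemeral notebook job.
# The relative path and the arguments passed here are assumptions, not the
# project's real interface.
result = dbutils.notebook.run(
    "./notebooks/scraper_pipelines/hes/driver",  # driver notebook to execute
    3600,                                        # timeout in seconds
    {"meta_database": "artificial_hes_meta"},    # example arguments only
)
print(result)  # whatever the driver returns via dbutils.notebook.exit(...)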

Code Promotion projects

The children of the projects directory follow the 'Code Promotion' project structure, which is specific to the NHS Digital 'Code Promotion' process. For more information on Code Promotion, see the subsection below.

For each dataset, there are two 'Code Promotion' projects:

  • One to extract metadata (for example artificial_hes_meta)
  • One to publish the generated artificial data (for example artificial_hes)

There are two general projects that act across datasets:

  • iuod_artificial_data_generator is responsible for generating artificial data for a specified dataset.
  • iuod_artificial_data_admin is used to move metadata between access-scopes (specifically from the sensitive to the non-sensitive scope after review and sign-off).

What is Code Promotion?

Code Promotion is a process designed by Data Processing Services (DPS) at NHS Digital to allow users of Databricks in DAE to promote code between environments and run jobs on an automated schedule.

Jobs inside a Code Promotion project have read/write access to the project's own database (which shares the project's name) and may also have read or read/write access to a number of other databases or tables.

Code Promotion structure

Each Code Promotion project must adhere to the following structure.

{project_name}
|-- notebooks           # Library code called by run_notebooks.py
|
|-- schemas             # Library code called by init_schemas.py
|
|-- tests               # Tests for notebooks
|
|-- cp_config.py        # Configures the Databricks jobs (a strict set 
|                       # of variables must be defined based on the Code 
|                       # Promotion specification)
|
|-- init_schemas.py     # Sets up / configures the database associated with the project (illustrated below)
|
|-- run_notebooks.py    # Entry-point for the main processing for the project
|
|-- run_tests.py        # Runs all the tests; executed during the build process
|
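
As a purely illustrative sketch of the kind of setup init_schemas.py performs (the real table definitions live in the schemas folder; the database, table and column names below are made up):

# Hypothetical sketch only: ensure the project's database exists and create an
# empty table in it. Names and columns are illustrative, not the real schema.
spark.sql("CREATE DATABASE IF NOT EXISTS artificial_hes")
spark.sql("""
    CREATE TABLE IF NOT EXISTS artificial_hes.artificial_hes_apc (
        PSEUDO_PATIENT_ID STRING,
        ADMISSION_DATE    DATE,
        SEX               STRING
    )
    USING DELTA
""")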

Code Promotion jobs

Within Databricks, each Code Promotion project is associated with three jobs, which trigger the init_schemas.py, run_tests.py and run_notebooks.py notebooks. These jobs have a specific set of permissions to control their access scopes: the databases they can select from, modify and so on. This is why driver notebooks are not triggered directly within Databricks (as per the note above): the jobs are set up with exactly the permissions needed to perform the tasks they are designed to do.

We have designed the jobs and their permissions so as to completely isolate the access scopes for pipelines that scrape metadata from those that generate artificial data. It is not possible for the pipelines that generate artificial data to read any sensitive data.

Utilities

There are three scripts in the utils folder that allow developers to sync changes to the Code Promotion projects iuod_artificial_data_generator, iuod_artificial_data_admin, artificial_hes and artificial_hes_meta across environments (roughly equivalent manual CLI commands are shown after the list below). Users will need to set up the Databricks CLI in order to use these scripts.

  • export.ps1: exports all workspaces in Databricks from staging/dev to projects in your local environment. The script will stash local changes before exporting. The stash is not applied until the user runs 'git stash apply'.
  • import.ps1: imports all directories in your local version of projects to staging/dev in Databricks. Importing will overwrite the version of any of these projects in staging, so the script includes user warnings and confirmation to prevent accidental overwriting. If the overwrite is intentional then the user will need to confirm by typing the project name.
  • list_releases.ps1: returns the most recent version of each Code Promotion project in code-promotion.
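
Under the hood these scripts call the Databricks workspace API. For reference, roughly equivalent one-off commands with the (legacy) Databricks CLI look like the following; the workspace and local paths are placeholders:

# Export a project from the Databricks workspace to the local repo (paths are placeholders)
databricks workspace export_dir /Staging/artificial_hes ./projects/artificial_hes

# Import a local project back into the Databricks workspace, overwriting what is there
databricks workspace import_dir --overwrite ./projects/artificial_hes /Staging/artificial_hes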

Setting up the Databricks CLI

We use the Databricks API to import / export code to / from our development environment. In your environment, install the Databricks CLI using

pip install databricks-cli

Then set up authentication using

databricks configure --token

You will be prompted to enter the host and your Databricks personal access token.

If you have not previously generated a Databricks personal access token: in Databricks, go to Settings > User Settings > Access Tokens and click Generate New Token. Make a note of the token somewhere secure and copy it into the prompt.

Licence

The Artificial Data Generator codebase is released under the MIT License.

The documentation is © Crown copyright and available under the terms of the Open Government Licence v3.0.