/mdm

Python API for Aidbox MDM module

Primary LanguageJupyter Notebook

Python API for Aidbox MDM module

MDM module helps you deduplicate data in your Aidbox. This repository has 2 Python modules:

  • aidbox — module for communication with Aidbox
  • splink — fork of splink with support for Aidbox

Installation

You need to have Python 3, poetry, and Jupyter.

Then run

poetry install
poetry shell
python -m ipykernel install --user --name aidobx --display-name Aidbox

Usage

Connection to Aidbox

Create Aidbox connection:

import aidbox
box = aidbox.Aidbox('https://base-url', 'client-id', 'client-secret')

Check connection:

box.check()

Creating MDM model

Create empty MDM model:

import aidbox.mdm as mdm

model = mdm.Model('ResourceType')

Set up fields to extract in MDM table:

model['first_name'] = ['name', 0, 'given', 0]
model['last_name'] = ['name', 0, 'family']

Set up term frequencies for needed fields:

model.enable_frequencies('first_name')

Apply model to create MDM table in Aidbox:

model.apply(box)

Learning model weights

See splink documentation to learn how to use splink. This guide shows only differences needed for data linkage with Aidbox.

Change id column from unique_id to id:

settings = {
    # ...
    'unique_id_column_name': 'id',
    # ...
}

Create linker

linker = PostgresLinker(model, box, settings)

Splink caches intermediate results. If you want to start from scratch (e.g. your data has changed), use

linker.drop_splink_tables()

Train model as usual

Export model as zen-lang edn file for Aidbox configuration project

linker.save_zen_model_edn('filename.edn')