SynthGauge is a Python library providing a framework in which to evaluate synthetically generated data.
Warning
The synthgauge
package is no longer actively developed by the Data Science
Campus, and its GitHub repository has been archived. You can still install
synthgauge
from GitHub and PyPI.
We also recommend the package sdmetrics
from the Synthetic Data Vault for all of your synthetic
data evaluation needs.
The library provides a range of metrics and visualisations for assessing and
comparing distributions of features between real and synthetic data. At its
core is the Evaluator
class, which provides a consistent interface for
assessing two sets of data. By creating several Evaluator
instances, you can
easily evaluate synthetic data generated from a range of methods in a
consistent and comparable manner.
🔒 vs. 📊
When generating synthetic data, there is generally a trade-off between privacy (i.e. keeping sensitive information private) and utility (i.e. ensuring the dataset is still fit for purpose).
The metrics included in SynthGauge fall into these categories and work is continuing to add more metrics.
SynthGauge is a toolkit providing metrics and visualisations that aid the user in the assessment of their synthetic data.
SynthGauge is not going to make any decisions on behalf of the user. It won’t specify if one synthetic dataset is better than another. This decision is dataset- and purpose-dependent so can vary widely from user to user.
Simply, SynthGauge is a decision-support tool, not a decision-making tool.
The synthgauge
package is available on PyPI and can be installed via
pip
in the standard way:
$ python -m pip install synthgauge
If you'd rather install the package from source, you can do so by first cloning
this repository from GitHub. The synthgauge
package is configured using
setup.cfg
, which requires newer versions of pip
, setuptools
and wheel
.
Be sure to update these if you haven't for a while.
$ cd /path/to/synthgauge
$ python -m pip install --upgrade pip setuptools wheel
$ python -m pip install .
Now you're ready to start using the package!
To help users get acquainted with the package, an example Jupyter Notebook is
included in the 📂 examples
directory. This notebook is
also available in the package documentation.
The following shows an example workflow for evaluating a single real-synthetic dataset pair.
>>> import synthgauge as sg
>>>
>>> # 1. Create or read in some data as a `pandas.DataFrame`
>>> real = sg.datasets.make_blood_types_df(noise=0, nan_prop=0, seed=0)
>>> synth = sg.datasets.make_blood_types_df(noise=1, nan_prop=0, seed=0)
>>>
>>> # 2. Instantiate an Evaluator object
>>> ev = sg.Evaluator(real, synth)
>>>
>>> # 3. Explore the data
>>> ev.describe_numeric()
count mean std min 25% 50% 75% max
age_real 1000.0 41.745 7.073472 22.0 37.0 41.0 48.0 62.0
age_synth 1000.0 41.536 9.195829 18.0 35.0 41.0 48.0 68.0
height_real 1000.0 174.976 7.771346 153.0 169.0 176.0 181.0 194.0
height_synth 1000.0 175.266 9.633070 147.0 168.0 176.0 182.0 205.0
weight_real 1000.0 80.014 9.455115 56.0 74.0 80.0 86.0 114.0
weight_synth 1000.0 80.117 11.113452 50.0 72.0 80.0 88.0 118.0
>>> ev.describe_categorical()
count unique most_frequent freq
blood_type_real 1000 4 O 384
blood_type_synth 1000 4 A 535
eye_colour_real 1000 3 Brown 577
eye_colour_synth 1000 3 Brown 664
hair_colour_real 1000 4 Brown 435
hair_colour_synth 1000 4 Brown 480
>>> ev.plot_histograms(figsize=(12,12))
<Figure size 1200x1200 with 6 Axes>
>>>
>>> # 4. Add metrics to compute
>>> ev.add_metric('wasserstein', 'wass-age', feature='age')
>>>
>>> # 5. Evaluate the metrics and review the results
>>> results = ev.evaluate()
>>> print(results)
{'wass-age': 1.7610000000000001}
The API is described in the reference documentation. Please direct any questions to datacampus@ons.gov.uk.
If you encounter any bugs as part of your usage of synthgauge
, please file an
issue detailing as much information as possible and include a minimal
reproducible example. If you wish to contribute code to the project, please
refer to the contribution guidelines.