/nist-crc-2023

NIST Collaborative Research Cycle on Synthetic Data. Learn about Synthetic Data week by week!

Primary language: Jupyter Notebook. License: MIT.


NIST Privacy Collaborative Research

The Data-Centric AI Community just launched a small community project to experiment with the NIST Challenge!

  • Goal: To learn about Synthetic Data and how it can be used to prepare sensitive private data for public release!
  • Dates: From April to July. You can also join at any time, follow the weekly plan, and post questions on our Discord.
  • Where: 🤖-nist-challenge channel in our Discord Server
  • Touch Points: We meet every Friday around 4 PM GMT on the 🧠-code-with-me channel to discuss the project.

Overview

🎯 The overall goal of the project is to explore synthetic data to prepare sensitive private data for public release.

📀 NIST has launched a benchmark of three datasets, MA (Massachusetts), TX (Texas), and NATIONAL, which you can use in the project.

📊 To provide an evaluation of the de-identified data against the target/real data, NIST has created the sdnist package that can be installed according to the instructions below.

💻 To create the de-identified data, we'll use the ydata-synthetic package, explore different model settings, and study the effect they have on the final results.

🧭 Learning Outcomes

Week | What you will learn
1 | Goal and objectives of the project. You'll connect with other learners in the DCAI Discord Server and be added to the NIST Team to access the 🤖-nist-challenge channel and receive permissions to collaborate on the GitHub project.
2 | Basics of Synthetic Data. You will learn what synthetic data is, how it is generated, and what its main applications are.
3 | Basics of Data Profiling. You will learn what data profiling is, how to understand your data with descriptive statistics, and what the common data quality issues are. You will also explore the NIST datasets with ydata-profiling and preprocess the data according to your findings.
4 & 5 | Generation of Synthetic Data. You will explore Deep Learning models (Generative Adversarial Networks, or GANs) to generate realistic synthetic data using ydata-synthetic.
6 & 7 | Basics of Evaluating Synthetic Data. You will explore some strategies to evaluate synthetic data and investigate possible improvements to your solution. We will use the sdnist package to evaluate our synthetic data.
8 | Project Showcase. You will learn how to best showcase and publicize your project in your data portfolio, CV, GitHub, or Medium account.

🔨 Tasks

Week 1:

Week 2:

Week 3:

  • Learn about the basic aspects of Data Profiling:

  • Start profiling the NIST data:

    • Install ydata-profiling (check the Installation Instructions below) and don't forget to star it, thank you! ⭐️
    • Choose one of the NIST datasets (MA, TX, or NATIONAL):
      • The datasets are available here
      • Run a Profile Report on your data (check the Installation Instructions below)
      • Create an Excel file to record what you learn. Suggested columns: Feature Name | Data Type (Numeric/Categorical) | Missing Values (Y/N) | Notes/Observations. Base your observations on the profiling report and on the feature descriptions provided with the data.
  • Post questions and comments on the 🤖-nist-challenge channel.

  • Meet us on Friday (May 12) to discuss what you've learned (check the available slots on our 📅 Discord Calendar). Don't forget to bring your Excel file with the data description and your profiling report!
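
The feature log suggested above can be bootstrapped with pandas before you fill in the notes by hand. This is a minimal sketch, not part of the project materials: the helper name build_feature_log, the toy column names, and the rule that treats every numeric dtype as "Numeric" are all assumptions. Categorical features stored as integer codes will be labeled "Numeric" and need manual correction.

```python
import pandas as pd


def build_feature_log(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column, using the suggested spreadsheet layout."""
    rows = []
    for name in df.columns:
        column = df[name]
        # Assumption: any numeric dtype counts as "Numeric"
        is_numeric = pd.api.types.is_numeric_dtype(column)
        rows.append({
            "Feature Name": name,
            "Data Type": "Numeric" if is_numeric else "Categorical",
            "Missing Values": "Y" if column.isna().any() else "N",
            "Notes/Observations": "",  # to be filled in by hand
        })
    return pd.DataFrame(rows)


# Toy data standing in for a NIST dataset (hypothetical columns)
toy = pd.DataFrame({"age": [34, 51, None], "sex": ["f", "m", "m"]})
feature_log = build_feature_log(toy)
print(feature_log)
```

To keep the result as a spreadsheet, feature_log.to_excel("feature_log.xlsx", index=False) should work once an Excel writer such as openpyxl is installed.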

Weeks 4 & 5:

⚙️ Installation Instructions

📦 How to create and use Virtual Environments?

A lot of troubleshooting comes down to mismatches between environments and package requirements. If you're new to data science development, you may simply install packages into your global Python environment. This can cause a lot of headaches when project requirements conflict.

Virtual Environments are ideal for overcoming this issue: they isolate your installations from the "global" environment, so that you don't have to worry about conflicts. If you've never used virtual environments for your data science projects, you can start by installing anaconda. If you need a little convincing that this is a nice tool to have in your toolbelt, then check this post comparing conda with pip, venv, and pyenv.

Once anaconda is installed, creating a new environment is as easy as running this on your shell:

conda create --name synth-env python=3.10

This creates a new environment called synth-env with Python version 3.10.X. You can then switch to this environment by activating it:

conda activate synth-env

In this new environment, you can still call pip to install python packages, such as ydata-synthetic:

pip install ydata-synthetic

Now you can open up your Python editor or Jupyter Notebook and use synth-env as your development environment, without having to worry about conflicting versions or packages between projects! Once you're done, you can deactivate the environment using:

conda deactivate
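
To confirm that a notebook or script is actually running inside synth-env, you can inspect the interpreter from Python itself. The sketch below is an illustration only: the helper name describe_environment is hypothetical, and it relies on conda setting the CONDA_DEFAULT_ENV variable when an environment is activated, so the value is None outside conda.

```python
import os
import sys


def describe_environment() -> dict:
    """Collect a few facts that identify the running Python environment."""
    return {
        "python_version": f"{sys.version_info.major}.{sys.version_info.minor}",
        "interpreter": sys.executable,
        "prefix": sys.prefix,
        # Set by conda on activation; None when no conda environment is active
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


env_info = describe_environment()
for key, value in env_info.items():
    print(f"{key}: {value}")
```

If conda_env does not show the environment you expect, the notebook kernel is probably using a different interpreter than the one you activated in the shell.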

Suggested Materials

📊 How to install ydata-profiling and create a Profiling Report?

You may start by creating your virtual environment and installing the package:

conda create -n synth-env python=3.10
conda activate synth-env
pip install ydata-profiling==4.1.2

Then, in your Jupyter Notebook or other editor (e.g., PyCharm), load your pandas DataFrame as you normally would. Generating the profiling report then takes only a few lines:

import pandas as pd
from ydata_profiling import ProfileReport

# Read the data from a csv file (NIST "MA" data in the example)
df = pd.read_csv("ma2019.csv")

# Generate the data profiling report 
original_report = ProfileReport(df, title='Original Data')
original_report.to_file("original_report.html")

You can then navigate the report to investigate the data quality issues it flags and study the basic descriptive statistics of your data!

Additional Materials

🤖 How to install ydata-synthetic and create a synthesizer?

You may use your previous virtual environment (synth-env). Activate it and install the package:

conda activate synth-env
pip install ydata-synthetic==1.1.0

Then, you can leverage one of the models available in the package. In this example, we will be using CTGAN:

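from pmlb import fetch_data

from ydata_synthetic.synthesizers.regular import RegularSynthesizer
from ydata_synthetic.synthesizers import ModelParameters, TrainParameters

# Load the "adult" example dataset from PMLB
data = fetch_data('adult')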
num_cols = ['age', 'fnlwgt', 'capital-gain', 'capital-loss', 'hours-per-week']
cat_cols = ['workclass','education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country', 'target']

# Defining the training and model parameters
batch_size = 500
epochs = 500+1
learning_rate = 2e-4
beta_1 = 0.5
beta_2 = 0.9

# Create and train the model
ctgan_args = ModelParameters(batch_size=batch_size,
                             lr=learning_rate,
                             betas=(beta_1, beta_2))

train_args = TrainParameters(epochs=epochs)

synth = RegularSynthesizer(modelname='ctgan', model_parameters=ctgan_args)
synth.fit(data=data, train_arguments=train_args, num_cols=num_cols, cat_cols=cat_cols)

# Generate new samples
synth_data = synth.sample(1000)

print(synth_data)

You can also check further examples with other models.
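
Once samples are generated, a quick sanity check is to compare how often each category appears in the real and synthetic data. The sketch below computes the total variation distance for a single categorical column with pandas. It is an illustrative helper and not one of the sdnist metrics; the function name and the toy series are assumptions. A value of 0 means identical category frequencies, and 1 means no overlap at all.

```python
import pandas as pd


def total_variation_distance(real: pd.Series, synth: pd.Series) -> float:
    """Half the sum of absolute differences between category frequencies."""
    real_freq = real.value_counts(normalize=True)
    synth_freq = synth.value_counts(normalize=True)
    # Align on the union of categories; a category missing on one side counts as 0
    real_freq, synth_freq = real_freq.align(synth_freq, fill_value=0.0)
    return float(0.5 * (real_freq - synth_freq).abs().sum())


# Toy columns standing in for one real and one synthetic feature (hypothetical)
real_toy = pd.Series(["a", "a", "b", "b"])
synth_toy = pd.Series(["a", "a", "a", "b"])
tvd = total_variation_distance(real_toy, synth_toy)
print(f"TVD: {tvd:.2f}")
```

With the example above, the same function could be applied per column, for instance total_variation_distance(data['sex'], synth_data['sex']), assuming both frames keep the same column names.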