The Data-Centric AI Community just launched a small community project to experiment with the NIST Challenge!
- Goal: To learn about Synthetic Data and how it can be used to prepare sensitive private data for public release!
- Dates: From April to July. You can also join at any time, follow the weekly plan, and post questions on our Discord.
- Where: 🤖-nist-challenge channel in our Discord Server
- Touch Points: We meet every Friday around 4 PM GMT on the 🧠-code-with-me channel to discuss the project.
🎯 The overall goal of the project is to explore synthetic data to prepare sensitive private data for public release.
📀 NIST has launched a benchmark of 3 datasets, `MA` (Massachusetts), `TX` (Texas), and `NATIONAL`, which you can use in the project.
📊 To provide an evaluation of the de-identified data against the target/real data, NIST has created the `sdnist` package, which can be installed according to the instructions below.
💻 To create the de-identified data, we'll use the `ydata-synthetic` package, explore different model settings, and study the effect they have on the final results.
Week | What you will learn |
---|---|
1 | Goal and objectives of the project. You'll connect with other learners in the DCAI Discord Server and be added to the NIST Team to access the 🤖-nist-challenge channel and receive permissions to collaborate on the GitHub project. |
2 | Basics of Synthetic Data. You will learn what synthetic data is, how it is generated, and what its main applications are. |
3 | Basics of Data Profiling. You will learn what data profiling is, how to understand your data with descriptive statistics, and what the common data quality issues are. You will also explore the NIST datasets with ydata-profiling and preprocess the data according to your findings. |
4 & 5 | Generation of Synthetic Data. You will explore Deep Learning models (Generative Adversarial Networks, GANs) to generate realistic synthetic data using `ydata-synthetic`. |
6 & 7 | Basics of Evaluating Synthetic Data. You will explore some strategies to evaluate synthetic data and investigate possible improvements to your solution. We will explore the sdnist package to evaluate our synthetic data. |
8 | Project Showcase. You will learn how to best showcase and publicize your project in your data portfolio, CV, GitHub, or Medium Account. |
- Read the instructions and information about the challenge
- Learn about the benchmark data released -- The NIST Diverse Communities Data Excerpts
- Post questions and ideas on the 🤖-nist-challenge channel
- Learn about the basic aspects of Synthetic Data:
- Post questions and comments on the 🤖-nist-challenge channel
- Meet us on Friday (May 5) to discuss what you've learned (check the available slots on our 📅 Discord Calendar)
- Learn about the basic aspects of Data Profiling:
  - 📺 Auditing Data Quality with ydata-profiling: learn what data profiling is, what common data quality issues we find in real-world domains (can you spot a few in the NIST datasets?), and how `ydata-profiling` can help you diagnose and overcome them
  - 📖 Awesome Data Science Tools to Master in 2023: Data Profiling Edition: learn more about data profiling and existing open source tools to understand your data to the fullest!
  - 📖 Auditing Data Quality with YData Profiling: an overview of `ydata-profiling` functionalities and how-to's
- Start profiling the NIST data:
  - Install `ydata-profiling` (check the Installation Instructions below) and don't forget to star it, thank you! ⭐️
  - Choose one of the NIST datasets (`MA`, `TX`, or `NATIONAL`):
    - The datasets are available here
    - Run a Profile Report on your data (check the Installation Instructions below)
    - Create an Excel file to register your learnings. Suggestion for the columns: `Feature Name | Data Type (Numeric/Categorical) | Missing Values (Y/N) | Notes/Observations`. Your observations should be based on the profiling report, but also on the description of the features provided
- Post questions and comments on the 🤖-nist-challenge channel.
- Meet us on Friday (May 12) to discuss what you've learned (check the available slots on our 📅 Discord Calendar). Don't forget to bring your Excel file with the data description and your profiling report!
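The feature sheet suggested above can be pre-filled from the data before you start taking notes. The following is a minimal sketch, assuming the dataset is already loaded into a pandas DataFrame; `feature_inventory` is a hypothetical helper name, not part of `ydata-profiling`. The data type is only guessed from the column dtype, so integer-coded categories will show up as "Numeric" and should be corrected by hand against the feature descriptions.

```python
import pandas as pd


def feature_inventory(df: pd.DataFrame) -> pd.DataFrame:
    """Build a starter sheet with one row per feature of `df`."""
    rows = []
    for name in df.columns:
        column = df[name]
        # dtype-based guess only: integer-coded categories will look numeric
        is_numeric = pd.api.types.is_numeric_dtype(column)
        rows.append(
            {
                "Feature Name": name,
                "Data Type (Numeric/Categorical)": "Numeric" if is_numeric else "Categorical",
                "Missing Values (Y/N)": "Y" if column.isna().any() else "N",
                "Notes/Observations": "",  # to be filled in while reading the report
            }
        )
    return pd.DataFrame(rows)


# Tiny illustrative frame (made-up values, not NIST data)
example = pd.DataFrame({"AGE": [34, 51, None], "SEX": ["F", "M", "F"]})
inventory = feature_inventory(example)
print(inventory)
```

Calling `inventory.to_csv("feature_inventory.csv", index=False)` writes a file that Excel can open; `inventory.to_excel(...)` should also work if an Excel writer such as `openpyxl` is installed in your environment.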
- Investigate `ydata-synthetic` and some of the models used to generate synthetic data:
- Start experimenting with `ydata-synthetic` (check the Installation Instructions below and don't forget to star it, thank you! ⭐️). If you prefer a UI experience, you can also leverage the Streamlit App in version 1.0.0:
- Compare your synthetic data with the real data using the `.compare()` functionality of `ydata-profiling`:
  - 📖 How to compare 2 datasets with ydata-profiling. What are the obtained results? Are there any aspects that you can improve?
- Post questions and comments on the 🤖-nist-challenge channel! You can upload your profiling reports to the channel so that we can discuss changes and improvements.
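The real-versus-synthetic comparison mentioned above can be sketched as follows. This assumes `ydata-profiling` is installed and that both DataFrames come from the same dataset; `shared_columns` and `build_comparison_report` are hypothetical helper names, and the output file name is an arbitrary choice.

```python
import pandas as pd


def shared_columns(real_df: pd.DataFrame, synth_df: pd.DataFrame) -> list:
    """Columns present in both frames, in the order of the real data."""
    return [col for col in real_df.columns if col in synth_df.columns]


def build_comparison_report(real_df, synth_df, output_file="real_vs_synthetic.html"):
    """Profile both frames and write a side-by-side comparison report."""
    # Imported lazily so the helpers can be defined before the package is installed
    from ydata_profiling import ProfileReport

    cols = shared_columns(real_df, synth_df)
    real_report = ProfileReport(real_df[cols], title="Real Data")
    synth_report = ProfileReport(synth_df[cols], title="Synthetic Data")

    # .compare() returns a new report that shows both datasets side by side
    comparison = real_report.compare(synth_report)
    comparison.to_file(output_file)
    return comparison
```

A typical call would be `build_comparison_report(real_df, synth_data)`, where `synth_data` is the DataFrame returned by your synthesizer. Restricting both reports to the shared columns is a design choice made here to keep the two profiles aligned.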
📦 How to create and use Virtual Environments?
A lot of troubleshooting arises from mismatches between environments and package requirements. If you're new to data science development, you may simply install packages into your global Python environment. This can cause a lot of headaches when project requirements conflict.
Virtual Environments are ideal to overcome this issue: they isolate your installations from the "global" environment, so that you don't have to worry about conflicts. If you've never used virtual environments for your data science projects, you can start by installing anaconda. If you need a little convincing that this is a nice tool to have in your toolbelt, then check this post comparing `conda` with `pip`, `venv`, and `pyenv`.
Once anaconda is installed, creating a new environment is as easy as running this in your shell:
conda create --name synth-env python=3.10
This creates a new environment called synth-env with Python version 3.10.X. You can then switch to this environment by activating it:
conda activate synth-env
In this new environment, you can still call `pip` to install Python packages, such as `ydata-synthetic`:
pip install ydata-synthetic
Now you can open up your Python editor or Jupyter Notebook and use `synth-env` as your development environment, without having to worry about conflicting versions or packages between projects! Once you're done, you can deactivate the environment using:
conda deactivate
- 📖 Environments, Conda, Pip, aaaaah!: How to manage Python Environments without a headache
- 📺 How to "pip install ydata-synthetic" without errors!: How to install anaconda, create a virtual environment using `conda`, install packages with `pip`, and use the virtual environments in PyCharm or Jupyter Notebooks
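A common source of confusion is an editor or notebook that is still pointing at the global interpreter rather than `synth-env`. The sketch below, which uses only the standard library, prints which interpreter is running and which versions of the project packages it can see; `describe_environment` is a hypothetical helper name.

```python
import sys
from importlib import metadata


def describe_environment(packages=("ydata-synthetic", "ydata-profiling")):
    """Return the running interpreter, its prefix, and installed package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "executable": sys.executable,  # path of the Python binary in use
        "prefix": sys.prefix,          # root folder of the active environment
        "packages": versions,
    }


if __name__ == "__main__":
    info = describe_environment()
    print("Interpreter:", info["executable"])
    print("Environment:", info["prefix"])
    for name, version in info["packages"].items():
        print(f"{name}: {version or 'not installed'}")
```

If the environment path does not mention `synth-env`, the editor or notebook kernel is probably using a different interpreter than the one you installed the packages into.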
📊 How to install ydata-profiling and create a Profiling Report?
You may start by creating your virtual environment and installing the package:
conda create -n synth-env python=3.10
conda activate synth-env
pip install ydata-profiling==4.1.2
Then, in your Jupyter Notebook or other editor (e.g., PyCharm), load your pandas DataFrame as you normally would. Generating the profiling report is then straightforward:
import pandas as pd
from ydata_profiling import ProfileReport
# Read the data from a csv file (NIST "MA" data in the example)
df = pd.read_csv("ma2019.csv")
# Generate the data profiling report
original_report = ProfileReport(df, title='Original Data')
original_report.to_file("original_report.html")
You can then navigate the report to investigate the data quality issues it flags and study the basic descriptive statistics of your data!
- 📚 Examples with real-world datasets: A list of examples and data profiling reports and usage of ydata-profiling
- 🙇🏽♂️ Read the Docs: Documentation: from installation and quickstart to integrations and advanced usage
🤖 How to install ydata-synthetic and create a synthesizer?
You may use your previous virtual environment (`synth-env`). Activate it and install the package:
conda activate synth-env
pip install ydata-synthetic==1.1.0
Then, you can leverage one of the models available in the package. In this example, we will be using CTGAN:
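# Imports (pmlb provides the example 'adult' dataset)
from pmlb import fetch_data
from ydata_synthetic.synthesizers import ModelParameters, TrainParameters
from ydata_synthetic.synthesizers.regular import RegularSynthesizer
# Load data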
data = fetch_data('adult')
num_cols = ['age', 'fnlwgt', 'capital-gain', 'capital-loss', 'hours-per-week']
cat_cols = ['workclass','education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country', 'target']
# Defining the training and model parameters
batch_size = 500
epochs = 500+1
learning_rate = 2e-4
beta_1 = 0.5
beta_2 = 0.9
# Create and train the model
ctgan_args = ModelParameters(batch_size=batch_size,
lr=learning_rate,
betas=(beta_1, beta_2))
train_args = TrainParameters(epochs=epochs)
synth = RegularSynthesizer(modelname='ctgan', model_parameters=ctgan_args)
synth.fit(data=data, train_arguments=train_args, num_cols=num_cols, cat_cols=cat_cols)
# Generate new samples
synth_data = synth.sample(1000)
print(synth_data)
You can also check further examples with other models.
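Before running a full evaluation, a quick sanity check on the generated samples can be useful. The sketch below compares category frequencies between the real and synthetic data using the total variation distance. This is a simple stand-in check chosen for illustration, not the scoring used by `sdnist` and not a feature of `ydata-synthetic`; `category_gap` and `compare_categoricals` are hypothetical helper names.

```python
import pandas as pd


def category_gap(real: pd.Series, synth: pd.Series) -> float:
    """Total variation distance between the category frequencies of two columns.

    0.0 means identical frequencies; 1.0 means the columns share no categories.
    Values are cast to strings, so missing values count as their own category.
    """
    real_freq = real.astype(str).value_counts(normalize=True)
    synth_freq = synth.astype(str).value_counts(normalize=True)
    # Align on the union of categories; a category absent on one side counts as 0
    real_freq, synth_freq = real_freq.align(synth_freq, fill_value=0.0)
    return 0.5 * float((real_freq - synth_freq).abs().sum())


def compare_categoricals(real_df, synth_df, cat_cols):
    """Per-column gaps, largest first, to show where the synthesizer drifts most."""
    gaps = {col: category_gap(real_df[col], synth_df[col]) for col in cat_cols}
    return pd.Series(gaps, name="tv_distance").sort_values(ascending=False)


# Tiny illustrative frames (made-up values)
real_example = pd.DataFrame({"sex": ["M", "M", "F", "F"]})
synth_example = pd.DataFrame({"sex": ["M", "M", "M", "F"]})
print(compare_categoricals(real_example, synth_example, ["sex"]))
```

With the example above, you would call `compare_categoricals(data, synth_data, cat_cols)`. Columns with the largest gaps are reasonable candidates to inspect first when tuning model settings.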