
(SIGCOMM '22) Practical GAN-based Synthetic IP Header Trace Generation using NetShare

Primary language: Python. License: BSD 3-Clause Clear (BSD-3-Clause-Clear).


[paper (SIGCOMM 2022)] [talk (SIGCOMM 2022)] [talk (ZeekWeek 2022)] [talk (FloCon 2023)] [web service demo]

Authors: [Yucheng Yin] [Zinan Lin] [Minhao Jin] [Giulia Fanti] [Vyas Sekar]

Abstract: We explore the feasibility of using Generative Adversarial Networks (GANs) to automatically learn generative models to generate synthetic packet- and flow header traces for networking tasks (e.g., telemetry, anomaly detection, provisioning). We identify key fidelity, scalability, and privacy challenges and tradeoffs in existing GAN-based approaches. By synthesizing domain-specific insights with recent advances in machine learning and privacy, we identify design choices to tackle these challenges. Building on these insights, we develop an end-to-end framework, NetShare. We evaluate NetShare on six diverse packet header traces and find that: (1) across distributional metrics and traces, it achieves 46% more accuracy than baselines, and (2) it meets users’ requirements of downstream tasks in evaluating accuracy and rank ordering of candidate approaches.

News

[2023.04] Woohoo! New version released with a list of new features:

  • Bump Python version to 3.9
  • Replace tensorflow 1.15 with torch
  • Support generic dataset formats
  • Add SDMetrics for hyperparameter/model selection and data visualization

[2022.08]: The deprecated camera-ready branch holds the scripts we used to run all the experiments in the paper.

Users

NetShare has been used by several independent users/companies.

Datasets

We are adding more datasets! Feel free to add your own and contribute!

Our paper uses six public datasets for reproducibility. Please download the six datasets here and put them under traces/.

You may also refer to the README for detailed descriptions of the datasets.

Setup

Step 0: Install libpcap dependency (Optional)

If you are working with PCAP files and have not yet installed libpcap, install it first:

  • On MacOS, install using homebrew:
    brew install libpcap
  • On Debian-based systems (e.g., Ubuntu), install using apt:
    sudo apt install libpcap-dev
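
As a quick sanity check after the install commands above, the following minimal sketch asks Python's standard library whether a libpcap shared library can be located. The helper name find_libpcap is hypothetical and not part of NetShare. It only looks for the shared library, not the development headers, so a positive result does not guarantee that every PCAP-related build step will succeed.

```python
from ctypes.util import find_library


def find_libpcap():
    """Return the name or path of the libpcap shared library, or None if not found."""
    # find_library takes the library name without the "lib" prefix or file suffix.
    return find_library("pcap")


if __name__ == "__main__":
    location = find_libpcap()
    if location is None:
        print("libpcap not found; install it with brew or apt as described above")
    else:
        print(f"libpcap found: {location}")
```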

Step 1: Install NetShare Python package (Required)

We recommend installing NetShare in a virtual environment (e.g., Anaconda3). We test in a virtual environment with Python 3.9.

# Assume Anaconda is installed
# Create virtual environment if not exists
conda create --name NetShare python=3.9

# Activate virtual env
conda activate NetShare

# Install NetShare package
git clone https://github.com/netsharecmu/NetShare.git
pip3 install -e NetShare/

# Install SDMetrics package
git clone https://github.com/netsharecmu/SDMetrics_timeseries
pip3 install -e SDMetrics_timeseries/
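
To confirm that the editable installs above are visible from the active environment, a small check along these lines may help. It is a sketch, not part of NetShare: the helper name missing_packages is hypothetical, "netshare" is the import name used by the driver code in this README, and "torch" is checked because the 2023.04 release notes mention the switch to torch.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Run this inside the activated NetShare conda environment.
    missing = missing_packages(["netshare", "torch"])
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All checked packages are importable")
```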

Step 2: How to start Ray? (Optional but strongly recommended)

Ray is a unified framework for scaling AI and Python applications. Our framework utilizes Ray to increase parallelism and distribute workloads among the cluster automatically and efficiently.

Laptop/Single-machine (only recommended for demo/dev/fun)

ray start --head --port=6379 --include-dashboard=True --dashboard-host=0.0.0.0 --dashboard-port=8265

Please go to http://localhost:8265 to view the Ray dashboard.
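
If the dashboard does not load, it can be useful to check whether anything is listening on the ports passed to ray start above. The following sketch uses only the standard library; the helper name port_is_open is hypothetical. It tests plain TCP reachability and does not verify that the listener is actually Ray.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Ports taken from the `ray start` command above.
    for name, port in [("Ray head", 6379), ("Ray dashboard", 8265)]:
        state = "reachable" if port_is_open("localhost", port) else "not reachable"
        print(f"{name} on port {port}: {state}")
```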

Multi-machines (strongly recommended for faster training/generation)

We provide a utility script and README under util/ for setting up a Ray cluster. As a reference, we use Cloudlab, which is referred to as a "custom cluster" in the Ray documentation. If you are using a different cluster (e.g., AWS, GCP, Azure), please refer to the Ray documentation for details.

Example usage

We are adding more examples of usage (PCAP, NetFlow, w/ and w/o DP). Please stay tuned!

Here is a minimal working example that generates synthetic NetFlow files without differential privacy. Before running it, change your working directory with cd examples/<sub_example>.

You may refer to examples for more scripts and config files.

Driver code

import netshare.ray as ray
from netshare import Generator

if __name__ == '__main__':
    # Set to True to use Ray; leave as False to run without Ray
    ray.config.enabled = False
    ray.init(address="auto")

    # configuration file
    generator = Generator(config="config_example_netflow_nodp.json")

    # `work_folder` must not already exist; otherwise an overwrite error is thrown.
    # Please set `work_folder` to an *absolute path*
    # if you are using Ray with a multi-machine setup,
    # since Ray has bugs when dealing with relative paths.
    generator.train(work_folder='../../results/test-ugr16')
    generator.generate(work_folder='../../results/test-ugr16')
    generator.visualize(work_folder='../../results/test-ugr16')

    ray.shutdown()

The corresponding configuration file is config_example_netflow_nodp.json. You may refer to the README for detailed explanations of the configuration files.

After generation, you will be redirected to a dashboard that shows a side-by-side visual comparison between real and synthetic data.
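
The comments in the driver code ask for a work folder that does not yet exist and, on multi-machine Ray setups, an absolute path. One way to enforce both before training starts is sketched below. The helper name fresh_work_folder is hypothetical and not part of the NetShare API; it relies only on pathlib.

```python
from pathlib import Path


def fresh_work_folder(path):
    """Resolve `path` to an absolute path and fail early if it already exists."""
    folder = Path(path).expanduser().resolve()
    if folder.exists():
        raise FileExistsError(f"work_folder already exists: {folder}")
    return str(folder)


# Intended use in the driver code (shown as comments because it needs NetShare):
#   work_folder = fresh_work_folder("../../results/test-ugr16")
#   generator.train(work_folder=work_folder)
#   generator.generate(work_folder=work_folder)
#   generator.visualize(work_folder=work_folder)
```

Call the helper once and reuse the returned string for all three steps, since the folder will exist after training and a second call would raise.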

Codebase structure (for dev purpose)

├── doc                       # (tentative) NetShare tutorials and APIs
├── examples                  # Examples of using NetShare on different datasets
├── netshare                  # NetShare source code
│   ├── configs               # Default configurations  
│   ├── generators            # Generator class
│   ├── model_managers        # Core of NetShare service (i.e., train/generate)
│   ├── models                # Timeseries GAN models (e.g., DoppelGANger)
│   ├── pre_post_processors   # Pre- and post-process data
│   ├── ray                   # Ray functions overloading
│   └── utils                 # Utility functions/common class definitions
├── traces                    # Traces/datasets
└── util                      # MISC/setup scripts
    └── ray                   # Ray setup script

References

Please cite our paper/codebase appropriately if you find NetShare useful.

@inproceedings{netshare-sigcomm2022,
  author = {Yin, Yucheng and Lin, Zinan and Jin, Minhao and Fanti, Giulia and Sekar, Vyas},
  title = {Practical GAN-Based Synthetic IP Header Trace Generation Using NetShare},
  year = {2022},
  isbn = {9781450394208},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3544216.3544251},
  doi = {10.1145/3544216.3544251},
  abstract = {We explore the feasibility of using Generative Adversarial Networks (GANs) to automatically learn generative models to generate synthetic packet- and flow header traces for networking tasks (e.g., telemetry, anomaly detection, provisioning). We identify key fidelity, scalability, and privacy challenges and tradeoffs in existing GAN-based approaches. By synthesizing domain-specific insights with recent advances in machine learning and privacy, we identify design choices to tackle these challenges. Building on these insights, we develop an end-to-end framework, NetShare. We evaluate NetShare on six diverse packet header traces and find that: (1) across all distributional metrics and traces, it achieves 46% more accuracy than baselines and (2) it meets users' requirements of downstream tasks in evaluating accuracy and rank ordering of candidate approaches.},
  booktitle = {Proceedings of the ACM SIGCOMM 2022 Conference},
  pages = {458–472},
  numpages = {15},
  keywords = {privacy, synthetic data generation, network packets, network flows, generative adversarial networks},
  location = {Amsterdam, Netherlands},
  series = {SIGCOMM '22}
}

Part of the source code is adapted from the following open-source projects: