synthetic-dataset-generation
There are 295 repositories under the synthetic-dataset-generation topic.
microsoft/presidio
An open-source framework for detecting, redacting, masking, and anonymizing sensitive data (PII) across text, images, and structured data. Supports NLP, pattern matching, and customizable pipelines.
argilla-io/distilabel
Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.
Eladlev/AutoPrompt
A framework for prompt tuning using Intent-based Prompt Calibration
bespokelabsai/curator
Synthetic data curation for post-training and structured data extraction
datadreamer-dev/DataDreamer
DataDreamer: Prompt. Generate Synthetic Data. Train & Align Models. 🤖💤
Unity-Technologies/com.unity.perception
Perception toolkit for sim2real training and validation in Unity
BatsResearch/bonito
A lightweight library for generating synthetic instruction tuning datasets for your data without GPT.
magpie-align/magpie
[ICLR 2025] Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. Your efficient and high-quality synthetic data generation pipeline!
nicolas-hbt/pygraft
Configurable Generation of Synthetic Schemas and Knowledge Graphs at Your Fingertips
paulbricman/thisrepositorydoesnotexist
A curated list of awesome projects which use Machine Learning to generate synthetic content.
NVIDIA/Dataset_Synthesizer
NVIDIA Deep learning Dataset Synthesizer (NDDS)
remyxai/VQASynth
Compose multimodal datasets 🎹
sparkfish/augraphy
Augmentation pipeline for rendering synthetic paper printing, faxing, scanning and copy machine processes
stacklok/promptwright
Generate large synthetic datasets using an LLM
Unity-Technologies/SynthDet
SynthDet - An end-to-end object detection pipeline using synthetic data
zhenzhiwang/HumanVid
[NeurIPS D&B Track 2024] Official implementation of HumanVid
Unity-Technologies/PeopleSansPeople
Unity's privacy-preserving human-centric synthetic data generator
tirthajyoti/pydbgen
Random dataframe and database table generator
fjxmlzn/DoppelGANger
[IMC 2020 (Best Paper Finalist)] Using GANs for Sharing Networked Time Series Data: Challenges, Initial Promise, and Open Questions
davanstrien/awesome-synthetic-datasets
awesome synthetic (text) datasets
worldbank/REaLTabFormer
A suite of auto-regressive and Seq2Seq (sequence-to-sequence) transformer models for tabular and relational synthetic data generation.
firmai/datagene
DataGene - Identify How Similar TS Datasets Are to One Another (by @firmai)
KodCode-AI/kodcode
✨ A synthetic dataset generation framework that produces diverse coding questions and verifiable solutions - all in one framework
SqueezeAILab/LLM2LLM
[ACL 2024] LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement
rozumden/DeFMO
[CVPR 2021] DeFMO: Deblurring and Shape Recovery of Fast Moving Objects
PerceivingSystems/bedlam_render
BEDLAM (CVPR 2023) render pipeline tools
ALucek/QuicKB
Optimize Document Retrieval with Fine-Tuned KnowledgeBases
BhabhaAI/dataformer
Solving data for LLMs - Create quality synthetic datasets!
ViLab-UCSD/OpenRooms
The dataset and code release of the OpenRooms Dataset.
nupurkmr9/syncd
SynCD: Generating Multi-Image Synthetic Data for Text-to-Image Customization
NVIDIA/Dataset_Utilities
NVIDIA Dataset Utilities (NVDU)
isarandi/synthetic-occlusion
Synthetic Occlusion Augmentation
VinAIResearch/Dataset-Diffusion
Dataset Diffusion: Diffusion-based Synthetic Data Generation for Pixel-Level Semantic Segmentation (NeurIPS 2023)
jtheiner/LegoBrickClassification
Repository to identify Lego bricks automatically only using images
firmai/mtss-gan
MTSS-GAN: Multivariate Time Series Simulation with Generative Adversarial Networks (by @firmai)
netsharecmu/NetShare
(SIGCOMM '22) Practical GAN-based Synthetic IP Header Trace Generation using NetShare