dataset-generation
There are 689 repositories under the dataset-generation topic.
Kiln-AI/Kiln
The easiest tool for fine-tuning LLM models, synthetic data generation, and collaborating on datasets.
e-p-armstrong/augmentoolkit
Create Custom LLMs
nfstream/nfstream
NFStream: a Flexible Network Data Analysis Framework.
aitorzip/DeepGTAV
A plugin for GTAV that transforms it into a vision-based self-driving car research environment.
rodrigopivi/Chatito
🎯🗯 Dataset generation for AI chatbots, NLP tasks, named entity recognition or text classification models using a simple DSL!
aqeelanwar/MaskTheFace
Convert a face dataset to a masked-face dataset
DIYer22/bpycv
Computer vision utils for Blender (generate instance annotation, depth and 6D pose with one line of code)
remyxai/VQASynth
Compose multimodal datasets 🎹
HeegyuKim/open-korean-instructions
A collection of publicly available Korean instruction datasets for training language models.
SimGus/Chatette
A powerful dataset generator for Rasa NLU, inspired by Chatito
fjxmlzn/DoppelGANger
[IMC 2020 (Best Paper Finalist)] Using GANs for Sharing Networked Time Series Data: Challenges, Initial Promise, and Open Questions
radi-cho/datasetGPT
A command-line interface to generate textual and conversational datasets with LLMs.
facebookresearch/stopes
A library for preparing data for machine translation research (monolingual preprocessing, bitext mining, etc.) built by the FAIR NLLB team.
jabberjabberjabber/ImageIndexer
Creates an index of images, queries a local LLM and adds tags to the image metadata
davidmartinrius/speech-dataset-generator
🔊 Create labeled datasets, enhance audio quality, identify speakers, support diverse dataset types. 🎧👥📊 Advanced audio processing.
ylogx/aesthetics
Image Aesthetics Toolkit - includes Fisher Vector implementation, AVA (Image Aesthetic Visual Analysis) dataset and fast multi-threaded downloader
google/imageinwords
Data release for the ImageInWords (IIW) paper.
firmai/datagene
DataGene - Identify How Similar TS Datasets Are to One Another (by @firmai)
pprp/voc2007_for_yolo_torch
👊 Prepare VOC format datasets for ultralytics/yolov3 & yolov5
ZhangYuanhan-AI/Bamboo
[IJCV] Bamboo: 4 times larger than ImageNet; 2 times larger than Object365; Built by active learning.
hridaydutta123/the-youtube-scraper
Download YouTube video description and video comments without using the YouTube API.
seart-group/ghs
GitHub Search: Platform used to crawl, store and present projects from GitHub, as well as any statistics related to them
suvojit-0x55aa/celebA-HQ-dataset-download
Get started with the CelebA-HQ dataset in under 5 mins!
AlvaroCavalcante/auto_annotate
Labeling is boring. Use this tool to speed up your next object detection project!
CAS-SIAT-XinHai/CPsyCoun
[ACL 2024] CPsyCoun: A Report-based Multi-turn Dialogue Reconstruction and Evaluation Framework for Chinese Psychological Counseling
asaparov/prontoqa
Synthetic question-answering dataset to formally analyze the chain-of-thought output of large language models on a reasoning task.
yc9701/pansori
Tools for ASR Corpus Generation from Online Video
futianfan/clinical-trial-outcome-prediction
Benchmark dataset and deep learning method (Hierarchical Interaction Network, HINT) for clinical trial approval probability prediction, published in Cell Patterns 2022.
AgaMiko/pixel_character_generator
Generating retro pixel game characters with Generative Adversarial Networks. Dataset "TinyHero" included.
codelion/pts
Pivotal Token Search
rioharper/VocalForge
Your one-stop solution for voice dataset creation
ZhuLinsen/FastDatasets
A powerful tool for creating high-quality training datasets for Large Language Models (LLMs) (a tool for quickly generating high-quality LLM fine-tuning training datasets)
MatteoGuadrini/pyreports
pyreports is a Python library that allows you to create complex reports from various sources
jim-schwoebel/download_audioset
📁 This repo makes it easy to download the raw audio files from AudioSet (32.45 GB, 632 classes).
cashiwamochi/RealEstate10K_Downloader
These scripts are used to download the RealEstate10K dataset.
Spphire/RM-labeling-tool
It's a Unity-based simulator for RoboMaster. You can use it to generate labeled datasets for deep learning.