training-data
There are 167 repositories under the training-data topic.
snorkel-team/snorkel
A system for quickly generating training data with weak supervision
diffgram/diffgram
The AI Datastore for Schemas, BLOBs, and Predictions. Use with your apps or integrate built-in Human Supervision, Data Workflow, and UI Catalog to get the most value out of your AI Data.
ydataai/ydata-synthetic
Synthetic data generators for tabular and time-series data
NorskRegnesentral/skweak
skweak: A software toolkit for weak supervision applied to NLP tasks
OvidijusParsiunas/myvision
Computer vision based ML training data generation tool 🚀
alteryx/compose
A machine learning tool for automated prediction engineering. It allows you to easily structure prediction problems and generate labels for supervised learning.
a-maliarov/amazoncaptcha
Pure Python, lightweight, Pillow-based solver for Amazon's text captcha.
sparkfish/augraphy
Augmentation pipeline for rendering synthetic paper printing, faxing, scanning and copy machine processes
Slava/label-tool
Web application for image labeling and segmentation
d5555/TagEditor
🏖TagEditor - Annotation tool for spaCy
Geocene/trainset
A lightweight web application for brushing labels onto time series data; useful for building training sets.
KennethEnevoldsen/augmenty
Augmenty is an augmentation library based on spaCy for augmenting texts.
tzano/fountain
Natural Language Data Augmentation Tool for Conversational Systems
avinashsen707/AUBOi5-D435-ROS-DOPE
Aubo i5 Dual Arm Collaborative Robot - RealSense D435 - 3D Object Pose Estimation - ROS
enginBozkurt/carla-training-data
Generating training data from the Carla driving simulator in the KITTI dataset format
rahul051296/small-talk-rasa-stack
Collection of casual conversations that can be used with the Rasa Stack
google-research-datasets/swim-ir
SWIM-IR is a Synthetic Wikipedia-based Multilingual Information Retrieval training set with 28 million query-passage pairs spanning 33 languages, generated using PaLM 2 and summarize-then-ask prompting.
hernanmd/COVID-19-train-audio
COVID-19 cough files for training AI models
ableinc/git2txt
Convert all files in a Git repository to .txt files. Useful for training LLMs on your codebase.
milangritta/Pragmatic-Guide-to-Geoparsing-Evaluation
Full resources supporting the publication "A Pragmatic Guide to Geoparsing Evaluation."
megagonlabs/ruler
Data Programming by Demonstration (DPBD) for Document Classification
InstaPy/instapy-gender-classification
🔎 Classification helper for the sex classification feature of InstaPy
benbo/interactive-weak-supervision
Interactive Weak Supervision: Learning Useful Heuristics for Data Labeling
rhammell/planesnet
Labeled training data for detection of aircraft in Planet satellite imagery
alexkalinins/hairnet-ai
Machine Learning project aimed at converting images into .obj 3D models by representing them as Blender hair-type particle systems.
ajsanjoaquin/Shapley_Valuation
PyTorch reimplementation of computing Shapley values via Truncated Monte Carlo sampling from "What is your data worth? Equitable Valuation of Data" by Amirata Ghorbani and James Zou [ICML 2019]
trainingdata/AIAssistedImageVideoLabelling
AI Assisted Image and Video Training Data Labeling @ Scale
abinashmeher999/voice-data-extract
A command line interface to combine text information from subtitles with voice data in the video. Provides a convenient way to generate training data for speech-recognition purposes.
dterg/biomedical_corpora
Table compiling the list of biomedically related corpora available for named entity recognition (some are also suitable for association detection). The first version was published as part of the paper: Dieter Galea, Ivan Laponogov, Kirill Veselkov; Exploiting and assessing multi-source data for supervised biomedical named entity recognition, Bioinformatics, bty152, https://doi.org/10.1093/bioinformatics/bty152 . If you would like to add other (or your) corpora, please submit a pull request and I'll happily approve it.
MinhasKamal/AlphabetRecognizer
Simple Optical Character Recognizer (english-ocr-image-to-text-recognition-sample-training-alphabet-photo-data-database-dataset)
wakakalu/TransE
A simple implementation of TransE, the ML algorithm published in 2013
bot-astro/gpt-3-training-data
A set of questions & answers used to train a ChatGPT model.
stritti/thermal-solar-plant-dataset
Realtime Thermal Solar Plant Dataset for Machine Learning
hou2zi0/minimal-RTE__ner-training-data
Minimal customization of Quill.js Rich Text Editor for easy annotation of text snippets for NER model training with spaCy.
StevePny/DataAssimBench
Benchmarking tools for applying AI/ML to data assimilation
MaaAssistantArknights/ArknightsTrainingData
Machine learning training data for Arknights