Curated list of open source tooling for data-centric AI on unstructured data.
CC-BY-4.0
Awesome open data-centric AI
Open source tooling for data-centric AI on unstructured data
Data-centric AI (DCAI) is a development paradigm for ML-based solutions. The term was coined by Andrew Ng, who gave the following definition:
Data-centric AI is the practice of systematically engineering the data used to build AI systems.
At Renumics, we believe DCAI is an important puzzle piece for building real-world AI systems that generate value. We like the following definition:
Data-centric AI means improving training datasets systematically and iteratively by leveraging information from trained ML models.
Tools that can be efficiently used in day-to-day applications are the most important ingredient for the DCAI paradigm. This curated link collection is intended to help you discover useful open source tools for your data-centric AI workflows.
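To make the second definition above more concrete, here is a minimal sketch of one way a trained model's output can feed back into the dataset: samples are ranked by how little probability the model assigns to their given label, so the most suspicious ones can be reviewed first. The function name `rank_by_self_confidence` and the toy data are hypothetical and chosen for illustration only; this is not the method of any particular tool in this list.

```python
def rank_by_self_confidence(probs, labels):
    """Return sample indices ordered from least to most model-supported label.

    probs:  list of per-class probability lists from a trained classifier
    labels: list of integer class labels as stored in the dataset
    """
    # Self-confidence: the probability the model assigns to the given label.
    scores = [p[y] for p, y in zip(probs, labels)]
    # Low scores mean model and label disagree, so those samples come first.
    return sorted(range(len(labels)), key=lambda i: scores[i])


# Hypothetical toy data: four samples, two classes.
probs = [
    [0.90, 0.10],
    [0.20, 0.80],
    [0.95, 0.05],
    [0.60, 0.40],
]
labels = [0, 1, 1, 0]

review_order = rank_by_self_confidence(probs, labels)
print(review_order)  # indices of the samples to inspect, most suspicious first
```

In an iterative workflow, the top-ranked samples would be inspected, corrected or removed, and the model retrained before repeating the ranking. A low score can also indicate a hard but correctly labeled sample, so the ranking is a starting point for review rather than a verdict.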
🔎 Scope
This collection includes useful tools that have an open-source license and are actively maintained. All tools mentioned are useful for building DCAI workflows on unstructured data (e.g. images, audio, video, time-series, text).
To keep the list focused and to avoid duplicating work, we exclude the following topics:
DCAI tools for tabular data. The Ydata team maintains an awesome list on that topic.
Labeling tools. Although labeling is part of the DCAI workflow, we refer to the ZenML team's awesome list on that topic.
MLOps tooling. There are many gray areas between MLOps and DCAI, and some distinctions have yet to be made. We exclude all topics that are clearly outside the DCAI scope (e.g. AutoML, serving, orchestration).