Datamon is a data science tool sponsored by OneConcern that helps manage data at scale.
The primary goal of datamon is to manage versioned data at rest, providing CLI tools for creation, access and tracking in an environment where data repositories and their lifecycles are linked.
Datamon links the various sources of data with how they are processed, and tracks the new data generated from existing data.
More on design and architecture.
Although flexible in its concepts and architecture, the current version of datamon is primarily developed and tested against the Google Cloud environment. Note that AWS S3 storage buckets are supported (see datamover tool).
Datamon supports the following cloud storage backends:
- Google Cloud Storage
- AWS S3
- Repo: analogous to a git repo. A repo in datamon is a dataset that has a unified lifecycle.
- Bundle: a bundle is a point-in-time, read-only view of a repo:branch and is composed of individual files. Analogous to a commit in git.
- Label: a name given to a bundle, analogous to tags in git. Examples: Latest, production.
- Context: a context provides a way to define multiple instances of datamon.
- Write Ahead Log: a WAL tracks data updates and their ordering.
- Read Log: this logs all read operations, with their originator.
- Authentication: datamon keeps track of who contributed what, when and in which order (WAL) and who accessed what (Read Log).
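To make the relationships between these concepts concrete, here is a minimal, illustrative sketch in Python. It is not datamon's implementation or API: the class names, fields, and methods (`Repo`, `Bundle`, `commit`, `set_label`, `read`) are hypothetical and exist only to show how a repo, its bundles, labels, write-ahead log and read log might relate to one another. Contexts are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Bundle:
    """Hypothetical model of a bundle: an immutable, point-in-time set of files."""
    bundle_id: str
    files: Tuple[str, ...]
    contributor: str


@dataclass
class Repo:
    """Hypothetical model of a repo: a dataset with a unified lifecycle."""
    name: str
    bundles: Dict[str, Bundle] = field(default_factory=dict)
    labels: Dict[str, str] = field(default_factory=dict)  # label name -> bundle id
    wal: List[str] = field(default_factory=list)  # ordered record of updates
    read_log: List[Tuple[str, str]] = field(default_factory=list)  # (reader, bundle id)

    def commit(self, bundle: Bundle) -> None:
        # Record the new bundle and append the update to the ordered log.
        self.bundles[bundle.bundle_id] = bundle
        self.wal.append(f"{bundle.contributor} added {bundle.bundle_id}")

    def set_label(self, label: str, bundle_id: str) -> None:
        # A label is just a name pointing at an existing bundle, like a git tag.
        if bundle_id not in self.bundles:
            raise KeyError(bundle_id)
        self.labels[label] = bundle_id

    def read(self, label: str, reader: str) -> Bundle:
        # Resolve the label, then log who read which bundle.
        bundle = self.bundles[self.labels[label]]
        self.read_log.append((reader, bundle.bundle_id))
        return bundle


# Example usage with made-up names.
repo = Repo("example-repo")
repo.commit(Bundle("bundle-1", ("input.csv",), "alice"))
repo.set_label("latest", "bundle-1")
bundle = repo.read("latest", reader="bob")
```

In this sketch the bundle is frozen to mirror the read-only property described above, while the write-ahead log and read log are append-only lists that record contributions and accesses in order.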
Please follow the installation instructions.
Datamon comes as a CLI tool: see usage.
- ARGO ML pipeline
- Datamover container guide
- Datamon as sidecar
- Kubernetes integration
Please file GitHub issues for feature requests or bug reports.
Please read our contributing guidelines.
Datamon is developed by OneConcern Inc. under the MIT license.