Make sure your gcloud credentials have been set up:
```bash
gcloud auth application-default login
```
Download the datamon binary for Mac or Linux from the Releases Page. Example:
```bash
tar -zxvf datamon.mac.tgz
```
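A minimal follow-up sketch to put the binary on your `PATH`, assuming the extracted archive contains a single binary named `datamon` and that `/usr/local/bin` is on your `PATH` (both are assumptions about your environment, not requirements of the release):
```bash
# Assumption: the archive extracts to a single binary named `datamon`
chmod +x datamon
sudo mv datamon /usr/local/bin/

# Verify the binary is reachable
datamon --help
```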
Configure datamon (for non-Kubernetes use). Example:
```bash
# Replace the path to the gcloud credential file. Use an absolute path.
datamon config create --email ritesh@oneconcern.com --name "Ritesh H Shukla" --credential /Users/ritesh/.config/gcloud/application_default_credentials.json
```
Configure datamon inside a pod. Datamon will use the Kubernetes service credentials.
```bash
~/datamon config create --name "Ritesh Shukla" --email ritesh@oneconcern.com
```
Check the config file; the credential file will not be set in a Kubernetes deployment.
```yaml
# cat ~/.datamon/datamon.yaml
metadata: datamon-meta-data
blob: datamon-blob-data
email: ritesh@oneconcern.com
name: Ritesh H Shukla
credential: /Users/ritesh/.config/gcloud/application_default_credentials.json
```
Create a repo (analogous to a git repo):
```bash
datamon repo create --description "Ritesh's repo for testing" --repo ritesh-datamon-test-repo
```
Upload a bundle. The last line prints the commit hash, which is needed to download the bundle:
```bash
# datamon bundle upload --path /path/to/data/folder --message "The initial commit for the repo" --repo ritesh-test-repo
Uploaded bundle id:1INzQ5TV4vAAfU2PbRFgPfnzEwR
```
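Since the upload prints the bundle id on its last line, it can be captured for later commands. A hedged sketch, assuming the output format shown above (`Uploaded bundle id:<id>`) and placeholder paths:
```bash
# Upload and keep only the id from the last line of output
# (the output format is assumed from the example above)
BUNDLE_ID=$(datamon bundle upload \
  --path /path/to/data/folder \
  --message "The initial commit for the repo" \
  --repo ritesh-test-repo | tail -n 1 | sed 's/^Uploaded bundle id://')

# Reuse the captured id to download the same bundle
datamon bundle download \
  --repo ritesh-test-repo \
  --destination /path/to/folder/to/download \
  --bundle "$BUNDLE_ID"
```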
List bundles in a repo:
```bash
# datamon bundle list --repo ritesh-test-repo
Using config file: /Users/ritesh/.datamon/datamon.yaml
1INzQ5TV4vAAfU2PbRFgPfnzEwR , 2019-03-12 22:10:24.159704 -0700 PDT , Updating test bundle
```
Download a bundle:
```bash
datamon bundle download --repo ritesh-test-repo --destination /path/to/folder/to/download --bundle 1INzQ5TV4vAAfU2PbRFgPfnzEwR
```
List all files in a bundle:
```bash
datamon bundle list files --repo ritesh-test-repo --bundle 1ISwIzeAR6m3aOVltAsj1kfQaml
```
Download a single file from a bundle:
```bash
datamon bundle download file --file datamon/cmd/repo_list.go --repo ritesh-test-repo --bundle 1ISwIzeAR6m3aOVltAsj1kfQaml --destination /tmp
```
Please file GitHub issues for desired features as well as any bugs encountered.
Datamon is a data science tool that helps manage data at scale. The primary goal of datamon is to enable versioned data creation, access, and tracking in an environment where data repositories and their lifecycles are linked.
Datamon links the various sources of data and how they are processed, and tracks the new data that is generated from the existing data.
Datamon is composed of:
- Data storage
- Data access layer
  - CLI
  - FUSE
  - SDK-based tools
- Data consumption integrations
  - CLI
  - Kubernetes integration
  - GIT LFS
  - Jupyter notebook
  - JWT integration
Datamon includes:
- Blob storage: a deduplicated storage layer for raw data
- Metadata storage: a metadata storage and query layer
- External storage: pluggable storage sources that are referenced in bundles
For blob and metadata storage, datamon guarantees geo-redundant replication of data and can withstand region-level failures.
For external storage, redundancy and accessibility vary depending on the external source.
- Repo: Analogous to a git repo. A repo in datamon is a dataset that has a unified lifecycle.
- Branch: A branch represents the various lifecycles data might undergo within a repo.
- Bundle: A bundle is a point-in-time, read-only view of a repo:branch and is composed of individual files. Analogous to a commit in git.
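To make the concepts concrete, here is how they map onto the CLI commands shown earlier (repo names, messages, and paths are placeholders):
```bash
# Repo: a dataset with a unified lifecycle
datamon repo create --description "Example repo" --repo example-repo

# Bundle: a point-in-time, read-only view of the repo, analogous to a git commit
datamon bundle upload --path /path/to/data --message "first commit" --repo example-repo

# Bundles are addressed by the id printed at upload time
datamon bundle list --repo example-repo
```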
The data access layer is implemented in three form factors:
- CLI: Datamon can be used as a standalone CLI, provided the developer has access privileges to the backend storage. A developer can always set up datamon to host their own private instance for managing and tracking their own data.
- Filesystem: A bundle can be mounted as a filesystem on Linux or Mac, and new bundles can be generated as well (see the sketch after this list).
- Specialized tooling: Custom tools can be written for specific use cases. Example: parallel ingest into a bundle for highly scaled-out throughput.
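For the filesystem form factor, the sketch below only illustrates the idea: the `bundle mount` subcommand and its flags are an assumption based on the FUSE integration described above, not taken from this document, so check `datamon bundle --help` for the actual interface.
```bash
# Assumption: a FUSE mount subcommand exists with flags similar to `bundle download`
datamon bundle mount \
  --repo ritesh-test-repo \
  --bundle 1ISwIzeAR6m3aOVltAsj1kfQaml \
  --mount /mnt/bundle

# Once mounted, the bundle's files can be read like any other directory
ls /mnt/bundle
```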
Datamon integrates with Kubernetes to give pods access to data and to synchronize pod execution based on data dependencies. Datamon also caches data within the cluster and informs pod placement based on cache locality.
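As an illustration of the in-cluster flow, the configuration step from above can be run inside a pod (the pod name is hypothetical); datamon then uses the Kubernetes service credentials, so no `--credential` flag is needed:
```bash
# Hypothetical pod name; the command itself is the in-pod configuration shown earlier
kubectl exec -it datamon-example-pod -- \
  datamon config create --name "Ritesh Shukla" --email ritesh@oneconcern.com
```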
Datamon will act as a backend for [GIT LFS](oneconcern#79).
Datamon allows Jupyter notebooks to read bundles in a repo, process them, and create new bundles based on the data generated.
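A minimal sketch of that notebook flow, using shell escapes from a notebook cell (the repo, bundle id, and paths are placeholders reused from the CLI examples above):
```bash
# Inside a Jupyter cell, prefix shell commands with `!`
# Pull an existing bundle into the notebook's workspace
!datamon bundle download --repo ritesh-test-repo --destination /tmp/input --bundle 1INzQ5TV4vAAfU2PbRFgPfnzEwR

# ... process /tmp/input and write results to /tmp/output ...

# Publish the generated data as a new bundle
!datamon bundle upload --path /tmp/output --message "Results from notebook run" --repo ritesh-test-repo
```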
The Datamon API/tooling can be used to write custom services that ingest large data sets into datamon. These services can be deployed in Kubernetes to manage long-running ingests.
This was used to move data from AWS to GCP.
Datamon can serve bundles as well as consume data that is authenticated via JWT.