Make sure your gcloud credentials have been set up:
```bash
gcloud auth application-default login
```
Download the datamon binary for Mac or Linux from the Releases Page. Example:
```bash
tar -zxvf datamon.mac.tgz
```
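A minimal follow-up sketch to put the binary on your `PATH`, assuming the extracted archive contains a single binary named `datamon` and that `/usr/local/bin` is on your `PATH` (both are assumptions about your environment, not requirements of the release):
```bash
# Assumption: the archive extracts to a single binary named `datamon`
chmod +x datamon
sudo mv datamon /usr/local/bin/

# Verify the binary is reachable
datamon --help
```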
Configure datamon (for non-Kubernetes use). Example:
```bash
# Replace the path to the gcloud credential file. Use an absolute path.
datamon config create --email ritesh@oneconcern.com --name "Ritesh H Shukla" --credential /Users/ritesh/.config/gcloud/application_default_credentials.json
```
Configure datamon inside a pod. Datamon will use the Kubernetes service credentials.
```bash
~/datamon config create --name "Ritesh Shukla" --email ritesh@oneconcern.com
```
Check the config file; the credential file will not be set in a Kubernetes deployment.
```yaml
# cat ~/.datamon/datamon.yaml
metadata: datamon-meta-data
blob: datamon-blob-data
email: ritesh@oneconcern.com
name: Ritesh H Shukla
credential: /Users/ritesh/.config/gcloud/application_default_credentials.json
```
Create a repo (analogous to a git repo):
```bash
datamon repo create --description "Ritesh's repo for testing" --repo ritesh-datamon-test-repo
```
Upload a bundle. The last line prints the commit hash, which is needed to download the bundle:
```bash
# datamon bundle upload --path /path/to/data/folder --message "The initial commit for the repo" --repo ritesh-test-repo
Uploaded bundle id:1INzQ5TV4vAAfU2PbRFgPfnzEwR
```
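Since the upload prints the bundle id on its last line, it can be captured for later commands. A hedged sketch, assuming the output format shown above (`Uploaded bundle id:<id>`) and placeholder paths:
```bash
# Upload and keep only the id from the last line of output
# (the output format is assumed from the example above)
BUNDLE_ID=$(datamon bundle upload \
  --path /path/to/data/folder \
  --message "The initial commit for the repo" \
  --repo ritesh-test-repo | tail -n 1 | sed 's/^Uploaded bundle id://')

# Reuse the captured id to download the same bundle
datamon bundle download \
  --repo ritesh-test-repo \
  --destination /path/to/folder/to/download \
  --bundle "$BUNDLE_ID"
```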
List bundles in a repo:
```bash
# datamon bundle list --repo ritesh-test-repo
Using config file: /Users/ritesh/.datamon/datamon.yaml
1INzQ5TV4vAAfU2PbRFgPfnzEwR , 2019-03-12 22:10:24.159704 -0700 PDT , Updating test bundle
```
Download a bundle:
```bash
datamon bundle download --repo ritesh-test-repo --destination /path/to/folder/to/download --bundle 1INzQ5TV4vAAfU2PbRFgPfnzEwR
```
List all files in a bundle:
```bash
datamon bundle list files --repo ritesh-test-repo --bundle 1ISwIzeAR6m3aOVltAsj1kfQaml
```
Download a single file from a bundle:
```bash
datamon bundle download file --file datamon/cmd/repo_list.go --repo ritesh-test-repo --bundle 1ISwIzeAR6m3aOVltAsj1kfQaml --destination /tmp
```
Please file GitHub issues for desired features as well as any bugs encountered.
Datamon is a data science tool that helps manage data at scale. The primary goal of datamon is to enable versioned data creation, access, and tracking in an environment where data repositories and their lifecycles are linked.
Datamon links the various sources of data and how they are processed, and tracks the new data that is generated from the existing data.
Datamon is composed of:
- Data storage
- Data access layer
  - CLI
  - FUSE
  - SDK-based tools
- Data consumption integrations
  - CLI
  - Kubernetes integration
  - GIT LFS
  - Jupyter notebook
  - JWT integration
Datamon includes:
- Blob storage: a deduplicated storage layer for raw data
- Metadata storage: a metadata storage and query layer
- External storage: pluggable storage sources that are referenced in bundles
For blob and metadata storage, datamon guarantees geo-redundant replication of data and can withstand region-level failures.
For external storage, redundancy and accessibility vary depending on the external source.
- Repo: Analogous to a git repo. A repo in datamon is a dataset that has a unified lifecycle.
- Branch: A branch represents the various lifecycles data might undergo within a repo.
- Bundle: A bundle is a point-in-time, read-only view of a repo:branch and is composed of individual files. Analogous to a commit in git.
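To make the concepts concrete, here is how they map onto the CLI commands shown earlier (repo names, messages, and paths are placeholders):
```bash
# Repo: a dataset with a unified lifecycle
datamon repo create --description "Example repo" --repo example-repo

# Bundle: a point-in-time, read-only view of the repo, analogous to a git commit
datamon bundle upload --path /path/to/data --message "first commit" --repo example-repo

# Bundles are addressed by the id printed at upload time
datamon bundle list --repo example-repo
```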
The data access layer is implemented in three form factors:
- CLI: Datamon can be used as a standalone CLI, provided the developer has access privileges to the backend storage. A developer can always set up datamon to host their own private instance for managing and tracking their own data.
- Filesystem: A bundle can be mounted as a filesystem on Linux or Mac, and new bundles can be generated as well (see the sketch after this list).
- Specialized tooling: Custom tools can be written for specific use cases. Example: parallel ingest into a bundle for highly scaled-out throughput.
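For the filesystem form factor, the sketch below only illustrates the idea: the `bundle mount` subcommand and its flags are an assumption based on the FUSE integration described above, not taken from this document, so check `datamon bundle --help` for the actual interface.
```bash
# Assumption: a FUSE mount subcommand exists with flags similar to `bundle download`
datamon bundle mount \
  --repo ritesh-test-repo \
  --bundle 1ISwIzeAR6m3aOVltAsj1kfQaml \
  --mount /mnt/bundle

# Once mounted, the bundle's files can be read like any other directory
ls /mnt/bundle
```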
Datamon integrates with Kubernetes to give pods access to data and to synchronize pod execution based on data dependencies. Datamon also caches data within the cluster and informs pod placement based on cache locality.
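As an illustration of the in-cluster flow, the configuration step from above can be run inside a pod (the pod name is hypothetical); datamon then uses the Kubernetes service credentials, so no `--credential` flag is needed:
```bash
# Hypothetical pod name; the command itself is the in-pod configuration shown earlier
kubectl exec -it datamon-example-pod -- \
  datamon config create --name "Ritesh Shukla" --email ritesh@oneconcern.com
```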
Datamon will act as a backend for [GIT LFS](oneconcern#79).
Datamon allows Jupyter notebooks to read bundles in a repo, process them, and create new bundles based on the data generated.
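A minimal sketch of that notebook flow, using shell escapes from a notebook cell (the repo, bundle id, and paths are placeholders reused from the CLI examples above):
```bash
# Inside a Jupyter cell, prefix shell commands with `!`
# Pull an existing bundle into the notebook's workspace
!datamon bundle download --repo ritesh-test-repo --destination /tmp/input --bundle 1INzQ5TV4vAAfU2PbRFgPfnzEwR

# ... process /tmp/input and write results to /tmp/output ...

# Publish the generated data as a new bundle
!datamon bundle upload --path /tmp/output --message "Results from notebook run" --repo ritesh-test-repo
```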
The Datamon API/tooling can be used to write custom services that ingest large data sets into datamon. These services can be deployed in Kubernetes to manage long-running ingests.
This was used to move data from AWS to GCP.
Datamon can serve bundles as well as consume data that is authenticated via JWT.