
Cloud neutral datalake definition file format and management CLI

By introducing a specific format for defining the shape of your datalake, we can use simple templating mechanisms to generate the artifacts needed to publish the datalake to a cloud vendor. Unlike other templating options, this does not restrict the shape of the datalake: you may add as many datalake paths (buckets, containers, filesystems, etc.) and as many datalake roles to the definition as you like. The resulting artifacts can be further transformed into Terraform scripts, used with cloud vendor APIs or CLIs, or copied and pasted manually.
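To make the templating idea concrete, here is a minimal sketch that renders an AWS policy document from a storage path. The template text, the choice of S3 actions and the function name `render_policy` are assumptions made for illustration; they are not the templates shipped with this project.

```python
import json
from string import Template

# Hypothetical template: grants object read/write below one storage location.
POLICY_TEMPLATE = Template("""{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::${storage_location}/*"
    }
  ]
}""")


def render_policy(path):
    """Fill the template for a datalake path such as '/ljm/logs'."""
    # Strip the leading slash so "/ljm/logs" becomes "ljm/logs".
    text = POLICY_TEMPLATE.substitute(storage_location=path.lstrip("/"))
    # Parse the result to confirm the rendered text is well-formed JSON.
    return json.loads(text)
```

Because the template is plain text, the same definition could in principle be rendered into other artifact formats by swapping the template rather than the definition.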

The same file can then be used by additional consumers to map users to the most appropriate IAM roles based on identity and the paths being accessed, to provide meaningful visualizations of the datalake for use in UIs, and so on.
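As a rough sketch of what such a consumer might do, the snippet below splits permission strings of the form used in the example definition in this document (for instance `storage:read-write:LOGS_LOCATION_BASE`) and looks up the named path. The in-memory dictionary and the function name `resolve_permission` are hypothetical; the tool's own data model may differ.

```python
# Hypothetical in-memory view of the named paths; values copied from the example DDF.
PATHS = {
    "STORAGE_LOCATION_BASE": "/ljm",
    "DATALAKE_BUCKET": "/ljm/data",
    "RANGER_AUDIT_LOCATION": "/ljm/ranger/audit",
    "LOGS_LOCATION_BASE": "/ljm/logs",
}


def resolve_permission(permission, paths=PATHS):
    """Split 'domain:level[:target]' and resolve the target to a path if it is a known path name."""
    parts = permission.split(":")
    domain, level = parts[0], parts[1]
    target = parts[2] if len(parts) > 2 else None
    return {
        "domain": domain,
        "level": level,
        "target": target,
        # None when there is no target or the target is not a storage path name.
        "path": paths.get(target) if target else None,
    }
```

A permission without a target, such as `sts:assume-roles`, simply resolves with `path` set to `None`.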

Your datalake definitions may also be committed to SCM systems so that they can be shared and maintained by multiple admins over the lifetime of your datalakes.


  1. Clone this project to your local machine

  2. cd datalake-def

  3. Install the Python modules needed by the app

     python setup.py install
  4. ./ddt.py "set debug true" "new_datalake --name {name}" "build_datalake [--name {name}] --cloud AWS" quit
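If you want to drive the last step from a script, one possible approach is to build the argument list in Python and hand it to `subprocess`. This is only a sketch: the helper names `build_ddt_command` and `run_ddt` are made up for illustration, and it assumes `ddt.py` is executable in the current directory.

```python
import subprocess


def build_ddt_command(name, cloud, script="./ddt.py"):
    """Return the argument list for one non-interactive ddt.py run."""
    return [
        script,
        "set debug true",
        f"new_datalake --name {name}",
        f"build_datalake --name {name} --cloud {cloud}",
        "quit",
    ]


def run_ddt(name, cloud):
    # check=True raises CalledProcessError if the process exits with a non-zero status.
    return subprocess.run(build_ddt_command(name, cloud), check=True)
```

Each quoted command becomes one list element, so no shell quoting is needed.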


 lmccay@strange:~/Projects/datalake-def$ ./ddt.py
 ddt> help -v
 Documented commands (use 'help -v' for verbose/'help <topic>' for details):
 add_path            Add a Storage Path to the Definition.
 add_role            Add a DataLake Role to the Definition.
 alias               Manage aliases
 build_datalake      Build IAM artifacts for given vendor from the DataLake DDF.
 edit                Run a text editor and optionally open a file with it
 help                List available commands or provide detailed help for a specific command
 history             View, run, edit, save, or clear previously entered commands
 load_datalake       Load a DataLake from persisted DDF.
 macro               Manage macros
 new_datalake        Create a new DataLake DDF.
 paths               List all Storage paths in the Definition.
 push_datalake       Publish IAM artifacts and buckets to given vendor.
 py                  Invoke Python command or shell
 quit                Exit this application
 recall_datalake     Unpublish IAM artifacts and buckets from given vendor.
 roles               List all DataLake Roles in the Definition.
 run_pyscript        Run a Python script file inside the console
 run_script          Run commands in script file that is encoded as either ASCII or UTF-8 text
 save_datalake       Persist a DataLake DDF.
 set                 Set a settable parameter or show current settings of parameters
 shell               Execute a command as if at the OS prompt
 shortcuts           List available shortcuts

Example use:

 lmccay@strange:~/Projects/datalake-def$ ./ddt.py "set debug true" "new_datalake -n ljm" "build_datalake -c AWS" quit
 debug - was: False
 now: True
 datalake: ljm
     iam_role: cdp-ljm-admin-s3-role
     permissions: ['storage:full-access:STORAGE_LOCATION_BASE']
     iam_role: cdp-ljm-idbroker-assume-role
     instance_profile: true
     permissions: ['sts:assume-roles']
     iam_role: cdp-ljm-log-role
     instance_profile: true
     permissions: ['storage:read-write:LOGS_LOCATION_BASE']
     iam_role: cdp-ljm-ranger-audit-s3-role
     permissions: ['storage:full-object-access:RANGER_AUDIT_LOCATION', 'storage:list-only:DATALAKE_BUCKET']
 nosql: {TABLE_NAME: ljm}
     full-access: {description: the force, rank: 1}
     full-object-access: {description: jedi master, rank: 2}
     list-only: {description: youngling, rank: 5}
     read-only: {description: padawan, rank: 4}
     read-write: {description: jedi knight, rank: 3}
     assume-roles: {description: shapeshifter, rank: 1}
   DATALAKE_BUCKET: {path: /ljm/data}
   LOGS_BUCKET: {path: /ljm}
   LOGS_LOCATION_BASE: {path: /ljm/logs}
   RANGER_AUDIT_LOCATION: {path: /ljm/ranger/audit}
   STORAGE_LOCATION_BASE: {path: /ljm}
 Cloud Type: AWS
 Vendor: Amazon
 Building Amazon Cloud artifacts for datalake named: ljm...
 The datalake role: IDBROKER_ROLE is assigned the iam role: cdp-ljm-idbroker-assume-role which has been granted: assumeRoles
 The datalake role: LOG_ROLE is assigned the iam role: cdp-ljm-log-role which has been granted: read-write for path: /ljm/logs
 The datalake role: RANGER_AUDIT_ROLE is assigned the iam role: cdp-ljm-ranger-audit-s3-role which has been granted: full-object-access for path: /ljm/ranger/audit
 The datalake role: RANGER_AUDIT_ROLE is assigned the iam role: cdp-ljm-ranger-audit-s3-role which has been granted: list-only for path: /ljm/data
 The datalake role: DATALAKE_ADMIN_ROLE is assigned the iam role: cdp-ljm-admin-s3-role which has been granted: full-access for path: /ljm
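The grants listed in this output could be used to pick a role for a given path. The sketch below is one hypothetical way to do that: it keeps the grants whose path contains the requested path and orders them by rank. It assumes that a lower rank means broader access, which the output suggests but does not state, and the function name `covering_grants` is invented for this example.

```python
from pathlib import PurePosixPath

# Ranks copied from the example output. Assumption: lower rank = broader access.
RANKS = {"full-access": 1, "full-object-access": 2, "read-write": 3, "read-only": 4, "list-only": 5}

# (iam role, permission level, path) triples copied from the build messages.
GRANTS = [
    ("cdp-ljm-log-role", "read-write", "/ljm/logs"),
    ("cdp-ljm-ranger-audit-s3-role", "full-object-access", "/ljm/ranger/audit"),
    ("cdp-ljm-ranger-audit-s3-role", "list-only", "/ljm/data"),
    ("cdp-ljm-admin-s3-role", "full-access", "/ljm"),
]


def covering_grants(requested_path, grants=GRANTS, ranks=RANKS):
    """Return the grants whose path contains requested_path, narrowest access first."""
    requested = PurePosixPath(requested_path)
    ancestors = {requested, *requested.parents}
    matches = [g for g in grants if PurePosixPath(g[2]) in ancestors]
    # Highest rank number first, i.e. the most restrictive grant under the assumption above.
    return sorted(matches, key=lambda g: ranks[g[1]], reverse=True)
```

For a path under /ljm/logs, both the log role and the admin role match, and the log role is listed first.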

Generated datalake directories:

 lmccay@strange:~/Projects/datalake-def$ ls datalakes/ljm/
 AWS/      ddf.yaml  

Generated AWS Policy Artifacts:

 lmccay@strange:~/Projects/datalake-def$ ls datalakes/ljm/AWS/
 cdp-ljm-admin-s3-role-policy.json  cdp-ljm-idbroker-assume-role-policy.json  cdp-ljm-log-role-policy.json  cdp-ljm-ranger-audit-s3-role-policy1.json  cdp-ljm-ranger-audit-s3-role-policy.json
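Before handing the generated files to another tool, it can be useful to confirm that each one parses as JSON. The following sketch does that for a directory such as `datalakes/ljm/AWS`; the function name `load_policies` is an assumption, and nothing here depends on the contents of the policies themselves.

```python
import json
from pathlib import Path


def load_policies(directory):
    """Parse every *.json file in `directory` and return {file name: parsed JSON}."""
    policies = {}
    for policy_file in sorted(Path(directory).glob("*.json")):
        with policy_file.open(encoding="utf-8") as handle:
            # json.load raises ValueError (JSONDecodeError) on malformed input.
            policies[policy_file.name] = json.load(handle)
    return policies
```

Calling `load_policies("datalakes/ljm/AWS")` after a build would return one entry per generated policy file.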

Generated YAML DDF (Datalake Definition File):

 lmccay@strange:~/Projects/datalake-def$ cat datalakes/ljm/ddf.yaml 
 datalake: ljm
             iam_role: cdp-ljm-idbroker-assume-role
             instance_profile: true
                 - "sts:assume-roles"
             iam_role: cdp-ljm-log-role
             instance_profile: true
                 - "storage:read-write:LOGS_LOCATION_BASE"
             iam_role: cdp-ljm-ranger-audit-s3-role
                 - "storage:full-object-access:RANGER_AUDIT_LOCATION"
                 - "storage:list-only:DATALAKE_BUCKET"
             iam_role: cdp-ljm-admin-s3-role
                 - "storage:full-access:STORAGE_LOCATION_BASE"
                 - "db:full-table-access:ljm-table"
             # main data directory
             path: /ljm
             # main data directory
             path: /ljm/data
             # ranger audit logs
             path: /ljm/ranger/audit
             # logs for fluentd usecases
             path: /ljm/logs
             # logs for fluentd usecases
             path: /ljm
             rank: 1
             description: the force
             rank: 2
             description: jedi master
             rank: 3
             description: jedi knight
             rank: 4
             description: padawan
             rank: 5
             description: youngling
             rank: 1
             description: shapeshifter
             rank: 1
             description: dba

Azure Setup

The following steps describe one way to set up and run the scripts for Azure.


Environment Setup

  • Sign in to your Azure account

      az login
  • Create a Service Principal (if not already created) - make sure the SP has the Owner role at subscription level

      az ad sp create-for-rbac --name KnoxSP --password knox-password > local-sp.json

    NOTE: To set permissions explicitly, use the --skip-assignment option and assign Owner permissions at subscription level later. In either case, the Service Principal must have Owner permissions at subscription level.

  • Get subscription id

      az account show 
  • Set up environment variables (update values as necessary) - Required

  • Set up environment variables (update values as necessary) - Optional

      AZURE_RESOURCE_GROUP="myResourceGroup" #Resource group under which MSIs will be created, else default is <datalakename>RG

For more information, see the Azure Configure Authentication docs.
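The fallback described for the optional AZURE_RESOURCE_GROUP variable can be sketched in a few lines. This is an illustration of the documented default (`<datalakename>RG`) only; the function name `resource_group_for` is hypothetical and the tool may resolve the value differently.

```python
import os


def resource_group_for(datalake_name, env=None):
    """Return AZURE_RESOURCE_GROUP if set and non-empty, else '<datalakename>RG'."""
    env = os.environ if env is None else env
    return env.get("AZURE_RESOURCE_GROUP") or f"{datalake_name}RG"
```

With the variable unset, a datalake named srm would use the resource group srmRG.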


  • Create a default DDF

      ddt.py "set debug true" "new_datalake -n srm" "build_datalake -c Azure" "push_datalake -c Azure" quit

Google Cloud Platform Setup

Set up and run scripts for Google Cloud Platform (GCP).


  • A Google Cloud Platform account with permissions associated with the following GCP roles:
    • StorageAdmin (bucket/path creation/removal)
    • ServiceAccountAdmin (service account creation/removal)
    • SecurityAdmin (IAM policy attachment)

Environment Setup

  • Download the key file for the Google Cloud Platform account to be used by the Datalake Definition tool.
  • Set the GOOGLE_APPLICATION_CREDENTIALS environment variable to point to that key file.
  • Set the DDF_LOG_LEVEL environment variable to DEBUG to get debug-level log messages from the GCPFactory.
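A small pre-flight check can catch a missing or mistyped key file path before any GCP call is made. The sketch below only inspects the environment; the function name `gcp_preflight` is an assumption and is not part of the tool.

```python
import os


def gcp_preflight(env=None):
    """Return a list of problems with the credentials setup (empty if none were found)."""
    env = os.environ if env is None else env
    problems = []
    key_file = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not key_file:
        problems.append("GOOGLE_APPLICATION_CREDENTIALS is not set")
    elif not os.path.isfile(key_file):
        problems.append(f"key file not found: {key_file}")
    return problems
```

An empty list means the variable is set and points at an existing file; it does not verify that the key itself is valid.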


N.B. Datalake names are translated into Google Cloud Storage bucket names, which must be globally unique. Attempts to access buckets which don't belong to you will result in HTTP 403 errors, even though your permissions include the StorageAdmin role.
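Global uniqueness cannot be checked locally, but the basic shape of a candidate bucket name can. The sketch below covers only part of the Cloud Storage naming rules (3 to 63 characters, lowercase letters, digits, dashes, underscores and dots, starting and ending with a letter or digit, and no "goog" prefix). How the tool derives a bucket name from a datalake name is not described here, so the function is applied to the candidate name directly; `looks_like_valid_bucket_name` is a hypothetical helper.

```python
import re

# Partial check: 3-63 characters, allowed characters, alphanumeric first and last character.
_BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]")


def looks_like_valid_bucket_name(name):
    """Return True if `name` passes this subset of the Cloud Storage naming rules."""
    return bool(_BUCKET_NAME.fullmatch(name)) and not name.startswith("goog")
```

A name that passes this check can still be rejected or return HTTP 403 if another account already owns a bucket with that name.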

  • Create a Google Cloud Platform datalake definition from the templates

      ddt.py "new_datalake -n mydl" "build_datalake -n mydl -c GCP" quit
  • Push a Google Cloud Platform datalake definition

      ddt.py "push_datalake -n mydl -c GCP" quit
  • Recall a Google Cloud Platform datalake definition

      ddt.py "recall_datalake -n mydl -c GCP" quit