A course in Databricks DataOps, based on a data mesh monorepo structure.
- All members must get commit access to:
  - the repo from the teacher
  - the Databricks training workspace
How can we deploy Databricks data pipelines in a way that is:
- (git-)versioned
- usable
- ordered and sustainable
- enabling decentralized domain ownership for each data domain and team, which supports data mesh-like principles
- backed by a clear way of working for exploration, development, staging and production of pipelines
We will do our tasks in the context of the folder representing the revenue data pipeline or flow:
orgs/acme/domains/transport/projects/taxinyc/flows/prep/revenue/
The structure is a proposal, which might have to be adapted in a real world organization.
The structure is:
- org: acme
- domain: transport
- project: taxinyc
- flowtype: prep (meaning ETL/data engineering; the alternative is ml, for ML work)
- flow: revenue
The structure will be applied to:
- Data code, i.e. the PySpark code here in git
- The database tables produced by that code
- The data pipelines being deployed
The purpose of this structure is to provide enough granularity to keep each department/org, team/domain, project and pipeline cleanly separated.
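To illustrate how the same hierarchy could carry through from code to tables and deployed pipelines, here is a minimal sketch that derives schema, table and pipeline names from the flow's path components. The underscore/hyphen naming convention shown is an assumption for illustration only, not necessarily the convention used by the libs in this repo.

```python
# Hypothetical illustration: deriving names from the org/domain/project/
# flowtype/flow hierarchy. The naming convention is an assumption, not
# necessarily what libs/dataops/deploy actually does.
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowCoordinates:
    org: str
    domain: str
    project: str
    flowtype: str
    flow: str

    def schema_name(self) -> str:
        # One schema per project/flowtype, e.g. "taxinyc_prep"
        return f"{self.project}_{self.flowtype}"

    def table_name(self, table: str) -> str:
        # Qualified table name, e.g. "taxinyc_prep.revenue_by_borough"
        return f"{self.schema_name()}.{self.flow}_{table}"

    def pipeline_name(self) -> str:
        # Deployed pipeline/job name, e.g. "acme-transport-taxinyc-prep-revenue"
        return "-".join([self.org, self.domain, self.project, self.flowtype, self.flow])


revenue_flow = FlowCoordinates("acme", "transport", "taxinyc", "prep", "revenue")
print(revenue_flow.table_name("by_borough"))  # taxinyc_prep.revenue_by_borough
print(revenue_flow.pipeline_name())           # acme-transport-taxinyc-prep-revenue
```

Keeping every level in the name is what lets each team own its own tables and pipelines without colliding with other domains in the same workspace.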
You can explore the structure here in Databricks, or more easily in the repo with a browser.
A longer explanation of the ideas behind the repo structure can be found in the article Data Platform Urbanism - Sustainable Plans for your Data Work.
There are Python libs in the libs folder that enable a versioned pipeline deployment and way of working. The main logic is in libs/dataops/deploy, and there are tests under libs/tests.
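If you want to run those tests locally, the sketch below assumes they are pytest-style tests; the libs/tests path comes from this repo's layout, but the test runner itself is an assumption you should verify against the repo.

```python
# Run the dataops library tests, assuming they are written for pytest.
# Equivalent to running `pytest libs/tests -v` from the repo root.
import pytest

exit_code = pytest.main(["libs/tests", "-v"])
print(f"pytest finished with exit code {exit_code}")
```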
The structure and dataops libs can be used in your own projects, by forking the repo or copying the content and adapting it.
- Go to course/
- DO NOT run anything under course/00-Workshop-Admin-Prep
- Go to course/01-Student-Prep/01-General
- Go through the instructions under that folder
- Start with the tasks under 02-DeployTasks
- Some sections are just for reading or running; others you need to solve yourself.