datonic/datadex

Governance and Catalog

Closed this issue · 9 comments

I was thinking of building a "poor man's data platform" when I found datadex, and it covers almost everything on my wishlist.
But in a scenario where you may have many data-products, I miss a place to discover and document them. Have you considered adding a discovery and governance layer to the mix? I am thinking about Datahub or Open-metadata.

P.S. Not sure if this was the right way to ask. ¯_(ツ)_/¯

This is definitely the right way to ask!

Glad to hear Datadex fits what you wanted to build and also, great questions.

in a scenario where you may have many data-products, I miss a place to discover and document them

Not entirely sure what you mean by data-products there, but if you're talking about models/datasets, these are some of my thoughts:

  • Models can be documented either via dbt Docs or by manually creating a notebook that loops over the datasets (there is an example in one implementation of Datadex).
  • Quarto is super flexible and you can totally make a catalog or documentation with it.
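
To make the notebook idea concrete, here is a minimal sketch of a loop that builds a small Markdown catalog. It assumes the datasets are exported as CSV files in a single local folder; the function names `describe_dataset` and `build_catalog` and the folder layout are hypothetical and not taken from Datadex itself.

```python
import csv
from pathlib import Path


def describe_dataset(path: Path) -> dict:
    """Read one CSV file and return its name, column names, and row count."""
    with path.open(newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, [])        # first line holds the column names
        n_rows = sum(1 for _ in reader)  # remaining lines are data rows
    return {"name": path.stem, "columns": header, "rows": n_rows}


def build_catalog(data_dir: Path) -> str:
    """Loop over every CSV in data_dir (hypothetical layout) and render a Markdown table."""
    lines = ["| Dataset | Rows | Columns |", "| --- | --- | --- |"]
    for path in sorted(data_dir.glob("*.csv")):
        info = describe_dataset(path)
        columns = ", ".join(info["columns"])
        lines.append(f"| {info['name']} | {info['rows']} | {columns} |")
    return "\n".join(lines)
```

The returned string can be printed from a notebook cell or written to a `.md` or `.qmd` file so that a static site picks it up.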

These are very manual approaches. To be honest, it's not something I've spent much time thinking about. Would love to explore it together and learn about tools or approaches that improve this process!

Will try to spend some time with Datahub or Open Metadata and report back. Also, feel free to fork the repo and try anything that might fit your needs!

I define a data product as the result of a data engineering process, akin to how software results from software engineering. A data product can manifest as a table (e.g., parquet, csv, Excel, duckdb file), a report, a dashboard, a dataset... you name it.

The term 'asset' isn't quite fitting, as it suggests value in holding, whereas a data product's value lies in its use. But to be used, a data product must also be discoverable, trustworthy, understandable. Beyond its physical form, a data product requires metadata detailing its code, ownership, documentation, tests, etc.
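
As a rough sketch, that metadata could be modeled as a small record kept alongside each data product. The class name `DataProduct` and all of its field names are hypothetical and only illustrate the idea; they do not come from any particular catalog tool.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """Hypothetical metadata record for one data product."""
    name: str
    location: str          # where the physical form lives, e.g. a parquet file path
    owner: str = ""        # person or team responsible
    description: str = ""  # human-readable documentation
    source_code: str = ""  # pointer to the code that produces it
    tests: list[str] = field(default_factory=list)  # names of checks run against it

    def missing_metadata(self) -> list[str]:
        """Return the metadata fields that are still empty."""
        checked = {
            "owner": self.owner,
            "description": self.description,
            "source_code": self.source_code,
            "tests": self.tests,
        }
        return [key for key, value in checked.items() if not value]
```

A check like `missing_metadata()` could run before publishing, so that a product without an owner or documentation is flagged before it reaches users.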

To contextualize, I work in the Brazilian Ministry of Health, in a department creating economic evidence to inform policies. Our outputs are studies, reports, dashboards, and datasets, serving internal policy makers and external academics and think tanks. Our diverse team, including economists, public health experts, and pharmacists, primarily uses Excel and ad-hoc SQL scripts, with some employing R or Python. However, our process needs refinement.

My aim is to enhance team efficiency and "developer experience" by introducing streamlined tools and processes.

Regarding governance and cataloging, I wonder if Dagster is the answer as we need a solution for external users to access our data products, complete with documentation and metadata, without exposing the underlying processes. I've explored Datahub, Open Metadata, and Open Data Discovery, but they seem overly complex for our needs.

Thanks so much for the clear explanation @fredguth!

Beyond its physical form, a data product requires metadata detailing its code, ownership, documentation, tests, etc.

I think that should be something Dagster can take care of. Actually, I wanted to rely on Dagster for similar things and created this issue: dagster-io/dagster#17807.

I feel having Dagster + Quarto in the same static website can help people navigate all the data products!

I will consider that... I actually like Quarto very much, and there is always an overhead when you add one more tool. I have also been considering it to generate the PDFs of the reports. Recently I have been using Typst.app (think of it as a LaTeX replacement) a lot and Quarto less and less. But I heard Quarto now can render Typst templates.

Quarto and Evidence also seem to have some overlap. I am new to Evidence, but I like that it is built in Svelte, which I can understand better than Quarto's source code.

Nice! Been thinking about Typst recently too.

But I heard Quarto now can render Typst templates.

Yes! It's on their prerelease channel though.

Quarto and Evidence also seem to have some overlap

💯 I think Quarto is a bit more generic as it can render JS, SQL, Python, ... but Evidence looks better and has a better DX. I'm struggling between adding it or keeping it simple 🙈

Feel the same. I will keep Evidence just because I understand its source code better and I like the team and vision there.

Typst is now part of the official release and I already created a Typst template :-)