
dataform-and-dbt/


dataform and dbt - Matt Arderne

Welcome to my third post, one I have wanted to write from the beginning. Getting these posts done isn’t easy, and the time between publishing is a commitment that I undertook rather lightly. Like most good ideas, this one is late, irrelevant, and likely only to be marginally useful. That said, here is a quick rundown on two of the “indicative-of-the-future” SQL tools in data analytics at the moment: Dataform and dbt.

https://rdrn.dev/dataform-and-dbt/

Thank you for your reflective article. After reading it I now know which tool I am going to choose. Although I am a seasoned GCP developer, I will go for dbt. The database in our current application is Azure DWH, and I also prefer Jinja over JavaScript.
When it comes to the broader picture I have a slightly different world view. You write "The modern ELT data stack that no one is going to second guess has Snowflake as the data warehouse, Fivetran doing the ExtractLoad and dbt doing the T for transforms." I am second-guessing this a little bit. From my perspective BigQuery is the ultimate data warehouse product, as it beats all its competitors on price, performance and usability (especially no indexing). Fivetran has only barely crossed my mind – as one of many web-based ETL tools. Airflow and Dataflow I value for their purposes - orchestration and massively parallel ingestion.

Hi @Ture2019 thanks for the comment, and feedback.

I think Snowflake vs BigQuery is possibly a matter of preference, as I've seen reviews favouring both for price and performance in different contexts. I can definitely say that Snowflake is especially good for usability (no maintenance whatsoever), but this is very much in line with BigQuery, so again it is probably preference or context that dictates which one is the better choice. I work with both almost daily and think they are both excellent!

Fivetran is worth considering for the peace of mind; I will only consider alternatives if there is cost sensitivity or no supported connector, and only as an Extract and Load tool (the T for transform is for dbt/Dataform). But I have not used Dataflow, so perhaps there is some overlap.

Finally - "no one is going to second guess" was maybe a poor phrase to choose. What I mean by it is that those are perfectly acceptable choices, not strictly the absolute best choices!

Hey Matt, thanks for the insightful article! I was particularly struck by your description of some of the people getting started with the modern data stack:

Thus, dbt/Dataform cloud is my primary point of consideration, because that has been my primary interest. Both are extensively used as CLI tools, which for many is the go-to. My focus has been in enabling teams that initially do not have the time/resources to upskill/maintain/support the tech, and so my considerations are within the context of the cloud offering. This may be contentious, but on average the analysts who come from Bizops or something of that nature seem to make for more rounded data analysts, and enabling them is key.

There's definitely some set of users that are getting involved with data pipelines and don't want to touch a command line or bother a data engineer, and dbt Cloud + Snowflake + SQL makes that possible. But the moment they want to do anything outside of SQL (such as building a predictive model) they need to set up Airflow/another orchestrator, which is probably not inside the expertise of business analysts or even data scientists. Do you think there are people running into that barrier? Or is no/low code BI most of their focus?

Hey @rachtsingh thanks for the question! I definitely think this is a huge barrier that many are hitting.

I think you've picked up on something very important. The architecture for the SQL use cases is very clear, but stepping outside that world leads to a very messy place indeed. This is neatly reflected in this Twitter thread from the team at fal.

If you look at that unbundling of Airflow (excluding the dbt+ELT), you can see a huge number of additional tools that are variously trying to fill this gap.

I think the simplest answer is that this space is currently too messy for there to be a simple "use this tool" approach, but I do have a few relatively clear thoughts:

  1. Tools like fal that extend the capabilities of the typical dbt user feel appropriate, as they require the least additional infra and tech for users.
  2. The typical dbt user may just possibly be interested in implementing an orchestrator. I think this is a logical point of interest for maybe half of dbt developers. With Dagster and Prefect making this less daunting, I think they should try!
  3. I think there will be a few clear best approaches in the "basic orchestration + python" space. However, as you can see from the Airflow diagram, there is still some way to go - for example, a simple prediction model would encompass at least 3 of the tools on that map (e.g. tinybird, streamlit). I still think fal is an interesting one, as it doesn't need an orchestrator!

If you like or are already using Snowflake, then Dataform is no longer an option.

What's the rationale for this statement? Has it been announced that Dataform will become GCP-exclusive, or is that implicit as part of their acquisition?

@james-mead it was announced to customers and probably publicly somewhere.

Dataform is in private preview for BigQuery 🎆
https://cloud.google.com/dataform/

Thanks for sharing this article; it was really interesting and digestible even for me as a Junior Data Engineer who is just starting out and getting to grips with all the tools and ideas in the data landscape!

I'm wondering, if you had a situation where your data was split across Redshift and BigQuery (or any other platforms), would you advise using DBT over Dataflow since it would provide support for both?

Hey @JustinMyerson, no problem, thanks for commenting!

I would suggest starting with dbt over DataFORM, as it might cater better to your situation, though be aware that it might not help in creating the unified project you are hoping for.

Just a quick note, you say that it's GCP only, but in the GCP Docs I see this:

Dataform core is part of the open-source Dataform data modeling framework that also includes Dataform CLI. You can compile and run Dataform core locally through Dataform CLI outside of Google Cloud to deploy assets to the following data warehouses:

  • BigQuery
  • Snowflake
  • Redshift
  • Azure SQL Data Warehouse
  • Postgres

It also links to

Git Repo of DataForm
DataForm CLI for local development

Which in turn informs us:

This document shows you how to use the open-source Dataform command line interface (CLI) to locally develop SQL workflows by using the terminal.
With the open-source Dataform CLI, you can initialize, compile, test, and run Dataform core locally, outside of Google Cloud to deploy assets to the following data warehouses:

  • BigQuery
  • Snowflake
  • Redshift
  • Azure SQL Data Warehouse
  • Postgres

Dataform distributes a Docker image which you can use to run the equivalent of Dataform CLI commands.

Thanks @RichMarmalite, I specify near the top and again in the TLDR that I'm discussing the cloud versions. Both have an open-source core, and hopefully the comments now make that clear. Appreciate the comment 👍

Thus, dbt/Dataform cloud is my primary point of consideration, because that has been my primary interest. Both are extensively used as CLI tools, which for many is the go-to.

Both tools have a SaaS offering, effectively a cloud IDE + CI/CD. Great for getting connected and developing quickly. Both happily run a hybrid of cloud IDE and open-source core.

Great article, Matt! I've been following the dbt and Dataform evolution very closely as well. I'm a strong believer in competition, and do believe it is a good thing to have a dbt alternative out there.

I'm curious about Dataform vs dbt traction, and how the Google acquisition will impact the open-source version of Dataform. I know Dataform closed their Slack community, and it's now limited to @dataform.co and @google.com domains. Is there any other community for people developing Dataform worth joining?

Thanks @elyisu (and for making a Github account to comment?)

I think it is likely seen as a feature for BigQuery, and thus requires little cultivation of open-source adoption to remain viable. Is there any precedent? I think Malloy is an interesting one: an open-source query language with a community, on the back of a product acquired by Google. This makes me think it is within policy to allow a community.

I don't know of any communities! (The Dataform one is open to existing accounts afaik)

Hey Matt, thanks a lot for your post, I share your excitement in seeing how both dbt and Dataform will evolve! It would be awesome to see a follow-up post maybe a year from now to see what has changed.

An additional note I would like to make is about unit testing in data pipelines. Nowadays data pipelines can get quite complex and long, which brings the need for proper testing. A big plus for Dataform is that this is integrated into the ecosystem: https://docs.dataform.co/guides/tests. For dbt this has been made possible through a package, which I believe is suboptimal (I would rather have this available out of the box in dbt): https://github.com/EqualExperts/dbt-unit-testing
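
Roughly, per those Dataform docs, a unit test is a SQLX file that names the dataset it tests, mocks that dataset's inputs, and states the expected output. The dataset and column names below are made up for illustration, so treat this as a sketch of the idea rather than copy-paste code:

```sqlx
config {
  type: "test",
  dataset: "age_groups"
}

input "ages" {
  -- mocked rows standing in for the real "ages" input
  SELECT "jane" AS name, 32 AS age
  UNION ALL
  SELECT "paul" AS name, 28 AS age
}

-- expected output of the (hypothetical) "age_groups" dataset given the mocked input
SELECT "20-29" AS age_group, 1 AS user_count
UNION ALL
SELECT "30-39" AS age_group, 1 AS user_count
```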

Two other things I find very promising about dbt over Dataform are the two materializations that dbt offers and Dataform does not. By this I mean:

  • Ephemeral materializations (https://docs.getdbt.com/docs/build/materializations#ephemeral). This one really helps us in writing component-based data pipelines and keeps queries small and testable (a minimal sketch follows below this list). I haven't found an alternative for this in Dataform; you can only materialise as a view or table, and I cannot find anything about how to reuse Common Table Expressions (CTEs, also known as WITH statements).

  • Python materializations (https://docs.getdbt.com/docs/build/materializations#python-materializations). This has the potential for us to perform statistical calculations during our data pipeline, which would save a lot of complexity for us down the road. I really like that dbt will go down the Python route, which seems to be the go-to language in the data world. I see this as a plus over Dataform's plain JavaScript solution, although that also offers powerful capabilities.
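
To make the ephemeral point concrete, here is a minimal sketch (the model, source and column names are invented for illustration): an ephemeral model is a normal SQL model whose materialization is set in its config block, and anything that refs it gets the query inlined as a CTE instead of reading from a built table or view.

```sql
-- models/staging/stg_orders.sql (hypothetical model)
{{ config(materialized='ephemeral') }}

select
    order_id,
    customer_id,
    amount
from {{ source('shop', 'raw_orders') }}  -- assumes a 'shop' source declared in a .yml file
where amount is not null
```

```sql
-- models/marts/customer_revenue.sql (hypothetical downstream model)
-- dbt compiles the ephemeral model above into a CTE inside this query,
-- so no stg_orders table or view is ever created in the warehouse.
select
    customer_id,
    sum(amount) as total_revenue
from {{ ref('stg_orders') }}
group by customer_id
```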

Hey @the-serious-programmer, thanks for the comment! Perhaps in a year or two I'll have some useful insights to share.

Agree on both of those points. Dataform never really got beyond the basics with materialisations; perhaps they will align more tightly with BQ now that they can focus on it.

New dbt IDE, and hopefully much more:

https://www.paradime.io/code-ide