Discrepancy Between Manifest & Catalog - Non-Defined Columns Missing

Question

Discrepancy Between Manifest & Catalog - Non-Defined Columns Missing

sisu-callum opened this issue 4 years ago · 3 comments

While looking at the results of this last version 0.7.0, I noticed that one of my models only had a single column defined within the LookML. I dug into it and realized that this is because I only defined a single column within my schema.yml for that model. I dug into the manifest of the specific model model using this command cat target/manifest.json | jq '.nodes["model.my_project_name.model_name"]' and noticed that only a single column is returned in the manifest.

I then pulled up the catalog with this command cat target/catalog.json | jq '.nodes["model.my_project_name.model_name"]' and there were 229 columns returned.

I'm not sure if this is an opinionated stance that has been taken (ie, columns that make it to the Looker view should be defined within the corresponding schema.yml) but wanted to highlight just in case it wasn't intentional.

Answer 1 · 2021-04-18T21:44:13.000Z

Thanks for raising this one (came up even earlier than I thought!)

So this was an intentional decision such that only resources declared in dbt appear in looker. This really suits the way we work but might not be for everyone.

I'm quite curious though what would be most useful for your team? I guess you were at least expecting columns to appear for models declared in dbt. What about tables in the warehouse but not in the dbt schema?

Answer 2 · 2021-04-18T22:00:50.000Z

I'm a big fan of this decision. My personal ethos is that teams should aim for 100% column coverage, as thats the best way to pass on domain knowledge.

That being said, 100% column coverage is a bit of an aspirational thing and could potentially impact teams that want to iterate model --> lookml creation quickly and programmatically with dbt2looker. I'm imagining a scenario where someone has model_1 and model_2 that create model_3. They use dbt_utils.star in model_3 to get all columns from both models. If they added 2 columns to model_1 and didn't document them, I could see that potentially being an issue that isn't caught until some raises a flag with the new dimensions not appearing in the LookML.

I think the best solution (and probably the most work to be honest) would be the creation of a dbt2looker.yml file where this could be defined at the project level. Default setting would be the current as-is state of not bringing in fields that aren't declared but this could be changed to include everything. That gives different teams the flexibility to choose an option that fits their style of development and documentation.

Answer 3 · 2021-05-03T02:23:48.000Z

perhaps this could be another meta option, eg a default like infer-looker-dimensions=false and then if you flip the switch, automatically generate column descriptions indicating that the field was machine generated, because fundamentally we want some sort of guard to caution users when they interact with data that has not been explicitly documented.

I think this might also solve for the open question discussed in dbt slack, since there's not a huge distinction between columns and subcolumns from dbt's perspective.