dbt docs generate is very slow if many tables exist in the catalog, even if they are not used in the dbt project
michiel-de-muynck opened this issue · 1 comments
Expected behavior
When I create a dbt project which writes to multiple schemas (e.g. by configuring seeds to be written to schema with suffix _seeds
) and run dbt docs generate
, I expect the run-time to depend on the number of tables in these target schemas.
Actual behavior
The run-time depends on the total number of tables in the catalog. For instance, if ~3000 tables exist in the catalog, dbt docs generate
can take around 10 minutes.
The reason is that the get_catalog
macro reads system.metadata.table_comments
and filters using schema_name in ('schema_1','schema_2', etc)
. However, Trino does not push down this filter and instead reads all table comments from all tables, which can be incredibly slow (processing only ~15 tables per second).
A possible fix, which I will add a PR for shortly, is to rewrite that query using union all
, as follows:
select ... from system.metadata.table_comments where ... and schema_name = 'schema_1'
union all
select ... from system.metadata.table_comments where ... and schema_name = 'schema_2'
union all
etc.
By doing this, Trino reads only the table comments from the tables in the selected schemas, and running dbt docs generate
is much faster, taking only a few seconds if the dbt project only contains a few tables.
Steps To Reproduce
- Create an Iceberg catalog. Create several thousand tables in it.
- Create a dbt project which uses the catalog as its target. Add a seed. Configure in
dbt_project.yml
to write the seed to its own schema as follows:
seeds:
trinotest:
schema: seeds
- Run
dbt seed
- Run
dbt run
- Run
dbt docs generate
- Observe that it takes 10 minutes.
Log output/Screenshots
11:13:18 Running with dbt=1.4.1
11:13:18 Found 2 models, 4 tests, 0 snapshots, 0 analyses, 310 macros, 0 operations, 1 seed file, 0 sources, 0 exposures, 0 metrics
11:13:18
11:13:20 Concurrency: 1 threads (target='dev')
11:13:20
11:13:20 Done.
11:13:20 Building catalog
11:21:24 Catalog written to /home/myusername/trinotest/target/catalog.json
Operating System
Ubuntu 22.04.1 LTS
dbt version
1.4.1
Trino Server version
407-e.0
Python version
3.10
Are you willing to submit PR?
- Yes I am willing to submit a PR!