starburstdata/dbt-trino

dbt docs generate is very slow if many tables exist in the catalog, even if they are not used in the dbt project

michiel-de-muynck opened this issue · 1 comments

Expected behavior

When I create a dbt project which writes to multiple schemas (e.g. by configuring seeds to be written to schema with suffix _seeds) and run dbt docs generate, I expect the run-time to depend on the number of tables in these target schemas.

Actual behavior

The run-time depends on the total number of tables in the catalog. For instance, if ~3000 tables exist in the catalog, dbt docs generate can take around 10 minutes.

The reason is that the get_catalog macro reads system.metadata.table_comments and filters using schema_name in ('schema_1','schema_2', etc). However, Trino does not push down this filter and instead reads all table comments from all tables, which can be incredibly slow (processing only ~15 tables per second).

A possible fix, which I will add a PR for shortly, is to rewrite that query using union all, as follows:

select ... from system.metadata.table_comments where ... and schema_name = 'schema_1'
union all
select ... from system.metadata.table_comments where ... and schema_name = 'schema_2'
union all
etc.

By doing this, Trino reads only the table comments from the tables in the selected schemas, and running dbt docs generate is much faster, taking only a few seconds if the dbt project only contains a few tables.

Steps To Reproduce

  1. Create an Iceberg catalog. Create several thousand tables in it.
  2. Create a dbt project which uses the catalog as its target. Add a seed. Configure in dbt_project.yml to write the seed to its own schema as follows:
seeds:
  trinotest:
    schema: seeds
  1. Run dbt seed
  2. Run dbt run
  3. Run dbt docs generate
  4. Observe that it takes 10 minutes.

Log output/Screenshots

11:13:18 Running with dbt=1.4.1
11:13:18 Found 2 models, 4 tests, 0 snapshots, 0 analyses, 310 macros, 0 operations, 1 seed file, 0 sources, 0 exposures, 0 metrics
11:13:18
11:13:20 Concurrency: 1 threads (target='dev')
11:13:20
11:13:20 Done.
11:13:20 Building catalog
11:21:24 Catalog written to /home/myusername/trinotest/target/catalog.json

Operating System

Ubuntu 22.04.1 LTS

dbt version

1.4.1

Trino Server version

407-e.0

Python version

3.10

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Resolved by #273