duckdb/dbt-duckdb

Database increases by initial size after each dbt run is executed

simtbak opened this issue · 8 comments

I have a project running on Windows 10 that contains only views (no tables materialized). The views select from sources configured with meta external_location, reading the files via read_parquet. I have found that if I delete the database file and then execute successive dbt run commands, the file grows by approximately its initial size on each run.

The size of the database file for each successive run is:

  1. 29,964 kb
  2. 58,380 kb
  3. 86,796 kb
  4. 115,212 kb
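Boiled down outside of dbt, each run is doing roughly the following (the paths here are placeholders for my actual project, where the sources are configured via meta external_location):

import os
import duckdb

DB_PATH = "project.duckdb"            # placeholder database file
PARQUET = "../data/file.parquet"      # placeholder parquet source

for run in range(1, 5):
    con = duckdb.connect(DB_PATH)
    # each dbt run just re-creates the view definitions; no table data is written
    con.execute(
        "CREATE OR REPLACE VIEW v_test AS "
        f"SELECT * FROM read_parquet('{PARQUET}')"
    )
    con.close()
    print(f"run {run}: {os.path.getsize(DB_PATH):,} bytes")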

Thanks for all your work on this. I'd love to contribute, but I'm not sure where to start looking.

jwills commented

ooh that's a fun one, let me try to repro it on OS X; we should also check upstream in the duckdb issues to see if this is a known problem.

jwills commented

Ah okay, so it looks like the initial doubling (from an empty DB file to a re-run of an existing file) is a known issue and is related to 1) the ACID semantics of the database write (i.e., DuckDB won't delete the old data until it knows the new data has been successfully created) and 2) the lack of a VACUUM command to clean up the freed blocks of data from the old file.

Some context/discussion here from the DuckDB Discord in October 2022: https://discord.com/channels/909674491309850675/921073327009853451/1031941435815112776

The recommended workaround in that discussion is to do an EXPORT DATABASE followed by an IMPORT DATABASE into a fresh file as a substitute for VACUUM: https://stackoverflow.com/questions/66027598/how-to-vacuum-reduce-file-size-on-duckdb
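Roughly (untested sketch, made-up file names), that means exporting the catalog and re-importing it into a brand-new file, which starts out with no freed blocks:

import duckdb

# Export the existing database to a directory of schema.sql / load.sql / data files
con = duckdb.connect("project.duckdb")
con.execute("EXPORT DATABASE 'vacuum_export'")
con.close()

# Import into a brand-new file; the old, bloated file can then be replaced with it
fresh = duckdb.connect("project_compacted.duckdb")
fresh.execute("IMPORT DATABASE 'vacuum_export'")
fresh.close()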

We could maybe add that sequence as a vacuum macro that could be done on-run-end to clean stuff up?

jwills commented

The fact that it keeps increasing on Windows surprises me though; going to see if there is something I can find about it in the Discord...

I did a test exporting the DB on-run-start and importing it on-run-end:

{% macro exportdb() %}
	EXPORT DATABASE 'test_export';
{% endmacro %} 

{% macro importdb() %}
	IMPORT DATABASE 'test_export';
{% endmacro %}

Both hooks run successfully, but the database still doubles. I noticed that while the schema.sql export file is correctly populated with the models, like:

/* {"app": "dbt", "dbt_version": "1.4.5", "profile_name": "duckdb", "target_name": "dev", "node_id": "model.project.v_test"} */

  create view "project"."main"."v_test__dbt_tmp" as (
    select
      "column"
    from read_parquet("../data/file.parquet")
  );

The load.sql file, on the other hand, is empty. I guess this is because the parquet file data shouldn't be loaded into the DB; it's a view that reads directly from the external file source. If a table with data in it had been created, the load.sql file would contain some COPY statements.

I guess this adds to the puzzle: DuckDB doesn't recognise that it's storing any data other than the view definitions, which only add up to ~90 kb in test_export/schema.sql.
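A quick standalone way to sanity-check that, outside of my project (throwaway paths and a made-up table):

import duckdb

con = duckdb.connect()  # throwaway in-memory database
# a tiny parquet file to stand in for the real external source
con.execute("COPY (SELECT 42 AS answer) TO 'demo.parquet' (FORMAT PARQUET)")
con.execute("CREATE VIEW v AS SELECT * FROM read_parquet('demo.parquet')")
con.execute("CREATE TABLE t AS SELECT 1 AS x")
con.execute("EXPORT DATABASE 'export_demo'")
con.close()
# export_demo/schema.sql contains both CREATE statements, while
# export_demo/load.sql only gets a COPY for the table t -- the view
# definition carries no rows of its own.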

I modified the exported schema.sql file to re-point it from project to a new database called recreated, removed all of the __dbt_tmp suffixes, then re-ran it directly in DuckDB via Python:

>>> import duckdb
>>> db = duckdb.connect('recreated.duckdb')
>>> db.execute(open('test_export/schema.sql').read())
<duckdb.DuckDBPyConnection object at 0x0000020D5A3F1AF0>
>>> db.sql('select count(*) from v_test')
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│        64374 │
└──────────────┘

>>> db.close()
>>>

After creating all the views from schema.sql and running the select count(*), the database recreated.duckdb was only 12 kb; after db.close() it went up to 780 kb. I guess DuckDB records and stores some query/indexing metadata at the end of a session. That makes more sense than storing all of the parquet data in the DB as tables, because the collective size of the parquet files I reference in sources.yml is >700 MB.
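To narrow down when that growth happens, the same steps can be repeated with an explicit CHECKPOINT in between (sketch re-using the files from above):

import os
import duckdb

db_file = "recreated.duckdb"
con = duckdb.connect(db_file)
con.execute(open("test_export/schema.sql").read())
print("after creating views:", os.path.getsize(db_file))

con.execute("CHECKPOINT")   # flush the WAL / catalog changes into the main file
print("after checkpoint:   ", os.path.getsize(db_file))

con.close()                 # closing also triggers a checkpoint
print("after close:        ", os.path.getsize(db_file))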

So I imagine that when dbt run is called, staging views are built from those sources, and while no actual table data is stored, those queries still add weight to the DB in the form of some kind of index/metadata. Perhaps the reason the database keeps increasing in size is that this index/query plan data is tied to the views suffixed __dbt_tmp, which get renamed after executing.

Here's a stranger update:

If the pipeline is run with a seed file present, the DB only reaches ~2,000 kb and doesn't grow at all with each successive dbt run.

When I remove the seed file and run again, the database grows as detailed in the first post.

🤷

jwills commented

It doesn't grow because DuckDB re-uses the now-free disk space that it originally allocated in order to perform MVCC operations when it loaded the seed file.
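One way to see that allocated-but-free space is PRAGMA database_size (sketch; the exact columns reported vary a bit between DuckDB versions):

import duckdb

con = duckdb.connect("project.duckdb")
for row in con.execute("PRAGMA database_size").fetchall():
    print(row)
# Reports things like total_blocks, used_blocks and free_blocks; a large
# free_blocks count is allocated space that later runs can reuse instead
# of growing the file.
con.close()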

jwills commented

Looks like a mechanism to shrink the DB size via checkpoints is on the way: duckdb/duckdb#7824