dagster-io/talks

[motherduck-demo] it's not clear how a user could configure this project to run locally

Opened this issue ยท 16 comments

Description

There are two configuration files that have to be modified to allow this project to run locally (./.env and ./reports/sources/dagster_hybrid_demo/connection.yaml), and this isn't very clear to the user.

Firstly, setting DUCKDB_DATABASE="local.duckdb" as is explained in the .env.example file will result in the dbt assets failing to materialize as I believe dbt assumes this path is relative to its runtime directory, meaning it can't locate the database. Instead, it's mandatory to use the absolute path in the .env file like so: DUCKDB_DATABASE="/home/maxisq/dev/dagster-devrel-fork/motherduck-dagster-hybrid-compute/local.duckdb".

Secondly, the EVIDENCE_SOURCE__DAGSTER_HYBRID_DEMO__FILENAME environment variable isn't referenced in the evidence project files isn't respected by evidence meaning it will fail to find the local database and dagster will fail to materialize the evidence assets. Because of this it's necessary to modify the connection.yaml file in the evidence sources directory, where you unintuitively have to use a relative path, using an absolute paths will result in evidence modifying the path provided at run time which causes the same error.

Proposed Solution

Firstly we can change the instructions in .env.example to properly indicate how to configure the project for local use, using absolute paths in the .env file.

Secondly we can either find a way to configure evidence such that it reads from the EVIDENCE_SOURCE__DAGSTER_HYBRID_DEMO__FILENAME environment variable properly, or give users instructions to modify the connection.yaml file to point to the local duckdb file using a relative path.

Will submit a PR fix tomorrow.

From the Evidence documentation it's clear that the evidence environment variable is meant to be automatically picked up by evidence, but it's not respecting it for some reason, may look into this further.

https://docs.evidence.dev/reference/cli/#environment-variables

From the Evidence documentation it's clear that the evidence environment variable is meant to be automatically picked up by evidence, but it's not respecting it for some reason, may look into this further.

https://docs.evidence.dev/reference/cli/#environment-variables

Regarding the evidence environment variable issue -- I also experienced this in that Evidence was not respecting the environment variables being set despite it matching their documentation. I got around this by defining the connection in the /settings page of the Evidence project, but it would be much more ideal if the environment variable could be used, and the user did not need to perform any manual steps.

The evidence team is very responsive on Slack, and I'll shoot a message there way to get their input.

I've started a thread here if you would like to follow along:
https://evidencedev.slack.com/archives/C023LRB9Z40/p1714315726200319

Unfortunately I'd have to ask the evidence team for an invite:

image

I noticed from their documentation the env variable is case sensitive, it's defined as EVIDENCE_SOURCE__DAGSTER_HYBRID_DEMO__FILENAME where maybe it should be EVIDENCE_SOURCE__dagster_hybrid_demo__filename, I'm testing a few other changes at the moment but will try that shortly.

Example below, seems like the case sensitivity may be the issue. Tested this running the commands myself but will now try changing the env variable in the .env file to see if dagster properly exports it before running evidence.

(venv) maxisq@MS-V-Desktop:~/dev/dagster-devrel-fork/motherduck-dagster-hybrid-compute$ export EVIDENCE_SOURCE__DAGSTER_HYBRID_DEMO__FILENAME="../../../data/db/local.duckdb"
(venv) maxisq@MS-V-Desktop:~/dev/dagster-devrel-fork/motherduck-dagster-hybrid-compute$ echo $EVIDENCE_SOURCE__DAGSTER_HYBRID_DEMO__FILENAME
../../../data/db/local.duckdb
(venv) maxisq@MS-V-Desktop:~/dev/dagster-devrel-fork/motherduck-dagster-hybrid-compute$ make evidence-sources
npm --prefix ./reports run sources

> my-evidence-project@0.0.1 sources
> evidence sources

(node:28943) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead.
(Use `node --trace-deprecation ...` to show where the warning was created)
Processing dagster_hybrid_demo
Missing required duckdb option 'filename' (/home/maxisq/dev/dagster-devrel-fork/motherduck-dagster-hybrid-compute/reports/sources/dagster_hybrid_demo)
  all_ducks โœ– Catalog Error: Table with name all_birds does not exist!
Did you mean "system.information_schema.tables"?
LINE 20: from main.all_birds
              ^
  top_ducks_by_region โœ– Catalog Error: Table with name top_ducks_by_region does not exist!
Did you mean "duckdb_indexes"?
LINE 8: from main.top_ducks_by_region
             ^
  top_ducks_by_state โœ– Catalog Error: Table with name top_ducks_by_state does not exist!
Did you mean "duckdb_tables"?
LINE 7: from main.top_ducks_by_state
             ^
  top_ducks_by_year โœ– Catalog Error: Table with name top_ducks_by_year does not exist!
Did you mean "duckdb_types"?
LINE 9: from main.top_ducks_by_year
             ^
(venv) maxisq@MS-V-Desktop:~/dev/dagster-devrel-fork/motherduck-dagster-hybrid-compute$ export EVIDENCE_SOURCE__DAGSTER_HYBRID_DEMO__FILENAME="../../../data/db/local.duckdb"
export EVIDENCE_SOURCE__dagster_hybrid_demo__filename="../../../data/db/local.duckdb"
(venv) maxisq@MS-V-Desktop:~/dev/dagster-devrel-fork/motherduck-dagster-hybrid-compute$ echo $EVIDENCE_SOURCE__dagster_hybrid_demo__filename
../../../data/db/local.duckdb
(venv) maxisq@MS-V-Desktop:~/dev/dagster-devrel-fork/motherduck-dagster-hybrid-compute$ make evidence-sources
npm --prefix ./reports run sources

> my-evidence-project@0.0.1 sources
> evidence sources

(node:32066) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead.
(Use `node --trace-deprecation ...` to show where the warning was created)
Processing dagster_hybrid_demo
  all_ducks โœ” Finished. 12782 rows
  top_ducks_by_region โœ” Finished. 30 rows
  top_ducks_by_state โœ” Finished. 891 rows
  top_ducks_by_year โœ” Finished. 1738 rows
(venv) maxisq@MS-V-Desktop:~/dev/dagster-devrel-fork/motherduck-dagster-hybrid-compute$

Okay setting the environment variables like so in the .env file fixed it, will submit a PR with proper instructions:

EVIDENCE_SOURCE__dagster_hybrid_demo__type="duckdb"
EVIDENCE_SOURCE__dagster_hybrid_demo__filename="../../../data/db/local.duckdb"

Okay setting the environment variables like so in the .env file fixed it, will submit a PR with proper instructions:


EVIDENCE_SOURCE__dagster_hybrid_demo__type="duckdb"

EVIDENCE_SOURCE__dagster_hybrid_demo__filename="../../../data/db/local.duckdb"

Nice! Fantastic job getting to the bottom of that. That really simplifies getting up and running.

cmpadden it'd still be really cool if the evidence team provided some advice on absolute paths, they'd be preferable because having to specify the environment variable as a relative path from the reports/sources/project_name/ folder isn't super user friendly, their docs say that your datasource should be stored in that same directory but that's not really realistic for projects where you're using multiple tools.

For reference this is what happens if you use an absolute path:

(venv) maxisq@MS-V-Desktop:~/dev/dagster-devrel-fork/motherduck-dagster-hybrid-compute$ export EVIDENCE_SOURCE__dagster_hybrid_demo__filename="/home/maxisq/dev/dagster-devrel-fork/motherduck-dagster-hybrid-compute/data/db/local.duckdb"
(venv) maxisq@MS-V-Desktop:~/dev/dagster-devrel-fork/motherduck-dagster-hybrid-compute$ make evidence-sources
npm --prefix ./reports run sources

> my-evidence-project@0.0.1 sources
> evidence sources

(node:31327) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead.
(Use `node --trace-deprecation ...` to show where the warning was created)
Processing dagster_hybrid_demo
[!] dagster_hybrid_demo failed to connect; Catalog Error: Cannot open database "/home/maxisq/dev/dagster-devrel-fork/motherduck-dagster-hybrid-compute/reports/sources/dagster_hybrid_demo/home/maxisq/dev/dagster-devrel-fork/motherduck-dagster-hybrid-compute/data/db/local.duckdb" in read-only mode: database does not exist
make: *** [Makefile:23: evidence-sources] Error 1
(venv) maxisq@MS-V-Desktop:~/dev/dagster-devrel-fork/motherduck-dagster-hybrid-compute$ echo $EVIDENCE_SOURCE__dagster_hybrid_demo__filename
/home/maxisq/dev/dagster-devrel-fork/motherduck-dagster-hybrid-compute/data/db/local.duckdb

You can see that for some reason evidence injects the source directory into the path, which is redundant if you're doing an absolute path, probably something to do with their docs saying that they assume the datasource will be stored in that same source directory so their default method is just to prepend that path themselves rather than checking the one provided first.

Unfortunately I'd have to ask the evidence team for an invite:

image

Missed this message, but you should be able to join at this link - https://slack.evidence.dev/

Dropping a timestamped video link here as an option for solving this issue, not 100% certain this could solve it just thinking out loud.

We could specify a LOCAL and PROD dictionary for environment variables that may enable us to seamlessly deploy, just an idea.

https://www.youtube.com/live/kpco5u1zG9Y?si=zfSZIrlZR3e4pW1d?t=30m14s

Seems like the IO managers and having separate environment variables for local vs prod may be the way after watching that vid, will test it out myself when i get time outside my day job.

Seems like the IO managers and having separate environment variables for local vs prod may be the way after watching that vid, will test it out myself when i get time outside my day job.

Just one note here is that we are de-emphasizing the use of IO managers, so I don't think I'd necessarily want to start leveraging that in this demo. But I think we should be able to do the local/prod environment variables with our resources.

Oh interesting, why is it that you're de-emphasising their use? Not to worry will work out another way.

Oh interesting, why is it that you're de-emphasising their use? Not to worry will work out another way.

They're still a totally valid approach, but I think they can be a bit "magic" in how they operator, so they're maybe not the most intuitive demonstration for people who are new to Dagster. I'm not opposed to using them, but want to make sure the code remains easy to understand!