elementary-data/elementary

[ELE-34] Postgres Integration

kwanUm opened this issue · 14 comments

Hi guys, congrats on YC!

I'm looking to test out elementary, so was running it with good ol' jaffle_shop dbt toy example.
I'm however running into errors after installing elementary deps and "dbt run".
(I'm running with a local postgres db and admin permissions).

I guess it doesn't support postgres just yet?

[attached: two screenshots and dbt.log]


Hi @kwanUm,
Thank you!
Yes, we don't support Postgres yet, but we're working on it. Thank you for sharing the artifacts, I will try to reproduce this in our test env.
Are you working with dbt on postgres in prod as well?

Copying a message from https://github.com/cons0l3 from an issue on the dbt package repo:

Thanks for all the work involved in creating a cool tool!

I would love to see support for "postgres".

  • OR -

I have set up my dbt profiles.yml with type "postgres". Elementary should check the current/active dbt profile for the configured type and throw an exception / log an error if the configured type is unsupported.
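For illustration, a guard like the one suggested could look roughly like the following in dbt's Jinja (a sketch only: the macro name and the supported-adapter list here are made up, not elementary's actual code):

```jinja
{% macro assert_supported_adapter() %}
  {# Hypothetical guard: fail fast when the active target's adapter is unsupported. #}
  {% set supported = ['snowflake', 'bigquery', 'redshift', 'databricks'] %}
  {% if target.type not in supported %}
    {{ exceptions.raise_compiler_error(
        "Adapter '" ~ target.type ~ "' is not supported. Supported adapters: " ~ supported | join(', ')
    ) }}
  {% endif %}
{% endmacro %}
```

`target.type` and `exceptions.raise_compiler_error` are part of dbt's standard Jinja context; such a macro could be called from an on-run-start hook so the run stops with a clear message instead of a database syntax error.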

Reproduce:

  1. install Postgres 14, configure your dbt profiles.yml to point to it
  2. run dbt run -s elementary
    OR
  3. run dbt test. I have a test set up which uses the 'accepted_values' built-in test:
      - name: c_mktsegment
        tests:
          - not_null
          - accepted_values:
              values: ['MACHINERY', 'AUTOMOBILE', 'BUILDING', 'FURNITURE', 'HOUSEHOLD']

log of 2.

11:46:03  8 of 21 START table model dev_elementary.dbt_tests ............................. [RUN]
11:46:04  8 of 21 ERROR creating table model dev_elementary.dbt_tests .................... [ERROR in 0.51s]
11:46:04  9 of 21 START incremental model dev_elementary.elementary_test_results ......... [RUN]
[...]
11:46:05  13 of 21 START view model dev_elementary.metrics_anomaly_score ................. [RUN]
11:46:05  Warning: the `current_timestamp` macro is deprecated and will be removed in a future version of the package, once equivalent functionality is implemented in dbt Core. The elementary.metrics_anomaly_score model triggered this warning.
11:46:05  13 of 21 ERROR creating view model dev_elementary.metrics_anomaly_score ........ [ERROR in 0.11s]
[...]
11:46:05  Running 2 on-run-end hooks
11:46:05  1 of 2 START hook: elementary.on-run-end.0 ..................................... [RUN]
11:46:05  1 of 2 OK hook: elementary.on-run-end.0 ........................................ [OK in 0.00s]
11:46:06  Database error while running on-run-end
11:46:06  Encountered an error:
Database Error
  syntax error at or near "dummy_string"
  LINE 6: ...t\n            \n                \n        cast(\'dummy_stri...
[...]
dbt.exceptions.DatabaseException: Database Error
  syntax error at or near "dummy_string"
  LINE 6: ...t\n            \n                \n        cast(\'dummy_stri...

log for 3.

11:48:03  Running 2 on-run-end hooks
11:48:04  Database error while running on-run-end
11:48:04  Encountered an error:
Database Error
  syntax error at or near "MACHINERY"
  LINE 6: ...rom all_values\nwhere value_field not in (\n    \'MACHINERY\...
                                                               ^
11:48:04  Traceback (most recent call last):
  File ".conda\envs\dbt-play\lib\site-packages\dbt\adapters\postgres\connections.py", line 65, in exception_handler
    yield
  File ".conda\envs\dbt-play\lib\site-packages\dbt\adapters\sql\connections.py", line 70, in add_query
    cursor.execute(sql, bindings)
psycopg2.errors.SyntaxError: syntax error at or near "MACHINERY"
LINE 6: ...rom all_values\nwhere value_field not in (\n    \'MACHINERY\...

Thanks Carsten!

I agree with your point that we should think of a way to notify the user that their adapter is not supported.
We do it in the CLI, but there might not be an easy way to do it in the dbt package as well (unfortunately dbt does not natively support such a check).

Regarding the integration with Postgres -
We currently don't officially support it.
However, the dbt package is built on top of dbt and supports Redshift which is very similar to Postgres, so it shouldn't be crazy hard to integrate with it.
That said, we prioritize integrations based on demand from the community, so we will track this issue and see if there is more demand.

You are of course welcome to try and implement it yourself, I'll be very happy to provide guidance!

Guidance for implementing Postgres integration:

Generally speaking, we implemented every platform-specific functionality using adapter.dispatch, as dbt recommends. You can see an example in this macro.
Wherever there was a native adapter function or a dbt_utils macro that we could use, we used it.

As Postgres is officially supported by dbt, all the functionality we use in Elementary should be supported natively. Furthermore, Postgres and Redshift are very similar, and the Redshift dbt adapter searches for macros in this order:
Redshift -> Postgres -> Default (see the attached picture).

[image: the Redshift adapter's macro search order]

This means that as a first step, you might want to change the adapter prefixes from Redshift to Postgres and execute. Everything that works - leave it that way. If a macro does not work, revert it to the Redshift version and implement a Postgres-specific one.
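As a sketch of that dispatch pattern (the macro names here are illustrative, not necessarily elementary's actual macros), a platform-specific macro typically looks like:

```jinja
{# Dispatching macro: dbt resolves the right implementation per adapter. #}
{% macro edr_current_timestamp() %}
  {{ return(adapter.dispatch('edr_current_timestamp', 'elementary')()) }}
{% endmacro %}

{# Fallback used when no adapter-specific version exists. #}
{% macro default__edr_current_timestamp() %}
  current_timestamp
{% endmacro %}

{# Postgres-specific implementation; on Redshift, this is found
   before default__ because of the redshift -> postgres -> default order. #}
{% macro postgres__edr_current_timestamp() %}
  now()
{% endmacro %}
```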

We recently decided (due to demand from the community) to add a Databricks integration, and approached it gradually -

Step 1 - Add support for uploading dbt artifacts and run results (in the dbt package).
Step 2 - Add support in the CLI for Slack alerts and UI generation.
Step 3 - Add support for data anomaly detection test (the most complex and platform-specific part of the code right now).

You can check out this PR where I implemented step 1 for Databricks. As you can see, it actually required pretty minor changes.
If you want to give it a shot with Postgres, I would be happy to support you!

Chiming in to provide some technical guidance.

In practice, I think that the necessary change is mostly a search and replace of redshift__ with postgres__ in our dbt package, and everything should work out-of-the-box since Redshift is derived from Postgres under the hood.
After making that change, let's validate that it works by running our E2E tests like so python dbt-data-reliability/integration_tests/run_e2e_tests.py -t postgres.
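That rename could be sketched in Python along these lines (the function name and layout are hypothetical; in practice you would review each macro rather than replace blindly, per the step-by-step approach above):

```python
import os

def port_redshift_macros_to_postgres(macros_dir):
    """Hypothetical helper: rewrite redshift__ macro prefixes to postgres__
    across a dbt package's .sql macro files, returning the changed paths."""
    changed = []
    for root, _dirs, files in os.walk(macros_dir):
        for fname in files:
            if not fname.endswith(".sql"):
                continue
            path = os.path.join(root, fname)
            with open(path) as f:
                text = f.read()
            if "redshift__" in text:
                with open(path, "w") as f:
                    f.write(text.replace("redshift__", "postgres__"))
                changed.append(path)
    return changed
```

After running something like this, each macro that still fails on Postgres would be reverted to its Redshift version and given a proper postgres__ implementation.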

Hopefully that helps, let me know if you have any other questions.
Thanks!

Hi, I am interested in contributing to the Postgres integration.

Thank you @kouridis, I just assigned the issue to you :)

Hi @elongl,

Current status of postgres - elementary integration

  • [18/20] postgres TESTS PASSED (python integration_tests/run_e2e_tests.py -t postgres)

  • TESTS FAILED

    1. dimension_anomalies: FAILED: the dimension anomalies test failed because it has too many failed/errored tests (python integration_tests/run_e2e_tests.py -t postgres -e dimension)
    2. table_anomalies: FAILED: [] has different length than ['string_column_anomalies', 'numeric_column_anomalies', 'string_column_anomalies

I haven't yet figured out the cause of, or found a solution for, the above issues.

There is one more problem that I have managed to work around for now, though not in the most appropriate way, I think. I need your feedback on that.

The max table name length on Postgres is 63 characters. Because of the long prefixes on the test schema tables and the 63-char limit, some tables end up with the same name.

In order to bypass this and check the actual outcome of the tests, I made the following change here (kept the last 63 chars as table name instead of first 63):

{% set table_name_with_suffix = table_name[-(relation_max_name_length - suffix_length):] ~ suffix %}
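The same truncation logic can be mirrored in standalone Python (a sketch; the function name is made up): keeping the last characters preserves the distinguishing end of long names that share a common prefix.

```python
def table_name_with_suffix(table_name, suffix, relation_max_name_length=63):
    """Hypothetical mirror of the Jinja change: truncate from the FRONT so
    that name + suffix fits the adapter's identifier limit (63 on Postgres)."""
    available = relation_max_name_length - len(suffix)
    # Negative slice keeps the last `available` chars; short names pass through.
    return table_name[-available:] + suffix
```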

I would appreciate any help on the 2 failing tests. Also, how do you suggest I proceed from here?

Hi @kouridis ! Thanks for the update, great progress!

(kept the last 63 chars as table name instead of first 63)

Sounds to me like a good solution.

Regarding the failing tests, you can see what the test does and what it expects for table_anomalies and for dimension_anomalies.

I suggest you run the test (python integration_tests/run_e2e_tests.py -t postgres -e table or python integration_tests/run_e2e_tests.py -t postgres -e dimension) and compare the result of the query in the test to what it expects and try to understand what change causes that difference.

Both tests query alerts_anomaly_detection, which is elementary_test_results filtered by failing or errored tests.
It seems that the rows in elementary_test_results either don't appear or appear differently than on other adapters.
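For reference, the filtering described above amounts to something like the following query (the status column name and values are assumptions for illustration, not necessarily elementary's actual schema):

```sql
-- Hypothetical sketch of "elementary_test_results filtered by
-- failing or errored tests" (column names are assumptions).
select *
from elementary_test_results
where lower(status) in ('fail', 'error');
```

Comparing the rows this returns on Postgres against another adapter should narrow down where the difference comes from.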

What do you think?

Hello! We're using postgres and would like to use elementary, and I can help out some. This has been quiet for a few weeks. @kouridis, are you still working on this, and/or do you have a branch you could push?

Hi @ckdake, yes, I am still working on the postgres integration, but other things have slowed me down a bit. Right now 19/20 tests pass. Here https://github.com/kouridis/dbt-data-reliability/tree/postgres-integration you can find my branch for the postgres integration. Maybe @elongl can help by creating a postgres-integration branch under the elementary/dbt-data-reliability repo so I can push my changes there, making it easier for you to help if you like.

@ckdake @kouridis

Definitely! I created the branch, you can create a pull request to it here.
Sounds like a great idea if you'd collaborate on the integration.
Of course I will also review the PR and share with you some of my thoughts.
Thanks!

Hi everyone!
I'm very glad to inform you that we're nearly ready to release support for Postgres thanks to @kouridis's implementation.
The team will review the changes and hopefully we'll be able to deliver it on the next version.
Thanks a lot for the contribution.

Postgres support will be released in the upcoming version.