GlareDB/glaredb

bug: infer db name on SELECT and not just on CREATE

talagluck opened this issue · 6 comments

Description

This is a strange issue I'm hitting when trying to run dbt with GlareDB Cloud. The short of it is that this code works:

  create or replace view "empty_rain"."public"."my_view"

but this code does not:

select *
from "empty_rain"."public"."my_view"

where "empty_rain" is the name of my GlareDB Cloud deployment.

For the SELECT query, this will work:

select *
from "default"."public"."my_view"

So essentially, we have deployment name -> db name (default) -> schema name -> table or view name.

So I think what is happening is that when we create a table or a view, when we pass in deployment_name, we are able to assume the default DB name. But when we try this in a SELECT statement, it does not work. Incidentally, the below code also DOES NOT work:

select *
from "empty_rain"."default"."public"."my_view"

All of this matters, because when I configure my credentials for dbt, it asks me for my database name. When I configure this, I need to pass in my deployment name, or else I get a connection error. But if I try and select from dbt_model_A in dbt_model_B, when dbt compiles my queries, it looks for any existing tables at "empty_rain"."public"."dbt_model_A", which fails.

we don't currently support CREATE DATABASE semantics, so the first part of the query is getting silently dropped.

create or replace view "empty_rain"."public"."my_view"; is being interpreted as
create or replace view "public"."my_view";

> create database "empty_rain";
Error: SQL statement currently unsupported: CREATE DATABASE "empty_rain"

For some additional context:
Basically dbt lets you have models, which are really just individual SQL queries, which you can string together into a pipeline of transformations. So imagine the following:

  • you have an existing events table in your database.
  • dbt lets you write a query to select all of the events from the US, called us_events
  • then you write another query which gets a count of the number of US events per month, called us_events_per_month

Your query to select all of the US events would like something like:

SELECT * FROM {{ ref('events') }} WHERE country='US'

Your query to aggregate by month would be something like:

SELECT month, count(*) FROM {{ ref('us_events') }} GROUP BY month

The curly braces are Jinja, which is a templating language. When you run dbt, it first compiles your queries written in Jinja. So the first query would look something like:

CREATE VIEW us_events AS 
SELECT * FROM "db_name"."schema_name"."events"

Then it runs all of your queries to create views or tables in your database. Depending on your materialization settings, you'd wind up with view or tables for each of the models that you've run.

It populates the database and schema names based on how the project is set-up, which afaik relies on the database name used in your connection credentials. So when I specify my credentials, under dbname, I'm currently using empty_rain which is the name of my deployment. But when I select from another model, I need to use default which is the name of my database.

@talagluck are you able to provide a minimum reproducible example? that would help greatly with troubleshooting!

If you'd like a fully functioning dbt example to reproduce this, you can clone https://github.com/GlareDB/dbt_glaredb_sample and follow the instructions in the README (please let me know if you hit any issues or have questions).

Otherwise, the minimum reproducible example would be getting the SQL queries above to work. Ideally, either this query would work:

select *
from "empty_rain"."public"."my_view"

or else this connection string would work:

postgresql://xf0IKXRqc0YH:<PASS>@o_vAIOmhx.proxy.qa.glaredb.com:6543/empty_rain

Is the second one possible? Do we currently have any cases with more than one deployment per host?

The alternatives are building a custom dbt adapter (which is in progress, but I was hoping that the Postgres route would be viable, too), or else building a custom dbt query compiler, which I think is do-able, but will add friction which I'd rather avoid.

Implementation notes...

Introduce the concept of "default database alias" to allow for both of these queries to work:

select * from default.public.my_view;
select * from empty_ran.public.my_view;

We have the deployment name available to us when going through either the postgres or rpc proxy, so what we should do is pass that on to the glaredb node similarly to how we pass on database id (we might already be doing that, idk).

When we create the session, we store that name somewhere. I'm partial to storing it on the SessionCatalog:

pub struct SessionCatalog {
    /// Optional alias for referencing objects in this catalog
    alias: Option<String>,
    ...
}

And then we'd make a change during resolving here: https://github.com/glaredb/glaredb/blob/bae7fcd4cda8bcfb91ab8e09fe881e4a55c1488f/crates/sqlexec/src/resolve.rs#L106-L140

There's logic for checking for "default" as the database name, and we'd just extend it to also check for the alias if the catalog has one.

I don't imagine anything else will need to change for this, and local/embedded instances will continue to only have "default" as the only way to reference the database.

Re-opening per the Slack discussion