Matts966/alphasql

Escaping identifiers makes them different

Matts966 opened this issue · 4 comments

`dataset.table` and dataset.table are recognized as different identifiers.

Unifying the identifier style across queries prevents this problem for the time being.

On our end, we handle this by passing table names through a "bigquery_normalize_identifier(input, default_project_id)" function in our consuming Python application.
Since BigQuery is currently our only target, this logic could be pushed down into AlphaSQL as well.

Background on how BigQuery handles unqualified table names

The normalisation function receives two arguments: the identifier and the default BigQuery project id the query should be executed in.

BigQuery supports queries against tables identified only by their dataset and table name, e.g. core.orders queries the table orders in the dataset core, and the project_id (i.e. the database) is inferred from the caller, i.e. from where the job is executed.

A fully qualified table identifier can be written like so, encased in two backtick (`) characters: `project_id.dataset.table_name`
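
A minimal Python sketch of what such a normalisation function could look like, following the two rules above (illustrative only, not our exact implementation):

def bigquery_normalize_identifier(input: str, default_project_id: str) -> str:
    # Strip surrounding backticks so `core.orders` and core.orders compare equal.
    parts = input.strip("`").split(".")
    if len(parts) == 2:
        # dataset.table: qualify with the default project id, as BigQuery does.
        parts = [default_project_id] + parts
    if len(parts) != 3:
        raise ValueError(f"expected dataset.table or project.dataset.table, got {input!r}")
    # Re-quote so every identifier has a single canonical spelling.
    return "`" + ".".join(parts) + "`"

assert bigquery_normalize_identifier("core.orders", "project-123") == "`project-123.core.orders`"
assert bigquery_normalize_identifier("`project-123.core.orders`", "project-123") == "`project-123.core.orders`"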

AlphaSQL could handle table name normalisation

I can imagine AlphaSQL normalising all identifiers it finds in a given set of queries.
Imagine the following setup:

Given a query foo.sql that creates a table:

-- samples/issue-1/foo.sql
create or replace table core.foo as
select 'foo' as col1

And a query assert_foo.sql that queries the other table:

-- samples/issue-1/assert_foo.sql
select count(1) > 0
from `project-123.core.foo`
where col1 = 'foo'

Now AlphaSQL could support command-line arguments to "qualify" all identifiers it encounters:

  • --default_project_id= to provide a default BigQuery project id
  • --qualify_identifiers, a flag to enable the feature during parsing

# Imaginary example
$ alphadag --with_tables --default_project_id=project-123 --qualify_identifiers samples/issue-1/

Currently AlphaSQL outputs the following information:

# At the time of writing, AlphaSQL does not yet support these command-line flags
$ alphadag --with_tables samples/issue-1/
Reading paths passed as a command line arguments...
Only files that end with .sql or .bq are analyzed.
Reading "samples/issue-1/assert_foo.sql"
Reading "samples/issue-1/foo.sql"
digraph G {
0 [label="samples/issue-1/assert_foo.sql", shape="", type=query];
1 [label="samples/issue-1/foo.sql", shape="", type=query];
2 [label="core.foo", shape=box, type=table];
3 [label="project-123.core.foo", shape=box, type=table];
1->2 ;
3->0 ;
}
EXTERNAL REQUIRED TABLES:
project-123.core.foo

But given the imaginary options, it could output the following normalised information. Note the labels in the digraph output.

$ alphadag --with_tables --default_project_id=project-123 --qualify_identifiers samples/issue-1/
Reading paths passed as a command line arguments...
Only files that end with .sql or .bq are analyzed.
Reading "samples/issue-1/assert_foo.sql"
Reading "samples/issue-1/foo.sql"
digraph G {
0 [label="samples/issue-1/assert_foo.sql", shape="", type=query];
1 [label="samples/issue-1/foo.sql", shape="", type=query];
2 [label="`project-123.core.foo`", shape=box, type=table];
1->2 ;
2->0 ;
}
EXTERNAL REQUIRED TABLES:

@lukas-at-harren
Thank you for your detailed verification!

I think that this issue is not an actual problem because `table` and table are now recognized as the same.
Are there any edge cases for this issue?

Also, project_id is a dynamic property, and making it an option seems complex to me.
We automatically suggest unifying project.dataset1.table1 and dataset1.table1 using Python.
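
A minimal sketch of that kind of suggestion logic in Python (illustrative only, not the actual script):

import collections

def suggest_unification(identifiers):
    # Group spellings that share the same trailing dataset.table, so that
    # project.dataset1.table1 and dataset1.table1 end up in one bucket.
    groups = collections.defaultdict(set)
    for ident in identifiers:
        parts = ident.strip("`").split(".")
        groups[".".join(parts[-2:])].add(ident)
    # Only buckets with more than one spelling need a unification hint.
    return {key: sorted(group) for key, group in groups.items() if len(group) > 1}

print(suggest_unification(["project.dataset1.table1", "dataset1.table1"]))
# {'dataset1.table1': ['dataset1.table1', 'project.dataset1.table1']}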

Could this issue be closed?

Found that turning `dataset1.table1` into dataset1.table1 causes an error when dataset1.table1 is in JSON.
The JSON input is dataset1.table1, which is weird.

Currently, CREATE statements for resources like dataset.resource_name are not supported, but CREATE statements for resources like `dataset.resource_name` are.

I found that this is because the nested catalog is not initialized. Instead, the current implementation simply adds resources with identifier vectors, which leads to errors like Resources not found: resource_name.
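
A hypothetical sketch of the failure mode (illustrative Python, not the actual C++ implementation):

catalog = {}

# As described above: the identifier vector is added as one flat entry,
# without initializing a nested "dataset" catalog first.
catalog[("dataset", "resource_name")] = object()

# A resolver that walks catalogs level by level then finds nothing:
nested = catalog.get("dataset", {})
if "resource_name" not in nested:
    print("Resources not found: resource_name")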

We should improve the way the AddOwned~ functions are called.

For now, the workaround is simply to escape identifiers, like `dataset.resource_name`.