Escaping identifiers makes them different
Matts966 opened this issue · 4 comments
`dataset.table` and dataset.table are recognized as different identifiers.
Unifying the identifier style across queries works around this problem for now.
On our end, we handle this by passing table names through a bigquery_normalize_identifier(input, default_project_id) function in our consuming Python application.
Since BigQuery is currently our only target, this logic could be pushed down into AlphaSQL as well.
Background on how BigQuery handles unqualified table names
The normalisation function receives two arguments: the identifier and the default BigQuery project id the query should be executed in.
BigQuery supports queries against tables identified only by their dataset and table name, e.g. core.orders will query the table orders in the dataset core, and the project_id (i.e. database) is inferred from the caller/where the job is executed.
A fully qualified table identifier can be written like so, enclosed in backtick (`) characters: `project_id.dataset.table_name`.
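A minimal sketch of such a normalisation function, assuming the signature described above (per-part quoting such as `project`.dataset.table is deliberately not handled here):

```python
def bigquery_normalize_identifier(identifier: str, default_project_id: str) -> str:
    """Return a fully qualified, backtick-quoted BigQuery table identifier."""
    # Strip surrounding backticks so `core.orders` and core.orders compare equal.
    name = identifier.strip("`")
    parts = name.split(".")
    if len(parts) == 2:
        # dataset.table: fall back to the default project id.
        parts = [default_project_id, *parts]
    elif len(parts) != 3:
        raise ValueError(f"unexpected identifier: {identifier!r}")
    return "`" + ".".join(parts) + "`"

# bigquery_normalize_identifier("core.orders", "project-123")
# -> "`project-123.core.orders`"
# bigquery_normalize_identifier("`project-123.core.orders`", "project-123")
# -> "`project-123.core.orders`"
```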
AlphaSQL could handle table name normalisation
I can imagine AlphaSQL could normalise all identifiers it finds in a given set of queries.
Imagine the following setup:
Given a query foo.sql that creates a table:
-- samples/issue-1/foo.sql
create or replace table core.foo as
select 'foo' as col1
And a query assert_foo.sql that queries the other table:
-- samples/issue-1/assert_foo.sql
select count(1) > 0
from `project-123.core.foo`
where col1 = 'foo'
Now AlphaSQL could support command-line arguments to "qualify" all identifiers it encounters:
- --default_project_id= to provide a default BigQuery project id
- --qualify_identifiers a flag to enable the feature when parsing
# Imaginary example
$ alphadag --with_tables --default_project_id=project-123 --qualify_identifiers samples/issue-1/
Currently AlphaSQL outputs the following information:
# At the time of writing, AlphaSQL does not yet support these command line flags
$ alphadag --with_tables samples/issue-1/
Reading paths passed as a command line arguments...
Only files that end with .sql or .bq are analyzed.
Reading "samples/issue-1/assert_foo.sql"
Reading "samples/issue-1/foo.sql"
digraph G {
0 [label="samples/issue-1/assert_foo.sql", shape="", type=query];
1 [label="samples/issue-1/foo.sql", shape="", type=query];
2 [label="core.foo", shape=box, type=table];
3 [label="project-123.core.foo", shape=box, type=table];
1->2 ;
3->0 ;
}
EXTERNAL REQUIRED TABLES:
project-123.core.foo
But it could (given the imaginary options) output the following normalised information. Note the labels on the digraph output.
$ alphadag --with_tables --default_project_id=project-123 --qualify_identifiers samples/issue-1/
Reading paths passed as a command line arguments...
Only files that end with .sql or .bq are analyzed.
Reading "samples/issue-1/assert_foo.sql"
Reading "samples/issue-1/foo.sql"
digraph G {
0 [label="samples/issue-1/assert_foo.sql", shape="", type=query];
1 [label="samples/issue-1/foo.sql", shape="", type=query];
2 [label="`project-123.core.foo`", shape=box, type=table];
1->2 ;
2->0 ;
}
EXTERNAL REQUIRED TABLES:
@lukas-at-harren
Thank you for your detailed verification!
I think this issue is not an actual problem because `table` and table are now recognized as the same.
Are there any edge cases for this issue?
Also, project_id is a dynamic property, and making it an option seems complex to me.
We automatically suggest unifying project.dataset1.table1 and dataset1.table1 using Python.
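Roughly, that suggestion step could be sketched like this, reusing the bigquery_normalize_identifier sketch above (the helper name and structure are hypothetical, not the actual code):

```python
from collections import defaultdict

def suggest_unifications(identifiers, default_project_id):
    """Group spellings that resolve to the same table after qualification."""
    groups = defaultdict(set)
    for ident in identifiers:
        groups[bigquery_normalize_identifier(ident, default_project_id)].add(ident)
    # Any group with more than one spelling is a candidate for unification.
    return {full: names for full, names in groups.items() if len(names) > 1}

# suggest_unifications({"dataset1.table1", "project.dataset1.table1"}, "project")
# -> {"`project.dataset1.table1`": {"dataset1.table1", "project.dataset1.table1"}}
```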
Could this issue be closed?
Found that turning `dataset1.table1` into dataset1.table1 causes an error when dataset1.table1 appears in JSON. The JSON input is dataset1.table1, and it is weird.
Currently, CREATE statements for resources like dataset.resource_name are not supported, but ones like `dataset.resource_name` are.
I found that this is because the nested catalog is not initialized. Instead, the current implementation simply adds resources with identifier vectors, which leads to errors like Resources not found: resource_name.
We should improve the way we call the AddOwned~ functions.
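Conceptually, the fix is to create a child catalog for every intermediate segment of the identifier vector instead of registering the whole vector under one flat name. A toy sketch of that idea (illustrative Python only, not the ZetaSQL catalog API):

```python
class Catalog:
    """Toy stand-in for a nested catalog."""
    def __init__(self, name):
        self.name = name
        self.children = {}   # nested catalogs, e.g. datasets
        self.resources = {}  # tables, functions, ...

    def add_resource(self, path, resource):
        # Walk ["dataset", "resource_name"], materialising a child catalog
        # for each intermediate segment so that dataset.resource_name
        # resolves instead of failing with "Resources not found".
        catalog = self
        for segment in path[:-1]:
            catalog = catalog.children.setdefault(segment, Catalog(segment))
        catalog.resources[path[-1]] = resource

root = Catalog("root")
root.add_resource(["dataset", "resource_name"], object())
assert "resource_name" in root.children["dataset"].resources
```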
For now, the workaround is simply to escape the identifier, like `dataset.resource_names`.