googleapis/python-bigquery-sqlalchemy

BigQuery-SQLAlchemy incorrectly renames valid column names

Closed this issue · 1 comments

This line of code inside of the BigQuery compiler: https://github.com/googleapis/python-bigquery-sqlalchemy/blob/main/sqlalchemy_bigquery/base.py#L117-L118 substitutes _ in for any illegal characters as part of a BigQuery column identifier.

This check is more aggressive than it seems BigQuery actually is.

In the below examples, a name is changed by BigQuery-SQLAlchemy that works in the BigQuery console.

Name changes are pretty disruptive, as there is no warning or error, and typically there is downstream code dependent on identifiers remaining stable... them changing during compilation essentially breaks any downstream consumer unless they coordinate on the specific rules BigQuery-SQLAlchemy encodes internally.

I also think the SQLAlchemy https://docs.sqlalchemy.org/en/20/core/sqlelement.html#sqlalchemy.sql.expression.quoted_name API contract is not being respected in this case. That API specifies:

A quoted_name object with quote=True is also prevented from being modified in the case of a so-called “name normalize” option.

which to me implies that no automatic translation should be happening from the BigQuery compiler either. If removing the automatic substitution of characters isn't feasible for whatever reason, respecting that API as an escape hatch may be a more reasonable option.

Environment details

  • OS type and version: macOS 14.5
  • Python version: Python 3.9.6
  • pip version: pip 23.3.2
  • sqlalchemy-bigquery version: Version: 1.5.0

Steps to reproduce

  1. Use sqlalchemy.sql.column("some column").label("labeled column")
  2. Specify this element in a query
  3. BigQuery-SQLAlchemy will render this as
`some column` AS `labeled_column`

instead of:

`some column` AS `labeled column`

Code example

engine = None # auth to your BigQuery engine
query = sa.select(sa.column("some column").label("labeled column")).select_from(
     sa.table("table")
    )
# compile the query
query_str = str(query.compile(engine))
print(query_str)
""" # this query is incorrect; it renamed `labeled column` to `labeled_column`
SELECT `some column` AS `labeled_column` 
FROM `table`
"""

However, if we write the query ourselves, we are allowed to use labeled column as an identifier:

SELECT `some column` AS `labeled column` 
FROM `table`

The above runs correctly in the BigQuery console.

I'm dumb and this is fixed in recent versions of BigQuery-SQLAlchemy. 🤦‍♂️
My sincere apologies.