crflynn/databricks-dbapi

Unable to reflect table

ConstantinoSchillebeeckx opened this issue · 2 comments

Hi there!

I'm going through your example of using this with SQLAlchemy. I've got a Databricks cluster spun up, have generated a PAT, and am able to successfully connect. However, whenever I try to do something like

Table("my_table", MetaData(bind=engine), autoload=True)

I get

expected string or bytes-like object
Traceback (most recent call last):
  File "/scripts/pipelines/airflow/operators/mysql_to_detabricks_operator.py", line 137, in databricks_table_sa
    self._databricks_table, MetaData(bind=self.databricks_sqla), autoload=True
  File "<string>", line 2, in __new__
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/deprecations.py", line 139, in warned
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/sql/schema.py", line 559, in __new__
    metadata._remove_table(name, schema)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/langhelpers.py", line 69, in __exit__
    exc_value, with_traceback=exc_tb,
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 178, in raise_
    raise exception
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/sql/schema.py", line 554, in __new__
    table._init(name, metadata, *args, **kw)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/sql/schema.py", line 648, in _init
    resolve_fks=resolve_fks,
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/sql/schema.py", line 689, in _autoload
    _extend_on=_extend_on,
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 2221, in run_callable
    return conn.run_callable(callable_, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1660, in run_callable
    return callable_(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 467, in reflecttable
    table, include_columns, exclude_columns, resolve_fks, **opts
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/reflection.py", line 666, in reflecttable
    table_name, schema, **table.dialect_kwargs
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/reflection.py", line 392, in get_columns
    self.bind, table_name, schema, info_cache=self.info_cache, **kw
  File "/usr/local/lib/python3.7/site-packages/pyhive/sqlalchemy_hive.py", line 319, in get_columns
    col_type = re.search(r'^\w+', col_type).group(0)
  File "/usr/local/lib/python3.7/re.py", line 185, in search
    return _compile(pattern, flags).search(string)

Digging a little bit, I've discovered that the return of DESCRIBE my_table returns something like:
image

whereas this line tells me it's expecting to break on "# Partition Information". Therefore col_type ends up being None when the re.search is being run.

What am I missing? I've inherited this code, and I'm new to whole pyspark/hive/etc space so I'm not sure this library is even the culprit. Please let me know what other information I can provide. FWIW here's my pip freeze

alabaster==0.7.12
alembic==1.4.2
amqp==2.6.1
ansiwrap==0.8.4
apache-airflow==1.10.9
apispec==1.3.3
appdirs==1.4.3
appnope==0.1.2
argcomplete==1.11.1
arrow==0.14.2
asn1crypto==0.24.0
astroid==2.2.5
async-generator==1.10
attrs==19.3.0
Authlib==0.14.3
autopep8==1.5.4
avro-python3==1.10.0
aws-xray-sdk==0.95
azure-common==1.1.8
azure-nspkg==2.0.0
azure-storage-blob==0.37.1
azure-storage-common==0.37.1
azure-storage-nspkg==2.0.0
Babel==2.8.0
backcall==0.2.0
bcrypt==3.1.7
billiard==3.6.3.0
black==20.8b1
boto==2.49.0
boto3==1.7.84
botocore==1.10.84
bumpversion==0.5.3
cached-property==1.5.1
cachetools==4.1.0
cattrs==0.9.2
celery==4.4.4
certifi==2020.4.5.1
cffi==1.14.0
cfgv==3.2.0
chardet==3.0.4
click==7.1.2
colorama==0.4.3
colorlog==4.0.2
configparser==3.5.3
cookies==2.2.1
coverage==4.5.3
croniter==0.3.32
cryptography==2.8
cycler==0.10.0
databricks-connect==6.5.1
databricks-dbapi==0.3.0
databricks-test==0.0.4
decorator==4.4.2
defusedxml==0.6.0
dill==0.3.1.1
distlib==0.3.1
dnspython==1.16.0
doc8==0.8.1
docker==4.4.1
docopt==0.6.2
docutils==0.16
ecdsa==0.16.1
email-validator==1.1.1
entrypoints==0.3
factory-boy==2.12.0
Faker==1.0.8
filelock==3.0.12
flake8==3.8.4
Flask==1.1.2
Flask-Admin==1.5.4
Flask-AppBuilder==2.3.4
Flask-Babel==1.0.0
Flask-Bcrypt==0.7.1
Flask-Caching==1.3.3
Flask-JWT-Extended==3.24.1
Flask-Login==0.4.1
Flask-OpenID==1.2.5
Flask-SQLAlchemy==2.4.3
flask-swagger==0.2.13
Flask-WTF==0.14.3
flower==0.9.4
freezegun==1.0.0
funcsigs==1.0.2
future==0.16.0
futures==3.1.1
google-api-core==1.14.2
google-api-python-client==1.7.11
google-auth==1.6.3
google-auth-httplib2==0.0.3
google-oauth==1.0.1
googleapis-common-protos==1.51.0
graphviz==0.14
gunicorn==19.10.0
hmsclient==0.1.1
httplib2==0.18.1
humanize==0.5.1
identify==1.5.11
idna==2.8
ijson==3.0.4
imagesize==1.2.0
importlib-metadata==1.7.0
iniconfig==1.0.0
ipykernel==5.1.4
ipython==7.19.0
ipython-genutils==0.2.0
iso8601==0.1.12
isort==4.3.21
itsdangerous==1.1.0
JayDeBeApi==1.2.1
jedi==0.18.0
Jinja2==2.10.3
jmespath==0.10.0
joblib==1.0.0
JPype1==0.7.5
json-merge-patch==0.2
jsondiff==1.1.1
jsonpickle==1.4.2
jsonschema==3.2.0
jupyter-client==6.1.11
jupyter-core==4.7.0
kiwisolver==1.3.1
kombu==4.6.11
lazy-object-proxy==1.4.1
lockfile==0.12.2
m2r==0.2.1
Mako==1.1.3
Markdown==2.6.11
MarkupSafe==1.1.1
marshmallow==2.21.0
marshmallow-enum==1.5.1
marshmallow-sqlalchemy==0.23.1
matplotlib==3.1.2
mccabe==0.6.1
mirakuru==2.3.0
mistune==0.8.4
mock==4.0.3
moto==1.3.4
mypy-extensions==0.4.3
mysqlclient==1.4.4
natsort==7.0.1
nbclient==0.5.1
nbformat==5.0.8
nest-asyncio==1.4.3
nodeenv==1.5.0
numpy==1.18.4
packaging==20.8
pandas==0.25.3
papermill==2.0.0
paramiko==2.7.1
parso==0.8.1
pathspec==0.8.1
pbr==5.5.1
pendulum==1.4.4
pexpect==4.8.0
pickleshare==0.7.5
pluggy==0.13.1
port-for==0.4
pre-commit==2.9.3
prison==0.1.3
prompt-toolkit==3.0.10
protobuf==3.12.2
psutil==5.7.0
ptyprocess==0.7.0
py==1.10.0
py4j==0.10.7
pyaml==20.4.0
pyarrow==0.17.1
pyasn1==0.4.8
pyasn1-modules==0.2.8
PyAthena==1.6.1
pycodestyle==2.6.0
pycparser==2.20
pycryptodome==3.9.9
pycryptodomex==3.9.7
pyflakes==2.2.0
Pygments==2.6.1
PyHive==0.6.2
PyJWT==1.7.1
pykwalify==1.7.0
pylint==2.3.1
pymongo==3.8.0
PyNaCl==1.4.0
pyOpenSSL==19.1.0
pyparsing==2.4.7
pyrsistent==0.16.0
pysftp==0.2.9
pytest==6.2.1
pytest-freezegun==0.4.2
pytest-mongo==1.2.1
pytest-mysql==2.0.1
pytest-sftpserver==1.3.0
python-daemon==2.1.2
python-dateutil==2.8.1
python-decouple==3.4
python-editor==1.0.4
python-jose==2.0.2
python3-openid==3.1.0
pytz==2020.1
pytzdata==2019.3
PyYAML==5.1.1
pyzmq==20.0.0
redis==3.5.3
regex==2020.11.13
requests==2.23.0
requests-mock==1.7.0
responses==0.12.1
restructuredtext-lint==1.3.2
rsa==4.0
s3transfer==0.1.13
sasl==0.2.1
scikit-learn==0.22.1
scipy==1.6.0
setproctitle==1.1.10
simple-salesforce==1.10.1
six==1.15.0
slackclient==1.3.0
snowballstemmer==2.0.0
snowflake-connector-python==2.0.0
Sphinx==3.3.1
sphinx-rtd-theme==0.5.0
sphinxcontrib-applehelp==1.0.2
sphinxcontrib-confluencebuilder==1.3.0
sphinxcontrib-devhelp==1.0.2
sphinxcontrib-htmlhelp==1.0.3
sphinxcontrib-jsmath==1.0.1
sphinxcontrib-qthelp==1.0.3
sphinxcontrib-serializinghtml==1.1.4
SQLAlchemy==1.3.17
SQLAlchemy-JSONField==0.9.0
SQLAlchemy-Utils==0.36.6
sqlparse==0.3.0
sshtunnel==0.1.5
stevedore==3.3.0
tabulate==0.8.7
tenacity==4.12.0
termcolor==1.1.0
text-unidecode==1.2
textwrap3==0.9.2
thrift==0.13.0
thrift-sasl==0.4.2
toml==0.10.1
tornado==5.1.1
tqdm==4.55.1
traitlets==5.0.5
typed-ast==1.4.0
typing-extensions==3.7.4.2
tzlocal==1.5.1
unicodecsv==0.14.1
uritemplate==3.0.1
urllib3==1.25.10
vine==1.3.0
virtualenv==20.2.2
wcwidth==0.2.5
websocket-client==0.54.0
Werkzeug==0.16.0
wrapt==1.11.1
WTForms==2.2.1
xmltodict==0.12.0
zdesk==2.7.1
zipp==3.1.0
zope.deprecation==4.4.0

I'm not a databricks engineer so I'm not familiar with their internals but my guess is that databricks uses a fork of hive and made a frivolous change to the table description output header for the partitions.

In hive, the value is hardcoded here: https://github.com/apache/hive/blob/fc2d47f85e03e9f1a2f79df34b826640062bbf6d/ql/src/java/org/apache/hadoop/hive/ql/ddl/table/info/desc/formatter/TextDescTableFormatter.java#L146

I think we can just override the get_columns method in the DatabricksDialect here to to read:

            if col_name in ('# Partition Information', '# Partitioning'):

which would make it work for both Databricks' hive and vanilla hive.

I need to make a few more changes and will provide a new release later today.