DateTime64(ns) KeyError, when ns is set in table definition
Opened this issue · 4 comments
Hey, thanks for your great work, this module has been really helpful!
using clickhouse-server version: 20.3.8.53
I have a small problem with DateTime64(ns) where the size of ns has been explicitly set in the table definition.
With a clickhouse table as follows:
CREATE TABLE db_name.tbl_name (
Timestamp DateTime64(6) CODEC(Delta(8), LZ4),
SomeData Float32 CODEC(LZ4)
)
ENGINE = MergeTree() PARTITION BY toYYYYMMDD(Timestamp)
ORDER BY intHash64(toInt64(Timestamp))
SAMPLE BY intHash64(toInt64(Timestamp))
A query:
query = "SELECT DISTINCT * FROM db_name.tbl_name"
connection = {
'host': server_name,
'database': db_name,
'user': 'default'
}
pandahouse.read_clickhouse(query=query, index=False, connection=connection)
Results in:
pandahouse/core.py in read_clickhouse(query, tables, index, connection, **kwargs)
56 lines = execute(query, external=external, stream=True,
57 connection=connection)
---> 58 return to_dataframe(lines, **kwargs)
59
60
pandahouse/convert.py in to_dataframe(lines, **kwargs)
65 dtypes, parse_dates, converters = {}, [], {}
66 for name, chtype in zip(names, types):
---> 67 dtype = CH2PD[chtype]
68 if dtype == 'object':
69 converters[name] = decode_escapes
KeyError: 'DateTime64(6)
If I understand correctly, the mapping defined earlier in the file convert.py
MAPPING = {'object': 'String',
'uint64': 'UInt64',
'uint32': 'UInt32',
'uint16': 'UInt16',
'uint8': 'UInt8',
'float64': 'Float64',
'float32': 'Float32',
'int64': 'Int64',
'int32': 'Int32',
'int16': 'Int16',
'int8': 'Int8',
'datetime64[D]': 'Date',
'datetime64[ns]': 'DateTime'}
does not cover the DateTime64(6) case. Or by extension any other DateTime(ns) case?
I would be happy to contribute a solution with a little guidance.
Thanks
not sure if this is acceptable, but a solution could be as simple as:
def to_dataframe(lines, **kwargs):
names = lines.readline().decode('utf-8').strip().split('\t')
types = lines.readline().decode('utf-8').strip().split('\t')
dtypes, parse_dates, converters = {}, [], {}
for name, chtype in zip(names, types):
if chtype.startswith('DateTime64'):
precs = int(chtype.replace('DateTime64(', '').replace(')', ''))
chtype = 'DateTime'
dtype = CH2PD[chtype]
if dtype == 'object':
converters[name] = decode_escapes
elif dtype.startswith('datetime'):
parse_dates.append(name)
else:
dtypes[name] = dtype
return pd.read_table(lines, header=None, names=names, dtype=dtypes,
parse_dates=parse_dates, converters=converters,
na_values=set(), keep_default_na=False, **kwargs)
Thanks for the bug report and the patch.
I'm trying to allocate some time to create a new release in the upcoming week.
Could you please submit a PR including unittests?
My mid-term plan is to use the newly added arrow and parquet clickhouse output formats, but their type support is incomplete so far.