Map types are being scanned unnecessarily
nilaymk opened this issue · 13 comments
See thread here: https://discourse.metabase.com/t/scanning-of-map-types-in-clickhouse/23106
The issue, I think, is that the driver doesn't provide the right Metabase baseType for the ClickHouse Map type. This makes Metabase assume it is a category type that should be scanned with the following query, which is very inefficient for columns with huge Maps.
```sql
SELECT `my_db`.`my_table`.`my_column` AS `my_column_ex_544d1323`
FROM `my_db`.`my_table`
GROUP BY `my_db`.`my_table`.`my_column`
ORDER BY `my_db`.`my_table`.`my_column` ASC LIMIT 1000
FORMAT TabSeparatedWithNamesAndTypes
```
Of course, scanning doesn't make sense for Map types, as they are unlikely to be used in field filters directly.
So my proposal is to simply translate the ClickHouse Map type to the Metabase type/Collection baseType.
Would this work?
Yep, I'd like to add Map type support once we finish with the current bugs. See #98
Hi folks, any news on this, especially around disabling the scanning for Map and Array types? At the moment it is a pain point for us, more so than fully featured Map support.
Will my following proposal to stop Metabase from scanning Map and Array types work? (@slvrtrn I wasn't sure if your "Yep" meant it will work or was just an acknowledgement of the overall issue around Map types.)
Translate the ClickHouse Map and Array types to the Metabase type/Collection baseType.
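For illustration, the proposed mapping and its effect on scanning can be sketched in Python (a minimal model with hypothetical names; the real driver is written in Clojure, and Metabase's actual scan rule is more involved than this):

```python
# Hypothetical sketch of the proposed type mapping and its effect on
# field-value scanning. Names and the scan rule are illustrative only;
# the real metabase-clickhouse-driver is Clojure.
import re

# Proposed base-type mapping: parameterized Map/Array types resolve to a
# collection type that Metabase should not treat as a category.
BASE_TYPES = {
    "Int64": "type/BigInteger",
    "Float64": "type/Float",
    "DateTime": "type/DateTime",
    "String": "type/Text",
    "Array": "type/Collection",   # proposed
    "Map": "type/Collection",     # proposed
}

def base_type(clickhouse_type: str) -> str:
    # Strip type parameters, e.g. "Map(String, Int64)" -> "Map".
    root = re.match(r"\w+", clickhouse_type).group(0)
    return BASE_TYPES.get(root, "type/*")

def should_scan(clickhouse_type: str) -> bool:
    # Simplified stand-in for Metabase's behaviour observed in this thread:
    # only category-like (text/integer) fields get the GROUP BY value scan.
    return base_type(clickhouse_type) in {"type/Text", "type/BigInteger"}

print(should_scan("String"))              # True
print(should_scan("Map(String, Int64)"))  # False
```

Under this mapping, Map and Array columns would never be classified as category-like, so the expensive GROUP BY value scan would be skipped.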
@nilaymk, do you think the Dictionary type will also work well in this case? I expect we will have to use it anyway for #98.
If not, could you tell me how I can reliably trigger a query like the one reported in the OP, so I can verify that the Collection type works as intended? I couldn't see it in my CH server logs when hitting "Sync database schema now" or "Re-scan field values now", even using the driver without an explicit type mapping for Maps.
Regarding the arrays, is it the same issue? Currently, they have the :type/Array mapping. Does it also trigger unnecessary scans?
I clarified a bit about the field scanning: if a field is not of type/Category, it should not be scanned.
I also explicitly set the Map type to type/Dictionary here: #150
Regarding the query:
```sql
SELECT `my_db`.`my_table`.`my_column` AS `my_column_ex_544d1323`
FROM `my_db`.`my_table`
GROUP BY `my_db`.`my_table`.`my_column`
ORDER BY `my_db`.`my_table`.`my_column` ASC LIMIT 1000
FORMAT TabSeparatedWithNamesAndTypes
```
@nilaymk, are you sure this is coming from Metabase itself? my_column_ex_544d1323 looks very unusual for a Metabase alias.
Could you try 1.1.3? (The Map base type is type/Dictionary there.)
Hi @slvrtrn, thanks for looking into this. Sorry, I am away for a few days; I'll try to respond ASAP.
I'll test both the new and the old plugin.
My Setup
Clickhouse: 22.7.1.2484
Database schema:

```sql
CREATE DATABASE nk;

CREATE TABLE nk.scan_tests (
    `_pk` Int64 DEFAULT now(),
    `date_val` DateTime,
    `int_val` Int64,
    `float_val` Float64,
    `string_val` String,
    `int_array` Array(Int64),
    `float_array` Array(Float64),
    `string_array` Array(String),
    `map_str_int` Map(String, Int64)
) ENGINE = MergeTree ORDER BY _pk SETTINGS index_granularity = 8192;
```
Metabase: 0.45.3
Testing Method
- Launch Metabase 0.45.3 with the correct plugin version.
- In Metabase, create a database connection from the Admin settings, with:
  - Database name: `nk`
  - Scan all databases: No
  - Scanning for Filter Values: Never
  - Periodically refingerprint tables: No
- After the connection is established, perform `Discard saved field values` and `Re-scan field values now` from the database connection page.
- In ClickHouse, run `SYSTEM FLUSH LOGS` and check `system.query_log` with: `select query from system.query_log ...`
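As a small aid for the last step, the flushed query log can be sifted for the field-value scan queries programmatically. This is an ad-hoc Python sketch that matches the shape of Metabase's scan queries (the `-- Metabase` comment plus `GROUP BY`/`ORDER BY` on one column); it is not a general SQL parser:

```python
import re

# Rough, illustrative filter for field-value scan queries taken from
# `select query from system.query_log`. Matches only the specific shape
# Metabase uses for value scans.
SCAN_RE = re.compile(
    r"^-- Metabase SELECT .* GROUP BY .*`(?P<column>\w+)` ORDER BY",
    re.DOTALL,
)

def scanned_column(query: str):
    """Return the scanned column name, or None if not a value-scan query."""
    m = SCAN_RE.match(query.strip())
    return m.group("column") if m else None

q = ("-- Metabase SELECT `nk`.`scan_tests`.`map_str_int` AS `map_str_int` "
     "FROM `nk`.`scan_tests` GROUP BY `nk`.`scan_tests`.`map_str_int` "
     "ORDER BY `nk`.`scan_tests`.`map_str_int` ASC LIMIT 1000")
print(scanned_column(q))  # map_str_int
```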
With metabase-clickhouse-driver v1.1.2
Re-scan field values now attempts to scan Map fields ❌
The query log shows the following queries:

```sql
-- Metabase SELECT `nk`.`scan_tests`.`_pk` AS `_pk` FROM `nk`.`scan_tests` GROUP BY `nk`.`scan_tests`.`_pk` ORDER BY `nk`.`scan_tests`.`_pk` ASC LIMIT 1000
-- Metabase SELECT `nk`.`scan_tests`.`string_val` AS `string_val` FROM `nk`.`scan_tests` GROUP BY `nk`.`scan_tests`.`string_val` ORDER BY `nk`.`scan_tests`.`string_val` ASC LIMIT 1000
-- Metabase SELECT `nk`.`scan_tests`.`map_str_int` AS `map_str_int` FROM `nk`.`scan_tests` GROUP BY `nk`.`scan_tests`.`map_str_int` ORDER BY `nk`.`scan_tests`.`map_str_int` ASC LIMIT 1000
-- Metabase SELECT `nk`.`scan_tests`.`int_val` AS `int_val` FROM `nk`.`scan_tests` GROUP BY `nk`.`scan_tests`.`int_val` ORDER BY `nk`.`scan_tests`.`int_val` ASC LIMIT 1000
```

So the field types scanned are:

| Type | Is Scanned | Should Scan | OK? |
|---|---|---|---|
| Int64 | Yes | Yes | 👍 |
| String | Yes | Yes | 👍 |
| Float64 | No | No | 👍 |
| DateTime | No | No | 👍 |
| Array(T) | No | No | 👍 |
| Map(K, V) | Yes | No | ❌ |
With metabase-clickhouse-driver v1.1.3
Re-scan field values now attempts to scan Map fields ❌
The query log shows the following queries:

```sql
-- Metabase SELECT `nk`.`scan_tests`.`_pk` AS `_pk` FROM `nk`.`scan_tests` GROUP BY `nk`.`scan_tests`.`_pk` ORDER BY `nk`.`scan_tests`.`_pk` ASC LIMIT 1000
-- Metabase SELECT `nk`.`scan_tests`.`string_val` AS `string_val` FROM `nk`.`scan_tests` GROUP BY `nk`.`scan_tests`.`string_val` ORDER BY `nk`.`scan_tests`.`string_val` ASC LIMIT 1000
-- Metabase SELECT `nk`.`scan_tests`.`map_str_int` AS `map_str_int` FROM `nk`.`scan_tests` GROUP BY `nk`.`scan_tests`.`map_str_int` ORDER BY `nk`.`scan_tests`.`map_str_int` ASC LIMIT 1000
-- Metabase SELECT `nk`.`scan_tests`.`int_val` AS `int_val` FROM `nk`.`scan_tests` GROUP BY `nk`.`scan_tests`.`int_val` ORDER BY `nk`.`scan_tests`.`int_val` ASC LIMIT 1000
```

So the field types scanned are:

| Type | Is Scanned | Should Scan | OK? |
|---|---|---|---|
| Int64 | Yes | Yes | 👍 |
| String | Yes | Yes | 👍 |
| Float64 | No | No | 👍 |
| DateTime | No | No | 👍 |
| Array(T) | No | No | 👍 |
| Map(K, V) | Yes | No | ❌ |
Finding
So it looks like in v1.1.3, the Map(K, V) columns (in my case map_str_int) are still being scanned with the query:

```sql
-- Metabase SELECT `nk`.`scan_tests`.`map_str_int` AS `map_str_int` FROM `nk`.`scan_tests` GROUP BY `nk`.`scan_tests`.`map_str_int` ORDER BY `nk`.`scan_tests`.`map_str_int` ASC LIMIT 1000
```

And reporting the Map type as type/Dictionary (i.e. not type/Category) still didn't make any difference.
Is this something that the folks from Metabase can shed some light on, perhaps?
@slvrtrn some of my quick experiments and findings above... I will do more when I'm back next week if needed.
@slvrtrn
A side note: I also saw the strange behaviour you came across, i.e. no queries hitting the ClickHouse server even when repeatedly trying "Sync database schema now" and "Re-scan field values now". For me, this happened after I upgraded my metabase-clickhouse-driver and restarted the Metabase server.
I had to recreate my database connection in Metabase again.
For schema scans you should see queries like:

```sql
select null as TABLE_CAT, t.database as TABLE_SCHEM, t.name as TABLE_NAME, case when t.engine like '%Log' then 'LOG TABLE' when t.engine in ('Buffer', 'Memory', 'Set') then 'MEMORY TABLE' when t.is_temporary != 0 then 'TEMPORARY TABLE' when t.engine like '%View' then 'VIEW' when t.engine = 'Dictionary' then 'DICTIONARY' when t.engine like 'Async%' or t.engine like 'System%' then 'SYSTEM TABLE' when empty(t.data_paths) then 'REMOTE TABLE' else 'TABLE' end as TABLE_TYPE, t.comment as REMARKS, null as TYPE_CAT, d.engine as TYPE_SCHEM, t.engine as TYPE_NAME, null as SELF_REFERENCING_COL_NAME, null as REF_GENERATION
from system.tables t inner join system.databases d on t.database = d.name
where t.database like 'nk' and t.name like '%' and TABLE_TYPE in ('TABLE','VIEW','FOREIGN TABLE','REMOTE TABLE','DICTIONARY','MATERIALIZED VIEW','MEMORY TABLE','LOG TABLE') order by t.database, t.name

select null as TABLE_CAT, database as TABLE_SCHEM, table as TABLE_NAME, name as COLUMN_NAME, toInt32(1111) as DATA_TYPE, type as TYPE_NAME, toInt32(0) as COLUMN_SIZE, 0 as BUFFER_LENGTH, cast(null as Nullable(Int32)) as DECIMAL_DIGITS, 10 as NUM_PREC_RADIX, toInt32(position(type, 'Nullable(') >= 1 ? 1 : 0) as NULLABLE, comment as REMARKS, default_expression as COLUMN_DEF, 0 as SQL_DATA_TYPE, 0 as SQL_DATETIME_SUB, cast(null as Nullable(Int32)) as CHAR_OCTET_LENGTH, position as ORDINAL_POSITION, position(type, 'Nullable(') >= 1 ? 'YES' : 'NO' as IS_NULLABLE, null as SCOPE_CATALOG, null as SCOPE_SCHEMA, null as SCOPE_TABLE, null as SOURCE_DATA_TYPE, 'NO' as IS_AUTOINCREMENT, 'NO' as IS_GENERATEDCOLUMN from system.columns where database like 'nk' and table like 'scan_tests' and name like '%'
```
And for field value scans the queries look like:

```sql
-- Metabase SELECT `nk`.`scan_tests`.`_pk` AS `_pk` FROM `nk`.`scan_tests` GROUP BY `nk`.`scan_tests`.`_pk` ORDER BY `nk`.`scan_tests`.`_pk` ASC LIMIT 1000
-- Metabase SELECT `nk`.`scan_tests`.`string_val` AS `string_val` FROM `nk`.`scan_tests` GROUP BY `nk`.`scan_tests`.`string_val` ORDER BY `nk`.`scan_tests`.`string_val` ASC LIMIT 1000
```
Also, the strange column name my_column_ex_544d1323 in my original question could be because I had a lot of tables with my_column columns; maybe Metabase gave it an alias to disambiguate them in its backend.
Hey @slvrtrn, sorry for so many messages on this thread!
OK, here's my latest finding... good news and bad news.
GOOD NEWS: resolving Map(K, V) types to type/Dictionary WORKS! I.e. the Map columns are no longer classified as type/Category, and thus NO field value scan happens. Yay!
However, the BAD NEWS: it only works for new database connections (i.e. connections created with the v1.1.3 driver), or for Map columns created and synced after the new driver was installed.
For database connections created with v1.1.2 and older drivers, the Map columns that were classified as type/Category are still treated as type/Category, and Metabase doesn't seem to take the new type/Dictionary type into account even after repeated schema syncs.
So two workarounds exist:
- Delete and recreate the database connection. ⚠️ All the custom Data Model properties will be LOST ⚠️
- OR manually edit the field's properties for every affected field in Metabase:
  - In Metabase, go to `Admin > Data Model` and choose the desired database and table.
  - Navigate to the desired column and click the ⚙️ icon to open the field's settings.
  - Set the `Field Type` to `No semantic type` AND `Filtering on this field` to `Plain input box`.
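If many fields are affected, the second workaround can presumably be scripted via the Metabase REST API: `PUT /api/field/:id` with `semantic_type` cleared and `has_field_values` set to `"none"` should correspond to "No semantic type" and "Plain input box". This is an untested sketch; the URL, token handling, and payload keys should be verified against your Metabase version:

```python
import json
import urllib.request

# Untested sketch: clear the semantic type and disable field-value lists
# for one field via the Metabase REST API. Adjust URL and session token
# for your instance; obtain the token from POST /api/session.
METABASE_URL = "http://localhost:3000"  # assumed local instance
SESSION_TOKEN = "..."                   # placeholder

def plain_input_payload() -> dict:
    # "No semantic type" + "Plain input box" in the Data Model UI.
    return {"semantic_type": None, "has_field_values": "none"}

def fix_field(field_id: int) -> None:
    req = urllib.request.Request(
        f"{METABASE_URL}/api/field/{field_id}",
        data=json.dumps(plain_input_payload()).encode(),
        headers={
            "Content-Type": "application/json",
            "X-Metabase-Session": SESSION_TOKEN,
        },
        method="PUT",
    )
    urllib.request.urlopen(req)

# fix_field(42)  # call once per affected Map column (42 is a made-up id)
```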
I think this issue can be closed, but it might be beneficial for others if the above workaround is put in the release notes.
Thanks for the fix, btw!