ClickHouse/metabase-clickhouse-driver

Map types are being scanned unnecessarily

nilaymk opened this issue ยท 13 comments

See thread here: https://discourse.metabase.com/t/scanning-of-map-types-in-clickhouse/23106

The issue I think is that the driver doesn't provide the MHTS baseType for clickhouse Map types. This makes Metabase assume it is a category type that should be with the following query which is very inefficient for colums with huge Maps.

SELECT `my_db`.`my_table`.`my_column` AS `my_column_ex_544d1323`
FROM `my_db`.`my_column`
GROUP BY `my_db`.`my_table`.`my_column`
ORDER BY `my_db`.`my_table`.`my_column` ASC LIMIT 1000
FORMAT TabSeparatedWithNamesAndTypes

Of course, for Map types it doesn't make sense as it is unlikely to be used for field filters directly.

So my proposal is to simple translate clickhouse Map type to MHTS types/Collection baseType

Would this work?

Yep, I'd like to add Map type support once we finish with the current bugs. See #98

Hi Folks any news on this, especially around disabling the scanning for Map and Array types. At this moment, it is a pain point for us, more so than a fully featured Map support.

Will my following proposal to stop Metabase from scanning Map and Array types work? (@slvrtrn I wasn't sure if you "Yep" meant it will work or just an acknowledgement of the overall issue around Map types.)

Translate Clickhouse Map and Array type to MHTS types/Collection baseType

@nilaymk
I will try to include the fix in the next release. Please stay tuned.

@nilaymk, do you think the Dictionary type will also work well in this case?
I expect we will have to use it anyway for #98.

If not, could you tell me how I can reliably trigger a query like the one reported in the OP, so I can verify that the Collection type works as intended?

I couldn't see it in my CH server logs when hitting "Sync database schema now" or "Re-scan fields values now", even using the driver without an explicit type mapping for Maps.

Regarding the arrays, is it the same issue? Currently, they have :type/Array mapping.
Does it also trigger unnecessary scans?

I clarified a bit about the field scanning.
If a field is not of type/Category, it should not be scanned.
I also explicitly set the Map type to type/Dictinary here: #150

Regarding the query:

SELECT `my_db`.`my_table`.`my_column` AS `my_column_ex_544d1323`
FROM `my_db`.`my_column`
GROUP BY `my_db`.`my_table`.`my_column`
ORDER BY `my_db`.`my_table`.`my_column` ASC LIMIT 1000
FORMAT TabSeparatedWithNamesAndTypes

@nilaymk, are you sure this is coming from the Metabase itself? my_column_ex_544d1323 looks very unusual for a Metabase alias.

Could you try 1.1.3 (Map base type is type/Dictionary there)?

Hi @slvrtrn , Thanks for looking into this. Sorry I am away for a few days. I'll try and respond asap.

I'll try the new and old plugin.

My Setup

Clickhouse: 22.7.1.2484

Database schema:

CREATE DATABASE nk;
CREATE TABLE nk.scan_tests (
  `_pk` Int64 DEFAULT now(),
  `date_val` DateTime,
  `int_val` Int64,
  `float_val` Float64, `string_val` String,
  `int_array` Array(Int64),
  `float_array` Array(Float64),
  `string_array` Array(String),
  `map_str_int` Map(String, Int64)
) ENGINE = MergeTree ORDER BY _pk SETTINGS index_granularity = 8192

Metabase: 0.45.3

Testing Method

  • Launch Metabase 0.45.3 with correct plugin version
  • In Metabase, create a database connection from Admin settings. Database connection settings are:
Database name: nk
Scan all databases: No
Scanning for Filter Values: Never
Periodically refingerprint tables: No
  • After connection is established, perform Discard saved field values and Re-scan field values now from the database connection page.
  • In Clickhouse, do SYSTEM FLUSH LOGS and check system.query_log with: select query from system.query_log ...

With metabase-clickhouse-driver v1.1.2

Re-scan field values now attempts to scan Map fields โŒ

Query log shows following queries:

-- Metabase SELECT `nk`.`scan_tests`.`_pk` AS `_pk` FROM `nk`.`scan_tests` GROUP BY `nk`.`scan_tests`.`_pk` ORDER BY `nk`.`scan_tests`.`_pk` ASC LIMIT 1000

-- Metabase SELECT `nk`.`scan_tests`.`string_val` AS `string_val` FROM `nk`.`scan_tests` GROUP BY `nk`.`scan_tests`.`string_val` ORDER BY `nk`.`scan_tests`.`string_val` ASC LIMIT 1000

-- Metabase SELECT `nk`.`scan_tests`.`map_str_int` AS `map_str_int` FROM `nk`.`scan_tests` GROUP BY `nk`.`scan_tests`.`map_str_int` ORDER BY `nk`.`scan_tests`.`map_str_int` ASC LIMIT 1000

-- Metabase SELECT `nk`.`scan_tests`.`int_val` AS `int_val` FROM `nk`.`scan_tests` GROUP BY `nk`.`scan_tests`.`int_val` ORDER BY `nk`.`scan_tests`.`int_val` ASC LIMIT 1000

So field types scanned are

Type Is Scanned Should Scan OK?
Int64 Yes Yes ๐Ÿ‘
String Yes Yes ๐Ÿ‘
Float64 No No ๐Ÿ‘
DateTime No No ๐Ÿ‘
Array(T) No No ๐Ÿ‘
Map(K, V) Yes No โŒ

With metabase-clickhouse-driver v1.1.3

Re-scan field values now attempts to scan Map fields โŒ

Query log shows following queries:

-- Metabase SELECT `nk`.`scan_tests`.`_pk` AS `_pk` FROM `nk`.`scan_tests` GROUP BY `nk`.`scan_tests`.`_pk` ORDER BY `nk`.`scan_tests`.`_pk` ASC LIMIT 1000

-- Metabase SELECT `nk`.`scan_tests`.`string_val` AS `string_val` FROM `nk`.`scan_tests` GROUP BY `nk`.`scan_tests`.`string_val` ORDER BY `nk`.`scan_tests`.`string_val` ASC LIMIT 1000

-- Metabase SELECT `nk`.`scan_tests`.`map_str_int` AS `map_str_int` FROM `nk`.`scan_tests` GROUP BY `nk`.`scan_tests`.`map_str_int` ORDER BY `nk`.`scan_tests`.`map_str_int` ASC LIMIT 1000

-- Metabase SELECT `nk`.`scan_tests`.`int_val` AS `int_val` FROM `nk`.`scan_tests` GROUP BY `nk`.`scan_tests`.`int_val` ORDER BY `nk`.`scan_tests`.`int_val` ASC LIMIT 1000

So field types scanned are

Type Is Scanned Should Scan OK?
Int64 Yes Yes ๐Ÿ‘
String Yes Yes ๐Ÿ‘
Float64 No No ๐Ÿ‘
DateTime No No ๐Ÿ‘
Array(T) No No ๐Ÿ‘
Map(K, V) Yes No โŒ

Finding

So it looks like in v1.1.3, the Map(K,V) columns (in my case map_str_int) are still being scanned
with query

-- Metabase SELECT `nk`.`scan_tests`.`map_str_int` AS `map_str_int` FROM `nk`.`scan_tests` GROUP BY `nk`.`scan_tests`.`map_str_int` ORDER BY `nk`.`scan_tests`.`map_str_int` ASC LIMIT 1000

And reporting Map type as type/Dictionary type (i.e. not type/Category) still didn't make any difference.

Is this something that the folks from Metabase can shed a light on perhaps?

@slvrtrn some of my quick experiments and findings above... I will do more when I'm back next week if needed.

@slvrtrn
A side note: I also saw the strange behaviour you came across. i.e. No queries being hit on the clickhouse server even when repeatedly trying "Scan database schema now" and "Re-scan field values now". For me the behaviour happened after I upgraded my clickhouse-metabase-driver and restarted the Metabase server.

I had to recreate my database connection in Metabase ๐Ÿค” again.

For schema scans you should see queries like:

select null as TABLE_CAT, t.database as TABLE_SCHEM, t.name as TABLE_NAME, case when t.engine like '%Log' then 'LOG TABLE' when t.engine in ('Buffer', 'Memory', 'Set') then 'MEMORY TABLE' when t.is_temporary != 0 then 'TEMPORARY TABLE' when t.engine like '%View' then 'VIEW' when t.engine = 'Dictionary' then 'DICTIONARY' when t.engine like 'Async%' or t.engine like 'System%' then 'SYSTEM TABLE' when empty(t.data_paths) then 'REMOTE TABLE' else 'TABLE' end as TABLE_TYPE, t.comment as REMARKS, null as TYPE_CAT, d.engine as TYPE_SCHEM, t.engine as TYPE_NAME, null as SELF_REFERENCING_COL_NAME, null as REF_GENERATION
from system.tables t inner join system.databases d on t.database = d.name
where t.database like 'nk' and t.name like '%' and TABLE_TYPE in ('TABLE','VIEW','FOREIGN TABLE','REMOTE TABLE','DICTIONARY','MATERIALIZED VIEW','MEMORY TABLE','LOG TABLE') order by t.database, t.name

select null as TABLE_CAT, database as TABLE_SCHEM, table as TABLE_NAME, name as COLUMN_NAME, toInt32(1111) as DATA_TYPE, type as TYPE_NAME, toInt32(0) as COLUMN_SIZE, 0 as BUFFER_LENGTH, cast(null as Nullable(Int32)) as DECIMAL_DIGITS, 10 as NUM_PREC_RADIX, toInt32(position(type, 'Nullable(') >= 1 ? 1 : 0) as NULLABLE, comment as REMARKS, default_expression as COLUMN_DEF, 0 as SQL_DATA_TYPE, 0 as SQL_DATETIME_SUB, cast(null as Nullable(Int32)) as CHAR_OCTET_LENGTH, position as ORDINAL_POSITION, position(type, 'Nullable(') >= 1 ? 'YES' : 'NO' as IS_NULLABLE, null as SCOPE_CATALOG, null as SCOPE_SCHEMA, null as SCOPE_TABLE, null as SOURCE_DATA_TYPE, 'NO' as IS_AUTOINCREMENT, 'NO' as IS_GENERATEDCOLUMN from system.columns where database like 'nk' and table like 'scan_tests' and name like '%'

And for Field value scans the queries look like:

-- Metabase SELECT `nk`.`scan_tests`.`_pk` AS `_pk` FROM `nk`.`scan_tests` GROUP BY `nk`.`scan_tests`.`_pk` ORDER BY `nk`.`scan_tests`.`_pk` ASC LIMIT 1000

-- Metabase SELECT `nk`.`scan_tests`.`string_val` AS `string_val` FROM `nk`.`scan_tests` GROUP BY `nk`.`scan_tests`.`string_val` ORDER BY `nk`.`scan_tests`.`string_val` ASC LIMIT 1000

Also the strangeness in column name my_column_ex_544d1323 in my original question could be because I had a lot of tables with my_column columns. Maybe Metabase gave it an alias to match its backend.

Hey @slvrtrn sorry for so many messages on this thread ๐Ÿ˜„

OK here's my latest finding ... Good and bad news ๐Ÿ˜ž

GOOD NEWS: Resolving Map(K, V) types to type/Dictionary WORKS !!!. I.e. the Map columns are no longer classified as type/Category columns and thus NO field value sync will happen. Yay!

However, the BAD NEWS it only works on new database connections (i.e. connections created with the v1.1.3 driver), or for Map columns created and synced after the new driver has been installed.

For database connections created with v1.1.2 and older drivers, where the Map columns that were classified as type/Category are still being treated as type/Category and Metabase doesn't seem to take the new type/Dictionary type into account even after repeated Schema Syncs.

So two workarounds exists:

  1. Delete and recreate the database connection. โš ๏ธ All the custom Data Model properties will be LOST โš ๏ธ
    OR
  2. Manually edit the field's property for all concerned fields in Metabase:
  • In Metabase, go to Admin > Data Model and choose the desired database and table.
  • Navigate to the desired columns and click on the โš™๏ธ icon to go to the field's settings
  • Set the Field Type to No semantic type AND the Filtering on this field to Plain input box

image

- This should stop field value scan for the field.

I think this issue can be closed, but it might beneficial for others if the above workaround is put in the release notes.

Thanks for the fix btw ๐Ÿ‘

@nilaymk, thanks for the testing! Glad that it worked in the end.
I will close the issue then and add these caveats to README later.