MeltanoLabs/tap-snowflake

Option to print SQL statements for debug/perf reasons

Opened this issue · 7 comments

@edgarrmondragon and @kgpayne - Is there any option that you know of to print the SQL statements (commands or SELECT queries) as they are executing?

I vaguely recall that SQLAlchemy supports something like this, and I think it could be helpful to enable it when we want to debug what commands are being sent and/or observe the performance of specific commands.

@aaronsteers afaik sqlalchemy emits logs for all queries with the sqlalchemy.engine.Engine logger, so it's a matter of setting the log level of that to INFO.
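For reference, a minimal sketch of two ways to turn this on (assuming the connector builds its engine through SQLAlchemy's create_engine; the connection URL is just a placeholder):

```python
# Minimal sketch: two ways to surface the SQL that SQLAlchemy emits.
import logging

import sqlalchemy

# Option 1: raise the engine logger to INFO so every statement is logged
# (DEBUG also logs result rows).
logging.getLogger("sqlalchemy.engine").setLevel(logging.INFO)

# Option 2: pass echo=True when the engine is created, which configures the
# same logger for you. The URL below is a placeholder, not real credentials.
engine = sqlalchemy.create_engine(
    "snowflake://<user>:<password>@<account>/<database>",
    echo=True,
)
```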

@edgarrmondragon - Great, thanks! Turns out I didn't need it in this pass. At first I was getting weird results because discovery ran first and took a minute on its own - but I mitigated that by running meltano select EXTRACTOR --list between executions to make sure the catalog was up to date before running the 'real' sync operation.

It would probably still be nice to have a docs entry or env var that can toggle this on or off - but I don't think it's very high priority as of now.

@aaronsteers @edgarrmondragon I ran into performance issues with SQLAlchemy inspect and Snowflake during testing today too. TL;DR: our default implementation of catalog discovery assumes it is cheap to query db metadata, which in Snowflake it really isn't.

The Snowflake dialect is implemented using a mix of show <schema/tables> in <database/schema> and select ... from information_schema where table_schema=<schema>, and in general it steps sequentially through discovered entities, fetching metadata for each one. This results in many unnecessary calls to the same information_schema tables, filtered to individual schema/table names. We would be much better off reading whole information_schema tables once and filtering on the client side 🤷‍♂️
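As a rough sketch of that client-side approach (not the dialect's current behavior; the connection URL and exact query are illustrative), the whole information_schema.columns table could be read once and grouped locally:

```python
# Rough sketch: fetch column metadata for the whole database in one query and
# group it client-side, instead of one information_schema query per table.
from collections import defaultdict

import sqlalchemy

# Placeholder connection URL for the snowflake-sqlalchemy dialect.
engine = sqlalchemy.create_engine("snowflake://<user>:<password>@<account>/<database>")

columns_by_table = defaultdict(list)
with engine.connect() as conn:
    rows = conn.execute(
        sqlalchemy.text(
            "SELECT table_schema, table_name, column_name, data_type "
            "FROM information_schema.columns "
            "ORDER BY table_schema, table_name, ordinal_position"
        )
    )
    for schema, table, column, data_type in rows:
        columns_by_table[(schema, table)].append((column, data_type))
```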

Also, during pytest testing, we set stream._MAX_RECORDS_LIMIT = 1, however this is not pushed down into the SQL SELECT statements (as far as I can see). This means the entire record set has to be compiled by the server and sent to the client (taking time and warehouse resources) for nothing 🤦‍♂️ I was running tests against SNOWFLAKE_SAMPLE_DATA and watching the XS warehouse try to chug through the largest table. It would have taken hours, if not days.

Will create issues for both 👍

TL;DR: our default implementation of catalog discovery assumes it is cheap to query db metadata, which in Snowflake it really isn't.

@kgpayne https://github.com/transferwise/pipelinewise-tap-snowflake gets around that by having a tables setting to pre-select tables for discovery.

Also, during pytest testing, we set stream._MAX_RECORDS_LIMIT = 1, however this is not pushed down into the SQL SELECT statements

Yeah, it makes sense that the query isn't referencing stream._MAX_RECORDS_LIMIT as of now. Referencing it in the SDK with (hopefully) a generic SQLAlchemy LIMIT n implementation would totally make sense.
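Something along these lines (a hypothetical sketch; build_query is an illustrative name, not an existing SDK method) is the idea, since SQLAlchemy's .limit() compiles to the dialect-appropriate LIMIT clause:

```python
# Hypothetical sketch of pushing the record limit into the generated SELECT.
import sqlalchemy


def build_query(table: sqlalchemy.Table, max_records: int | None = None):
    query = sqlalchemy.select(table)
    if max_records is not None:
        # Compiles to LIMIT n on Snowflake, so the server only materializes
        # max_records rows instead of the whole table.
        query = query.limit(max_records)
    return query
```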

TL;DR: our default implementation of catalog discovery assumes it is cheap to query db metadata, which in Snowflake it really isn't.

Definitely worth raising and worth logging. Thanks for calling this out. Can you say a bit more about your execution times for discovery? I saw discovery finish in 1.5 minutes on the Snowflake sample DB, which is not great but doesn't seem horrible. I was also running from Codespaces, so round-trip network latency may have been significantly better than running on home WiFi.

Doing some performance testing using the query history in the Snowflake UI, my first discovery query arrives at 11:53:38 AM and the first stream starts to load at 11:56:58 AM. So I guess performance isn't so terrible, and I agree it probably isn't a priority in this iteration. 3 minutes still feels like a long time for the terminal to hang for, and there is also inflated cost associated with constantly hitting the metadata tables, which we could look to improve in future 🙂

Doing some performance testing using the query history in the Snowflake UI, my first discovery query arrives at 11:53:38 AM and the first stream starts to load at 11:56:58 AM. So I guess performance isn't so terrible, and I agree it probably isn't a priority in this iteration. 3 minutes still feels like a long time for the terminal to hang for, and there is also inflated cost associated with constantly hitting the metadata tables, which we could look to improve in future 🙂

Agreed, and thanks for the additional detail. Having waited upwards of 10-15 minutes in the past for Salesforce to run discovery, I'm definitely tolerant of a 3-minute runtime for now, but I agree it makes sense to come back and tune this.

This Meltano-side issue would actually be a big help here, since visibility/observability is part of the problem:

On thinking through this a bit more... does the SDK print an INFO-level log when discovery is starting and finishing? (I don't think so.) If not, that's probably another quick win that wouldn't require a Meltano-side update.
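A minimal sketch of what that could look like as a tap-side override, assuming singer_sdk.SQLTap and its discover_streams/logger attributes (the real tap also defines a stream class and config schema, omitted here):

```python
# Hypothetical sketch: log when discovery starts and finishes. Not current SDK
# behavior; shown as a subclass override rather than an SDK change.
import time

from singer_sdk import SQLTap


class TapSnowflake(SQLTap):
    name = "tap-snowflake"
    # default_stream_class, config_jsonschema, etc. omitted for brevity.

    def discover_streams(self):
        self.logger.info("Starting catalog discovery...")
        start = time.monotonic()
        streams = super().discover_streams()
        self.logger.info(
            "Catalog discovery finished in %.1fs (%d streams).",
            time.monotonic() - start,
            len(streams),
        )
        return streams
```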