snowflakedb/snowflake-jdbc

SNOW-972142: The CLIENT_RESULT_CHUNK_SIZE parameter is not taken into account in different ways of executing the request

krocodl opened this issue · 4 comments

Driver net.snowflake:snowflake-jdbc:3.14.2

First of all we need to prepare a table with 10,000,000 rows, which can be done with the following SQL:

```sql
create or replace table long_table as
select row_number() over (order by null) as id
from table(generator(rowcount => 10000000));
```

After that we can implement a test that tries to download a large amount of simulated data with minimal memory usage:

```java
@Test
public void bigDataTest() throws Exception {
    int colSize = 1024;
    int colCount = 12;

    String colValue = "'" + "x".repeat(colSize) + "'";
    String query = "select id, " +
            String.join(", ", Collections.nCopies(colCount, colValue)) +
            " from long_table";

    try (Connection conn = getConnection()) {
        // Session parameters under test
        try (Statement statement = conn.createStatement()) {
            statement.execute("ALTER SESSION SET CLIENT_MEMORY_LIMIT=100");
            statement.execute("ALTER SESSION SET CLIENT_PREFETCH_THREADS=1");
            statement.execute("ALTER SESSION SET CLIENT_RESULT_CHUNK_SIZE=48");
            statement.execute("ALTER SESSION SET JDBC_QUERY_RESULT_FORMAT=JSON");
        }

        PreparedStatement ps = conn.prepareStatement(query,
                ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
        ps.setFetchSize(10);
        ps.setMaxRows(-1);
        ResultSet rs = ps.executeQuery();

        // Stream through the whole result set, logging progress every 1000 rows
        int count = 0;
        while (rs.next()) {
            if (count++ % 1_000 == 0) {
                LOG.info("Count = {}", count);
            }
        }
    }
}
```
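
For completeness: `getConnection()` and `LOG` are not shown in the snippet above. A minimal, hypothetical version of the connection helper (account, credentials, database, schema and warehouse are all placeholders) could look like this:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Properties;

// Hypothetical helper assumed by the test above; all values are placeholders.
private static Connection getConnection() throws Exception {
    Properties props = new Properties();
    props.put("user", "<user>");
    props.put("password", "<password>");
    props.put("db", "<database>");
    props.put("schema", "<schema>");
    props.put("warehouse", "<warehouse>");
    return DriverManager.getConnection(
            "jdbc:snowflake://<account>.snowflakecomputing.com", props);
}
```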

The number of chunks that will be downloaded can be observed in SFStatement#executeQueryInternal in the variable result. Its value is a JSON document whose "data.chunks" path contains the list of chunk URLs. The size of this list is what matters here: the more chunks there are, the smaller each chunk is and the less memory is used while retrieving the data. The following values were obtained with different values of ps.setMaxRows():

  • ps.setMaxRows(Integer.MAX_VALUE); => data.chunks.size = 469
  • setMaxRows not called at all (the call commented out) => data.chunks.size = 2453
  • ps.setMaxRows(0); => data.chunks.size = 2453
  • ps.setMaxRows(-1); => data.chunks.size = 469

In other words, with ps.setMaxRows(Integer.MAX_VALUE) and ps.setMaxRows(-1) the CLIENT_RESULT_CHUNK_SIZE configuration parameter is not taken into account at all when the number of chunks is calculated.
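
As a rough sanity check (my own back-of-envelope estimate, assuming the chunk limit is applied to the uncompressed row data), the observed chunk counts are consistent with CLIENT_RESULT_CHUNK_SIZE=48 only in the 2453-chunk case:

```java
// Back-of-envelope estimate (my own), not taken from the driver source.
public class ChunkSizeEstimate {
    public static void main(String[] args) {
        long rowBytes   = 12L * 1024;             // colCount * colSize ~= 12 KB per row
        long totalBytes = rowBytes * 10_000_000L; // ~120 GB of raw result data
        // ~47 MB per chunk -> roughly matches CLIENT_RESULT_CHUNK_SIZE=48
        System.out.println(totalBytes / 2453 / (1024 * 1024));
        // ~249 MB per chunk -> the 48 MB limit is clearly ignored
        System.out.println(totalBytes / 469 / (1024 * 1024));
    }
}
```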

The following graphs illustrate the actual memory consumption in the two fundamentally different cases (data.chunks.size = 2453 and data.chunks.size = 469):

(Two screenshots from 2023-11-15: memory consumption graphs for the two cases.)

In the second case we occasionally get an OOM because of chunk re-downloads. This is a separate problem: while an incorrectly received chunk is being re-requested, the original chunk is still held in memory, so actual consumption reaches roughly CLIENT_RESULT_CHUNK_SIZE * 3 instead of the documented CLIENT_RESULT_CHUNK_SIZE * 2.

  1. Without ALTER SESSION SET JDBC_QUERY_RESULT_FORMAT=JSON the behaviour is the same.
  2. The larger a single chunk is, the more likely it is to require several download attempts, which costs additional memory. It is therefore quite possible that reducing the chunk size will not increase the overall time needed to obtain the query result.
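
Based purely on the numbers above, a possible client-side mitigation (my own reading of the results, not a confirmed fix) is to leave maxRows at its default instead of passing Integer.MAX_VALUE or -1:

```java
// Sketch of a mitigation derived from the measurements above; reuses the
// conn and query variables from the test. Not calling setMaxRows() (or calling
// setMaxRows(0)) keeps the driver on the 2453-chunk / ~48 MB-per-chunk path,
// so CLIENT_RESULT_CHUNK_SIZE is honoured.
PreparedStatement ps = conn.prepareStatement(query,
        ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
ps.setFetchSize(10);
// ps.setMaxRows(Integer.MAX_VALUE);  // avoid: 469 chunks, ~250 MB each
// ps.setMaxRows(-1);                 // avoid: same 469-chunk behaviour
try (ResultSet rs = ps.executeQuery()) {
    while (rs.next()) {
        // consume rows
    }
}
```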

BTW, there is an error in the documentation:
CLIENT_MEMORY_LIMIT is not a global maximum memory limit for all queries. It is a property of the SnowflakeChunkDownloader, and each SnowflakeResultSetSerializableV1 (i.e. each query under execution) has its own instance of this class.
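
To illustrate the point (my own reading of the behaviour described above, not something stated in the docs), two result sets held open at the same time each get their own chunk downloader and therefore their own memory budget:

```java
// With CLIENT_MEMORY_LIMIT=100 (MB), each open result set below owns its own
// SnowflakeChunkDownloader, so peak usage can approach ~2 * 100 MB rather than
// 100 MB for the whole JVM (my interpretation, not documented behaviour).
try (Statement s1 = conn.createStatement();
     Statement s2 = conn.createStatement();
     ResultSet rs1 = s1.executeQuery("select * from long_table");
     ResultSet rs2 = s2.executeQuery("select * from long_table")) {
    while (rs1.next() && rs2.next()) {
        // interleaved consumption keeps both chunk downloaders active
    }
}
```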

@sfc-gh-wfateem could you please triage this issue?

Hi all - unfortunately this is a limitation across all of the Snowflake drivers and will likely need server-side changes too. The relevant (multiple) teams are aware, and scoping and prioritizing this issue is in progress. Unfortunately, no timeline can be promised at this point.

However, if you're already a Snowflake customer, please reach out to your Account Team and explain how important this change would be for you. This might give the effort some traction and help the involved teams prioritize. Thank you in advance!