SNOW-1503887: Snowflake Driver Stuck opening Session

Please answer these questions before submitting your issue.
In order to accurately debug the issue this information is required. Thanks!

What version of JDBC driver are you using?
3.16.1
What operating system and processor architecture are you using?
ubuntu jammy arm64
What version of Java are you using?
21.0.1

I can't provide the exact code, but I'm simply trying to open a jdbc driver connection to snowflake. This code usually works 99% of the time, but in the last ~week we have been seeing intermittent issues with the snowflake driver hanging indefinitely (e.g. 10+ hours) attempting to open a new session. The same code for the same customer will work if you try it in another process 5 minutes later, so we haven't been able to reproduce locally, as there's some sort of non determinism happening here. It primarily seems to be happing on accounts where the login code produces a HTTP status=513 error. Our code usually hits the default login timeout window, but sometimes it just hangs forever on the following stacktrace.

The stack trace of the hanging code is:

"listener-worker-231" #16196 [16823] prio=5 os_prio=0 cpu=198.77ms elapsed=26212.65s tid=0x0000ffffa5c1a600 nid=16823 waiting on condition  [0x0000fffe6e7fc000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
	at java.lang.Thread.sleep0(java.base@21.0.1/Native Method)
	at java.lang.Thread.sleep(java.base@21.0.1/Unknown Source)
	at net.snowflake.client.jdbc.RestRequest.execute(RestRequest.java:437)
	at net.snowflake.client.jdbc.RestRequest.execute(RestRequest.java:72)
	at net.snowflake.client.core.HttpUtil.executeRequestInternal(HttpUtil.java:735)
	at net.snowflake.client.core.HttpUtil.executeRequest(HttpUtil.java:668)
	at net.snowflake.client.core.HttpUtil.executeGeneralRequest(HttpUtil.java:585)
	at net.snowflake.client.core.SessionUtil.newSession(SessionUtil.java:645)
	at net.snowflake.client.core.SessionUtil.openSession(SessionUtil.java:310)
	at net.snowflake.client.core.SFSession.open(SFSession.java:630)
	- locked <0x0000000318000288> (a net.snowflake.client.core.SFSession)
	at net.snowflake.client.jdbc.DefaultSFConnectionHandler.initialize(DefaultSFConnectionHandler.java:125)
	at net.snowflake.client.jdbc.DefaultSFConnectionHandler.initializeConnection(DefaultSFConnectionHandler.java:98)
	at net.snowflake.client.jdbc.SnowflakeConnectionV1.initConnectionWithImpl(SnowflakeConnectionV1.java:142)
	at net.snowflake.client.jdbc.SnowflakeConnectionV1.<init>(SnowflakeConnectionV1.java:122)
	at net.snowflake.client.jdbc.SnowflakeDriver.connect(SnowflakeDriver.java:214)
	at com.zaxxer.hikari.util.DriverDataSource.getConnection(DriverDataSource.java:121)
	at com.zaxxer.hikari.pool.PoolBase.newConnection(PoolBase.java:359)
	at com.zaxxer.hikari.pool.PoolBase.newPoolEntry(PoolBase.java:201)
	at com.zaxxer.hikari.pool.HikariPool.createPoolEntry(HikariPool.java:470)
	at com.zaxxer.hikari.pool.HikariPool.checkFailFast(HikariPool.java:561)
	at com.zaxxer.hikari.pool.HikariPool.<init>(HikariPool.java:100)
	at com.zaxxer.hikari.HikariDataSource.<init>(HikariDataSource.java:81)

I've been looking at and stepping through the underlying code, and believe I see a few potential issues and reason for this:

take this block below from RestRequest that has the sleep that is guarding the hang.

if (backoffInMilli > elapsedMilliForLastCall) {
          try {
            logger.debug("sleeping in {}(ms)", backoffInMilli);
            Thread.sleep(backoffInMilli);
            elapsedMilliForTransientIssues += backoffInMilli;
            if (isLoginRequest) {
              long jitteredBackoffInMilli = backoff.getJitterForLogin(backoffInMilli);
              backoffInMilli =
                  (long)
                      backoff.chooseRandom(
                          jitteredBackoffInMilli + backoffInMilli,
                          Math.pow(2, retryCount) + jitteredBackoffInMilli);

When I test locally, I see that we pretty quickly reach a state where our backoff is negative, since we compute the backoff relative to the backoff itself and not a static base value. This means if you get unlucky from backoff.chooseRandom, we could get a bunch of negative jitters, and then quickly end up with a negative or very low backOff value which causes the loop to spin forever until the top level timeout login is reached.

The other issue is long elapsedMilliForLastCall = System.currentTimeMillis() - startTimePerRequest;. Since System.currentTimeMillis() is not monotonic (see this thread) it's possible that this call could actually jump back and cause elapsedMilliForLastCall to be negative or lower than it actually should be.

So given these issue above, I believe what is happening is that sometimes the backOffInMilli become a very low value like 2 millis, which is usually above elapsedMilliForLastCall. We skip the backoff part of the loop and quickly jack up the number of retries since we end up just making requests in a loop.

So if we have backoffInMilli = 1
and retryCount=60,

then and then get an unlucky tick on the last request and elapsedMilliForLastCall drops below 1,
we re-enter the loop.

When we hit Math.pow(2, retryCount) + jitteredBackoffInMilli) this line, we do 2^60, get an absolutely massive number, and then end up having our next request essentially sleep infinitely.

Proposed fixes:

I think there's a few approaches to fix.

The easiest is to just ensure a max value for backoffInMilli e.g. so that we never sleep more than say 3 minutes.
the second approach would be updating how elapsedMilliForLastCall is calculated to use System.nanoTime which is guaranteed to never jump negative.
fix the exponential backoff function to not be able to go down. The existing one seems buggy because it's relative to the old value and not a static base. E.g. I very often observe a negative backoff in my local testing, which doesn't seem useful.

I'm happy to PR fixes if this doesn't make sense, or discuss further.

Hey @tpoll,

Thanks for providing such a detailed explanation. A quick question for you, when you say you tested this locally, did you create a unit test that easily reproduces the issue, or was this just merely running a Java application using Hikari that requests several connections at a time? I'm just trying to figure out the easiest way to reproduce this if possible.

I was merely running an application locally to connect to a snowflake account (that seemed itself to maybe not exist anymore) and stepping through the code.

I think to reproduce, you just need to attempt to sign in to an account where the login will always fail with a 500 style error and force a lot of retries.

Thanks for clarifying. I'm unlikely going to hit a 50x error if I leave it up to chance.

I'll see if I can work out something. There should be a way to mock this.

@tpoll so as a starting point, I wanted to see if I can reproduce this with the latest version 3.17.0 because there was a new retry strategy that was implemented recently. I ran a test where I sent requests through a proxy server and I intercepted the HTTP requests so that every single response returned to the JDBC driver had a status code of 503. The JDBC driver kept retrying until it eventually gave up with the error:

Exception in thread "main" net.snowflake.client.jdbc.SnowflakeSQLException: JDBC driver encountered communication error. Message: HTTP status=503.
	at net.snowflake.client.core.HttpUtil.executeRequestInternal(HttpUtil.java:843)
	at net.snowflake.client.core.HttpUtil.executeRequest(HttpUtil.java:737)
	at net.snowflake.client.core.HttpUtil.executeGeneralRequest(HttpUtil.java:652)
	at net.snowflake.client.core.SessionUtil.newSession(SessionUtil.java:662)
	at net.snowflake.client.core.SessionUtil.openSession(SessionUtil.java:311)
	at net.snowflake.client.core.SFSession.open(SFSession.java:637)
	at net.snowflake.client.jdbc.DefaultSFConnectionHandler.initialize(DefaultSFConnectionHandler.java:129)
	at net.snowflake.client.jdbc.DefaultSFConnectionHandler.initializeConnection(DefaultSFConnectionHandler.java:102)

While the issue you pointed out is worth looking into, I believe the impact is that you're going to have very short backoff times, or none at all (negative value). However, the code should be keeping track of the retry attempts and will eventually stop. It's unclear why in your case it's hanging indefinitely, there could be something else going on, but I did want to ask if it was possible for you to try your tests again locally with version 3.17.0 and let me know if you were able to reproduce the problem or not.

Update: I just tried with v3.16.1 and that was a lot more liberal with the retries, so it retried the login request 39 times but it eventually failed with the same error.

@tpoll I'll be interested to know what the outcome is on your end when testing this with v3.17.0.

Hey @sfc-gh-wfateem here is a gist of me trying locally https://gist.github.com/tpoll/953ecdb5dd141aa844675473924c0768, you can still see there are many very fast retries.

I believe the problem still exists in 3.17. It's non deterministic, so I haven't been able to repro the hanging locally with debug logging enabled, but my clients would always get stuck on this thread sleep when I took thread dumps of my production instances.

snowflake-jdbc/src/main/java/net/snowflake/client/jdbc/RestRequest.java

Line 498 in 823cecc

Thread.sleep(backoffInMilli);

I think the issue is specifically the rare set of conditions outlined in my issue description that causes this line to be hit

snowflake-jdbc/src/main/java/net/snowflake/client/jdbc/RestRequest.java

Line 506 in 823cecc

Math.pow(2, retryCount) + jitteredBackoffInMilli);

and then the code takes 2^~100s of the retry count.

I think the easiest solution would be to just update Thread.sleep(backoffInMilli); to be Thread.sleep(Math.max(backoffInMilli, maxBackoffInMilli);

@sfc-gh-wfateem I was actually able to repro locally by chance with debug logging and confirm the bug is what I suspected.

See https://gist.github.com/tpoll/6e1459978e99d28f46f423d75f4f3106

The final log lines show:

2024-07-10 18:08:01.848 [main] net.snowflake.client.jdbc.RestRequest: [requestId=b7e53f66-4262-4acb-903b-42b8f00914ab] Request POST REDACTED HTTP/1.1 guid: 7ebd421b-44cb-4ec1-9ec6-0dd775e7f2c0
2024-07-10 18:08:01.929 [main] net.snowflake.client.jdbc.RestRequest: [requestId=b7e53f66-4262-4acb-903b-42b8f00914ab] HTTP response not ok: status code: 513, request: POST REDACTED HTTP/1.1
2024-07-10 18:08:01.930 [main] net.snowflake.client.jdbc.RestRequest: [requestId=b7e53f66-4262-4acb-903b-42b8f00914ab] Retry request POST REDACTED HTTP/1.1: sleeping for 82 ms
2024-07-10 18:08:02.017 [main] net.snowflake.client.jdbc.RestRequest: [requestId=b7e53f66-4262-4acb-903b-42b8f00914ab] Retry count: 129, max retries: 0, retry timeout: 30 s, backoff: 9223372036854775807 ms. Attempting request: POST REDACTED HTTP/1.1
2024-07-10 18:08:02.017 [main] net.snowflake.client.jdbc.RestRequest: [requestId=b7e53f66-4262-4acb-903b-42b8f00914ab] Request POST REDACTED HTTP/1.1 guid: 80927eb4-6602-4a11-b80a-1beba80642bf
2024-07-10 18:08:02.102 [main] net.snowflake.client.jdbc.RestRequest: [requestId=b7e53f66-4262-4acb-903b-42b8f00914ab] HTTP response not ok: status code: 513, request: POST REDACTED HTTP/1.1
2024-07-10 18:08:02.102 [main] net.snowflake.client.jdbc.RestRequest: [requestId=b7e53f66-4262-4acb-903b-42b8f00914ab] Retry request POST REDACTED HTTP/1.1: sleeping for 9223372036854775807 ms

Sleeping for 9223372036854775807 ms

@tpoll Thanks for that. I'll discuss this with the team.

I was also able to reproduce the issue, just FYI.

@tpoll PR #1822 was created and is in review. Feel free to add any comments in the PR if you have any additional observations.

@tpoll the PR is merged and will be included in the next release.

Thanks!

@tpoll JDBC driver version 3.18.0 has been released with the fix.
I'll go ahead and close off the issue, but please reach out again if you're still seeing issues with this.