microsoft/mssql-jdbc

issue with jdbc The connection is broken and recovery is not possible

sunilsankar opened this issue · 4 comments

I have been using the following jdbc connection string in our aks cluster to sql server. And the connection works but after idle I see this error connection is broken and recovery is not possible. Is there any keepalive setting which I can use to keep the session active similar to oracle jdbc (SO_KEEPALIVE=true)(TCP_KEEPIDLE=60)(TCP_KEEPINTVL=30)(TCP_KEEPCNT=15)). I am not able to find the same in microsoft documentation.

jdbc:sqlserver://xxxxx.database.windows.net;database=xxxxxdb;encrypt=true;trustServerCertificate=true;loginTimeout=120;authentication=ActiveDirectoryPassword
```

Hi thanks for your question.

I would like to understand more about what is happening. Can you provide more details about the failure scenario (e.g. is it happening during a long running job)? Also could you please provide logs that capture this error event? What version of the JDBC driver are you using? Which OS are you using and what is the Java version? Is Java connection pooling being used?

The connection recovery depends on specific unrecoverable scenarios which are documented here - https://learn.microsoft.com/en-us/sql/connect/jdbc/connection-resiliency?view=sql-server-ver16#connection-recovery. I suspect we will see one of these events in the logs.

The JDBC driver does not support a Keep Alive feature. However you can manually configure such a feature in the OS depending on your OS. Here are the instructions for Configuring the Keep Alive on Windows - https://learn.microsoft.com/en-us/sql/connect/jdbc/connecting-to-an-azure-sql-database?view=sql-server-ver16#connections-dropped

Please note, if you are setting the Keep Alive on the OS, it will be a global setting impacting all connections.

I look forward to hearing more about this issue.

Currently we are using mssql-11.2.0.jre11.jar . It is running in pod rhel8 image in AKS cluster. We are using Java connection pool . Noticed that this happens when it is idle and all of a sudden transaction takes place.

We also add the following settings connectRetryCount=10;connectRetryInterval=10

jdbc:sqlserver://xxxx-xxxx-dev.database.windows.net:1433;database=XXXXXX;encrypt=true;trustServerCertificate=true;loginTimeout=120;connectRetryCount=10;connectRetryInterval=10;authentication=ActiveDirectoryPassword

The keepalive setting in the pod is as follows

net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200

This is error we see in the log of the application

 [Database] SqlServer error: code = '0', SQL state = '': while committing: The connection is broken and recovery is not possible. The connection is marked by the client driver as unrecoverable. No attempt was made to restore the connection.
 0 | 2024-01-02 10:48:36 | XXXXX | DBG0000 | F | could not commit pending transactions: The connection is broken and recovery is not possible. The connection is marked by the client driver as unrecoverable. No attempt was made to restore the connection.

basically what I noticed is under idle condition the driver becomes unrecoverable. We need to restart to connect again to the database

I find it slightly confusing that you are reporting that the connection is idle when the issue happens. The error message indicates that there is an open transaction pending when the connection is lost. That would imply that there is an ongoing DB operation. Could you please look into what DB operations(s) is in the transaction.

The driver does have built-in keep-alive packets, but this feature is not configurable. If the keep-alive packets are enabled you will see the following in the logs:
"Setting KeepAlive extended socket options"
If keep-live packets are not available you will see the following in the logs:
"KeepAlive extended socket options not supported on this platform"

Assuming keep-alive packets in either the driver and\or the OS are available, that would mean that there is another process that is interrupting the connection. For example if you are using a Proxy connection when there is a 30 minute inactive connection the Azure Gateway would kill the connection (https://techcommunity.microsoft.com/t5/azure-database-support-blog/azure-sql-idle-sessions-are-killed-after-about-30-minutes/ba-p/3268601). At this point it appears that this is a likely explanation for this issue.

It's been some time since we last heard from you. As such we are going to close this ticket.

If you have any further questions, please feel free to reopen the ticket and we will be glad to help.