snowflakedb/snowflake-kafka-connector

Connector tasks stuck in RESTARTING state

Closed · 1 comment

We had a Snowflake outage today and several connector tasks failed. Normally we just restart the failed tasks once the outage is resolved, and that works. Today, however, the connector tasks are stuck in the RESTARTING state, and restarting the whole connector does not clear it either.

The only option left for me to try is to delete and re-create the connectors, but I want to avoid that because it would re-process all the data from Kafka. I'm keeping it as a last resort.
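For anyone following along, the status check and per-task restart described above can be sketched against the Kafka Connect REST API roughly as follows. This is a minimal sketch, not what we actually ran: `CONNECT_URL` and the helper names (`unhealthy_tasks`, `fetch_status`, `restart_task`) are assumptions made for illustration, and it only uses the standard `GET /connectors/{name}/status` and `POST /connectors/{name}/tasks/{id}/restart` endpoints.

```python
import json
import urllib.request

# Assumption: the Kafka Connect REST endpoint; adjust for your deployment.
CONNECT_URL = "http://localhost:8083"


def unhealthy_tasks(status: dict) -> list:
    """Return (task id, state) for every task that is not RUNNING.

    `status` is the parsed JSON body of GET /connectors/{name}/status.
    """
    return [
        (task["id"], task["state"])
        for task in status.get("tasks", [])
        if task.get("state") != "RUNNING"
    ]


def fetch_status(name: str, timeout: float = 10.0) -> dict:
    """Fetch the connector status, failing fast instead of hanging."""
    url = f"{CONNECT_URL}/connectors/{name}/status"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def restart_task(name: str, task_id: int, timeout: float = 10.0) -> int:
    """Ask the worker to restart a single task; returns the HTTP status code."""
    url = f"{CONNECT_URL}/connectors/{name}/tasks/{task_id}/restart"
    req = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

Calling `unhealthy_tasks(fetch_status("my-connector"))` (with a hypothetical connector name) lists the tasks that are FAILED or stuck in RESTARTING, and each one can then be passed to `restart_task`. The explicit timeout matters here, since a stuck worker may accept the request and never answer.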

This is the common exception I'm seeing in the logs:

Apr 04, 2024 5:21:50 PM net.snowflake.client.core.HeartbeatBackground run
SEVERE: heartbeat error - message=!200062!
net.snowflake.client.jdbc.SnowflakeSQLException: !200062!
	at net.snowflake.client.jdbc.RestRequest.execute(RestRequest.java:402)
	at net.snowflake.client.jdbc.RestRequest.execute(RestRequest.java:66)
	at net.snowflake.client.core.HttpUtil.executeRequestInternal(HttpUtil.java:742)
	at net.snowflake.client.core.HttpUtil.executeRequest(HttpUtil.java:677)
	at net.snowflake.client.core.HttpUtil.executeGeneralRequest(HttpUtil.java:599)
	at net.snowflake.client.core.SFSession.heartbeat(SFSession.java:789)
	at net.snowflake.client.core.HeartbeatBackground.run(HeartbeatBackground.java:192)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)

Also, an HTTP request to update the connector config via curl -X PUT connectors/{NAME}/config is timing out. Other connectors on the same Kafka Connect instance are running fine.
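To make the timeout reproducible, the same PUT can be issued with an explicit client-side timeout so it fails fast rather than hanging. The following is only a sketch: the base URL, the connector name, and the helper names (`build_config_put`, `put_config`) are hypothetical, and the config dict stands in for whatever the connector actually uses.

```python
import json
import urllib.error
import urllib.request


def build_config_put(base_url: str, name: str, config: dict) -> urllib.request.Request:
    """Build the PUT /connectors/{name}/config request without sending it."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/connectors/{name}/config",
        data=json.dumps(config).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def put_config(base_url: str, name: str, config: dict, timeout: float = 15.0) -> str:
    """Send the config update and report the outcome as a short string."""
    req = build_config_put(base_url, name, config)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # The worker answered, but with an error status.
        return f"HTTP {exc.code}"
    except OSError as exc:
        # Covers connection failures and socket timeouts.
        return f"request failed: {exc}"
```

Comparing the result of `put_config(...)` with a plain GET on the same worker helps separate "the worker is unreachable" from "the worker is reachable but cannot complete write requests", which is the pattern seen here.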

[Edit] - I tried deleting a connector and re-creating it, but the request is still timing out. I can confirm that network communication is not a problem as other GET requests are working as expected.

Marking this as resolved - turns out it was an underlying k8s issue.