surrealdb/surrealdb.go

Bug: Unhandled errors in websockets can lead to panics


Describe the bug

https://github.com/surrealdb/surrealdb.go/pull/118/files

Not much detail, but there is a code line. Apparently, when the error isn't handled correctly, the websocket connection can panic.

The TODO is going to be added to the codebase
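
For illustration, a minimal sketch of the shape of the fix, assuming the driver's read loop sits on top of gorilla/websocket (as mentioned later in the thread); the WebSocket struct and channel fields here are hypothetical. The point is that a read error should be surfaced to the caller rather than left to crash the goroutine:

package surrealdb

import "github.com/gorilla/websocket"

// WebSocket is a hypothetical stand-in for the driver's connection wrapper.
type WebSocket struct {
	conn  *websocket.Conn
	msgCh chan []byte
	errCh chan error
}

// readLoop forwards read errors to errCh instead of panicking, so the
// caller can decide whether to reconnect or shut down.
func (ws *WebSocket) readLoop() {
	for {
		_, data, err := ws.conn.ReadMessage()
		if err != nil {
			ws.errCh <- err
			return
		}
		ws.msgCh <- data
	}
}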

Steps to reproduce

Unknown, may need to review code

Expected behaviour

No panics, errors handled correctly

SurrealDB version

1.0.0+20231130.a420c821.dirty for macos on aarch64

Contact Details

hugh@surrealdb.com

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Hi @phughk, I could take this.
It's worth noting that some of Gorilla WS's errors cannot be recovered from, such as connection errors (a rough classification sketch follows below).
I was thinking of fixing the reconnection issue in the same PR as well, because it is the root cause.
What do you think?

reconnection issue: #110
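
For reference, gorilla/websocket exposes helpers to tell close errors apart from other failures; a rough sketch of the kind of check that could gate a reconnect attempt (the isRecoverable helper is hypothetical, not part of the driver):

package surrealdb

import (
	"errors"
	"net"

	"github.com/gorilla/websocket"
)

// isRecoverable is a hypothetical helper: decide whether a failed read is
// worth a reconnect attempt, or should be reported to the caller as fatal.
func isRecoverable(err error) bool {
	// Abnormal closes (e.g. the server went away) are candidates for reconnecting.
	if websocket.IsUnexpectedCloseError(err, websocket.CloseNormalClosure) {
		return true
	}
	// Transient network timeouts may also be retried.
	var netErr net.Error
	if errors.As(err, &netErr) && netErr.Timeout() {
		return true
	}
	return false
}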

With the NewTestSurrealDB function in PR #118, I can easily reproduce this error by simply shutting down the SurrealDB server and sleeping for a second.

package surrealdb

import (
	"testing"
	"time"
)

func TestSurrealDBPanic(t *testing.T) {
	t.Parallel()

	// NewTestSurrealDB is the helper added in PR #118.
	_, db, closeFn := NewTestSurrealDB(t)
	_ = db // db implicitly holds the websocket connection and can panic when SurrealDB is not accessible
	closeFn() // shut down the test SurrealDB instance while db still holds the connection

	// Give the background goroutine time to hit the closed connection and panic.
	time.Sleep(1 * time.Second)
}

I agree with @ElecTwix -- when the connection is lost, it would be best if the server connection could go into a reconnection loop. I'd imagine some adjustment would be needed for the read path as well, but we should first focus on recovering the server connection automatically (in that sense, the panic needs to be addressed).
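
A rough sketch of what that recovery loop could look like on the driver side, assuming a bounded retry budget; the dial parameter and the numbers are illustrative, not the driver's actual API:

package surrealdb

import (
	"fmt"
	"time"
)

// reconnect is a hypothetical recovery loop: call dial a bounded number of
// times with exponential backoff, then give up with an error instead of panicking.
func reconnect(dial func() error, maxAttempts int) error {
	backoff := 500 * time.Millisecond
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err := dial(); err == nil { // dial is assumed to re-establish the connection
			return nil
		}
		time.Sleep(backoff)
		backoff *= 2
	}
	return fmt.Errorf("reconnect failed after %d attempts", maxAttempts)
}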

Hi everyone,
I have created a diagram for implementing reconnection and reauthentication.
What are your thoughts?

[Diagram: Reconnect flow (full view omitted)]

Thanks so much for this!

I agree that implicit retries are favourable. We need to be careful how we go about it though - we don't want indefinite loops, and we do want to fail fast in many cases (such as deploying to a cluster where SurrealDB hasn't yet started or isn't reachable).

The pattern for this is called a circuit breaker. We should explore how circuit breakers are implemented in Go (I haven't used an off-the-shelf one myself, only in-house solutions). There could be frameworks that help with this.
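
To make the idea concrete, here is a minimal hand-rolled sketch of the pattern; an existing library such as sony/gobreaker could be used instead, this only illustrates the fail-fast behaviour:

package surrealdb

import (
	"errors"
	"sync"
	"time"
)

var errCircuitOpen = errors.New("circuit breaker is open")

// breaker is a minimal illustration of the circuit breaker pattern: after
// maxFailures consecutive failures, calls fail fast until cooldown has passed.
type breaker struct {
	mu          sync.Mutex
	failures    int
	maxFailures int
	cooldown    time.Duration
	openedAt    time.Time
}

func (b *breaker) call(fn func() error) error {
	b.mu.Lock()
	if b.failures >= b.maxFailures && time.Since(b.openedAt) < b.cooldown {
		b.mu.Unlock()
		return errCircuitOpen // fail fast instead of retrying indefinitely
	}
	b.mu.Unlock()

	err := fn()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		if b.failures >= b.maxFailures {
			b.openedAt = time.Now() // (re)open the circuit
		}
		return err
	}
	b.failures = 0 // a success closes the circuit again
	return nil
}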

The number of retries and timeout durations must be configurable. Some people may want to have no retries and short timeouts, for example.
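
For example, an options struct roughly along these lines (field names are illustrative, not the driver's API) would let callers disable retries entirely:

package surrealdb

import "time"

// ReconnectConfig is an illustrative shape for the knobs discussed above;
// zero values mean "no retries", matching the fail-fast preference.
type ReconnectConfig struct {
	MaxRetries     int           // 0 disables automatic reconnection
	InitialBackoff time.Duration // starting delay between attempts
	MaxBackoff     time.Duration // cap for exponential backoff
	DialTimeout    time.Duration // per-attempt timeout
}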

Additionally, if the retries have failed, there must not be a panic; instead, a checkable error must be returned. Users can then handle the logic on their side: reconnect, reconnect with a new address, or terminate the client.
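
A sentinel error would make that check straightforward on the caller's side; for instance (the names here are hypothetical, not existing exports of the driver):

package surrealdb

import "errors"

// ErrReconnectFailed is a hypothetical sentinel returned once the retry
// budget is exhausted, instead of panicking.
var ErrReconnectFailed = errors.New("surrealdb: reconnect failed")

// On the caller's side:
//
//	if err := doQuery(); errors.Is(err, ErrReconnectFailed) {
//		// reconnect, reconnect with a new address, or terminate the client
//	}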

Regarding the diagram for the approach - I think this is great, @ElecTwix, thank you! It looks sensible. I would advise keeping the circuit breaker pattern in the db.go file. Configuration should be defined outside db.go, though; db.go should use the configuration it is provided.