Bug: Unhandled errors in websockets can lead to panics
Closed this issue · 4 comments
Describe the bug
https://github.com/surrealdb/surrealdb.go/pull/118/files
Not much detail, but there is a code line. Apparently, when the error isnt handled correctly, the websocket connections can panic.
The TODO is going to be added to the codebase
Steps to reproduce
Unknown, may need to review code
Expected behaviour
No panics, errors handled correctly
SurrealDB version
1.0.0+20231130.a420c821.dirty for macos on aarch64
Contact Details
Is there an existing issue for this?
- I have searched the existing issues
Code of Conduct
- I agree to follow this project's Code of Conduct
With the NewTestSurrealDB
function in the PR #118, I can easily reproduce this error, by simply shutting down SurrealDB server and sleep for a second.
func TestSurrealDBPanic(t *testing.T) {
t.Parallel()
_, db, close := NewTestSurrealDB(t)
_ = db // db implicitly holds websocket connection, and can panic when SurrealDB is not accessible
close()
time.Sleep(1 * time.Second)
}
I agree with @ElecTwix -- when the connection is lost, it would be the best if the server can go into reconnection loop. I'd imagine there would be some adjustment needed for the read path as well, but we should first focus on server to automatically recover (in that sense panic needs to be addressed).
Hi everyone,
I have created a diagram for implementing reconnection and reauthentication.
What are your thoughts?
Thanks so much for this!
I agree that implicit retries are favourable. We need to be careful how we go about it though - we don't want indefinite loops, and we do want to fail-fast in many cases (such as deploying to a cluster where SurrealDB hasnt yet started or isn't reachable).
The pattern for this is called a circuit breaker. We should explore how circuit breakers are implemented in golang (I haven't used them myself, only in-house solutions). There could be frameworks that help with this.
The number of retries and timeout durations must be configurable. Some people may want to have no retries and short timeouts, for example.
Additionally, if the retries have failed, there cannot be a panic, and instead must be a checkable error code. Then users can handle logic on their side - reconnect, reconnect with new address, or terminate the client.
Regarding the diagram for the approach - I think this is great @ElecTwix thank you! It looks sensible. I would advise having the circuit breaker pattern in the db.go file. Configuration should be outside db.go, though. Db.go should use the configuration provided.