albertwidi/sqlt

Explore the possibility of adding a circuit breaker to SQLT

wendyadi opened this issue · 8 comments

@albert-widi

Is it possible to add a kind of circuit breaker in SQLT?

Circuit-breaking at the database layer is a bad idea in my opinion, but if some people want it, I can add it.

Why? For example, suppose we have a circuit breaker in both the handler and the database layer, and each of them has a different break time:
T1 - Circuit breaker for the database opens
T2 - Circuit breaker for the handler opens, because the database breaker is open
T3 - The database breaker is ready to close again, but no request can reach it because the handler breaker is still open
T4 - The handler breaker closes and requests start trying to access the database again

What happens if we add more variables into it? Say the database breaker stays open for 30 seconds but the handler breaker only for 10 seconds, etc.

Circuit-breaking is common in gateways and proxies, and it is used in applications for some cases. We might feel safe stacking more layers of error handling, but the drawbacks and problems created by it should be considered.

I am totally not against handling more errors gracefully and recovering, but this seems overkill and not thoroughly thought out. That said, I can add the option to sqlt if you want; these are just my concerns.

Right now we always have problems when latency between the app server and the db server goes up, or when the connection between them is cut. I was thinking about a circuit breaker on the database connection to overcome this, so that our app won't crash because too many connections are attempting to reach the database.

@albert-widi Do you have a better idea for this purpose? You mentioned the handler: do you mean we just limit connections in the handler so that we won't open too many?

This is an example of such a middleware: https://github.com/albert-widi/go_common/blob/master/router/router.go#L54 but it is not yet added as a component of the router.

If we have a circuit breaker in the handler, then (in my opinion) every circuit breaker behind the handler means nothing. This is because all requests to the service will be deflected first by the handler middleware (the top service layer).

A timeout should be introduced for all external dependencies. If timeouts happen, the circuit breaker should automatically open and reject requests.

As I said earlier, circuit breakers are heavily used in proxies and API gateways, which usually have a break-and-retry mechanism (CMIIW). The concept is convenient and effective for containing errors, but I don't want us to overuse it and end up applying it in the wrong places.

Thanks for your explanation Albert.
So in summary, a circuit breaker mechanism on the database connection is not needed.

Do you have advice on how to avoid "too many open files" when we get high latency or the network between app and db goes down? Is it enough to set a timeout on the db connection and limit concurrent connections to the db? And to implement a circuit breaker in the handler for incoming requests so that the max number of open files won't be reached?

Please enlighten us.

Sorry for my late response, I am still on vacation and will reply when available.

Yes, in summary: because of how the layers stack, we don't need any circuit breaker in the data layer.

To avoid too many open files you need a combination of circuit breaker and timeout/context cancellation. This has so far worked well in wallet when we had connection/service problems with wallet-oauth. By timeout I mean you need to set a timeout for your handler: don't rely on client cancellation, set your own timeout in the handler/router. This way you can also pass the context down to the data layer and cancel any operation when the cancellation in the handler/router is triggered.

This is important, and I think we need to standardize our use of the circuit breaker and timeout middleware.

Heads up @fabiantowang @wendyadi, I read an interesting blog post: https://istio.io/blog/istio-0.2-announcement.html. Istio now supports resilience features like circuit breakers. Please give me some time to evaluate and test whether circuit breakers are actually doable.

@albert-widi wow, you're really great and you have excellent dedication to Tokopedia and our engineering process.

Please explore and teach us how to improve our resiliency; we need to learn much more from you.