/TrueGrit

A data-driven, functionally-oriented, idiomatic Clojure library for circuit breakers, bulkheads, retries, rate limiters, timeouts, etc.

Primary LanguageClojureMIT LicenseMIT

README

clojars badge cljdoc badge

True Grit

For when you need a function that won’t give up at the first sign of failure.

True Grit

About

True Grit enables flexible responses to function failure. True Grit is a data-driven, functionally-oriented, idiomatic wrapper library for using Resilience4j, one of the top resilience Java libraries.

It offers:

  • timeouts - throw an exception if a function takes too long

  • rate limiters - limit the number of calls to a function per time period

  • retries - retry a function if it fails

  • circuit breakers - track function failure rates, and block calls to the fn if it fails too often

  • bulkheads - to reserve capacity and partition usage

The most common way to use True Grit is to wrap functions that call out to an underlying network service. With True Grit, you can do things like say "Give this network call 5s before it’s considered to time out, retry it up to 3 times in case of intermittent network failure, track how many calls fail overall, and if more than 50% do, temporarily halt the function so we don’t hammer the downstream service."

Start with the net.modulolotus.truegrit namespace, and see the individual policy namespaces if you have more advanced needs. It contains all-in-one functions that take a function and a config map, and return a wrapped function with the resilience policy attached. All wrapped functions return the same results as the original (except for thread-pool-based bulkheads, which return Futures).

Docs are available here.

True Grit is stable and largely done. The lack of recent commits does not mean it is abandoned. I am still actively maintaining it, but the only planned future work is for bug fixes and Resilience4j updates. Suggestions for feature requests welcome, but no promises.

Installation

Add the following to your Leiningen project.clj:

[net.modulolotus/truegrit "2.2.32"]

Add the following to your deps.edn:

net.modulolotus/truegrit {:mvn/version "2.2.32"}

Why should you use it?

Wrapper libraries are frequently more hassle than doing interop directly. But in Resilience4j’s case, while it’s an excellent library, it requires coordinating dozens of Java classes to get anything done. So, if you don’t feel like wrangling a bunch of Config/Builder/Registry/etc classes, True Grit is for you.

Dependencies

As of Resilience4j 2.x and thus, True Grit 2.x, the minimum Java version is 17. If you need support for earlier Java versions, stick with the 1.x versions of True Grit. (It has the exact same API and functionality.)

Documentation

Before using circuit breakers and bulkheads, be sure to understand how they operate. I highly recommend Release It! to understand the ways distributed systems can fail and how to compensate for them.

See:

Examples

Basic usage
(require '[net.modulolotus.truegrit :as tg])

(def resilient-fn
  (-> flaky-fn
      ;; Give each individual call up to 10s to complete
      (tg/with-time-limiter {:timeout-duration 10000})

      ;; Try up to 5 times, waiting 1s between failures
      ;; Will retry if an exception is thrown by default, but will also retry if
      ;; the return value is nil
      (tg/with-retry {:name            "my-retry"
                      :max-attempts    5
                      :wait-duration   1000
                      :retry-on-result nil?})

      ;; If it still fails after 5 tries, record it as a failure in the CB
      ;; CB will go into OPEN status if 20% of calls end up failures
      ;; CB will wait for at least 40 calls before considering a change in status,
      ;; giving it time to warm up.
      ;; Ignores UserCanceledExceptions, since if the user hit "Cancel", it's not a
      ;; problem in the underlying service
      (tg/with-circuit-breaker {:name                    "my-circuit-breaker"
                                :failure-rate-threshold  20
                                :minimum-number-of-calls 40
                                :ignore-exceptions       [UserCanceledException]})))
Use a shared circuit breaker to track an underlying service called by many fns
(require '[net.modulolotus.truegrit.circuit-breaker :as cb])

(def rest-service-cb (cb/circuit-breaker "shared-rest-service"
                                         {:failure-rate-threshold 30
                                          :minimum-number-of-calls 10}))

(def resil-get (cb/wrap flaky-get rest-service-cb))
(def resil-post (cb/wrap flaky-post rest-service-cb))
(def resil-put (cb/wrap flaky-put rest-service-cb))
(def resil-patch (cb/wrap flaky-patch rest-service-cb))
(def resil-delete (cb/wrap flaky-delete rest-service-cb))
Check circuit breaker to choose an alternative method if status is OPEN
(require '[net.modulolotus.truegrit.circuit-breaker :as cb])

(if (-> resilient-fn
        (cb/retrieve)           ; retrieve associated CircuitBreaker
        (cb/call-allowed?))     ; is a call allowed right now?
  (resilient-fn)                ; if so, make the call
  (some-fallback-fn))           ; if not, we can't wait, try a fallback
Use semaphore-based bulkheads to limit database access, keep 20% capacity in reserve, and log reserved metrics
(require '[net.modulolotus.truegrit.bulkhead :as bh])

(defn database-query-fn
  "Some database fn that we've determined can only handle 100 simultaneous queries"
  [user]
  ;; do some db stuff
  )

;; Make a default version that can use up to 80% of the database's capacity
(def default-database-query (tg/with-bulkhead database-query-fn
                                              {:name "default-db-bulkhead"
                                               :max-concurrent-calls 80}))

;; Make a version that reserves 20% for special needs
(def reserved-database-query (tg/with-bulkhead database-query-fn
                                               {:name "reserved-db-bulkhead"
                                                :max-concurrent-calls 20}))

;; Usage
(defn some-handler-fn
  [user]
  (if (user-is-special-somehow user)   ; Is the user a VIP, sysadmin, etc?
    (reserved-database-query user)     ; Make reserved call - the default bulkhead being full has no impact here
    (default-database-query user)))    ; Make standard call, blocking if unavailable

;; Log reserved bulkhead metrics every 10s
(future
  (loop []
    (-> reserved-database-query
        (bh/retrieve)
        (bh/metrics)
        (log/debug))
    (Thread/sleep 10000)
    (recur)))

Guidelines and Notes

Circuit breaker status shorthand

CLOSED is good, OPEN is bad. Think of electricity flowing.

Read up on bulkheads and circuit breakers before using them

Seriously.

Circuit breakers should never be created on-demand

Circuit breakers work by collecting data about a function’s success/failure rate over time. If you create a CB on the fly (like for an anonymous fn), but you only call that particular fn one time, the CB is useless. If you need to construct fns on the fly, but still track their overall success, you should create a CB ahead of time, and share it with all the anonymous fns by using cb/wrap.

Retries only make sense if there’s a reasonable expectation the fn will succeed within an acceptable time frame

They’re better-suited for temporary glitches in the matrix, not a service being down all day. If the fn doesn’t succeed in time, retries can make things worse, by adding to the downstream load, which is why pairing them with circuit breakers works well.

Be mindful of interactions at different levels of the system

E.g., wrapping a high-level fn with a retry policy of 3 attempts that calls an AWS client lower down that also has its own internal retry policy of 3 attempts can result in up to 3x3=9 calls under failure modes, exacerbating things.

Another common example is having multiple timeouts; it’s confusing and pointless, since the shortest timeout will trigger first.

You still need to handle errors

No amount of resilience policies can ensure a function will always succeed.

Order of wrapping matters

E.g.:

(-> my-fn
    (with-retry some-retry-config)
    (with-time-limiter some-timeout-config))

will retry several times, but if the time limit is up before the tries succeed, it will fail. This is probably not what you want. On the other hand:

(-> my-fn
    (with-time-limiter some-timeout-config)
    (with-retry some-retry-config))

will make calls with a certain time limit, and only if they return failure or exceed their time limit, will it attempt a retry. If you want a canonical "good" ordering, see the robustify example fn in the source.

Non-goals

The r4j cache module is currently unsupported, since many Clojure/Java caching libraries already exist. However, it could be included, if people are interested. Let me know if you want it, or better still, submit a patch.

Supporting all the Java frameworks that r4j interoperates with is also a non-goal for now.

Future directions

The r4j registries add virtually nothing over standard Clojure mutable containers, but the code I wrote for them still exists, so I could add them back if people really need them.

Metric module support may be added, if anyone expresses a need for it.


© 2024 Matthew Davidson