jlouis/fuse

Control the service restart after blown

posilva opened this issue · 35 comments

Hi,

I am using fuse to control access to a backend service (a DAL API). It would be nice to have a good way of passing from blown to ok gradually. Otherwise, if the backend is under load (502/503), for example, all the held-back requests come rushing back at the same time after the "heal" interval and can cause problems again.

Thank you,

(I will be able to implement a solution if you think that may be useful)

Pedro

Yes, this is a good idea, which I've also considered implementing.

The problem with its implementation is "how are you going to build a QuickCheck model for it?". You need to come up with a good way of describing what "gradually become ok" means, hopefully in a "deterministic" way. One way of doing so is to control the RNG from the model, so you can decide what the outcome of each RNG lookup is.

The other problem is how you are going to let only a few through. The fuse is an ETS table lookup, so if you flip that entry to 'ok' then the system will almost surely let far more than a few through. So you would need some kind of "{gradual, Pct}" entry for some percentage, with the RNG controlled by the model.

This, and also its cousin of manually being able to disable/reenable fuses, are probably two of the most needed features.

If you come up with a better scheme, I can try to figure out if I can build a QC model for that.

Ok, this is doable if we just control the RNG in the test cases, which is fairly easy.

What do you think the configuration should look like? I think there are a number of things here:

  • We need to say when we need a gradual ramp-up.
  • We need to decide what to return to the caller when we are in gradual mode.
  • We need to say what period we use for the gradual ramp-up, and how much to add over time. It lets a user say to gradually enable a service over 3 minutes, in 10 steps, say letting 0, 10, 20, 30, 40, ... percent of requests through. In practice the system would then ramp every 180 div 10 = 18 seconds until it is fully working again. Or you could simply give it a ramping scheme: [{6000, 5}, {15*1000, 15}, {80*1000, 80}, {300*1000, 100}] would ramp up the system at 6, 15, 80 and 300 seconds, with the amount given as the percentage (see the sketch below).
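For concreteness, here is a hedged sketch of what such a configuration could look like, reusing the existing fuse:install/2 shape. The {gradual_reset, Ramp} option and the dal_api fuse name are made up for illustration; nothing like this exists in fuse today.

    %% Hypothetical only: a {gradual_reset, Ramp} option is not part of fuse.
    %% The ramp is given as {MsAfterBlown, Percent} pairs, as proposed above.
    Ramp = [{6000, 5}, {15*1000, 15}, {80*1000, 80}, {300*1000, 100}],
    ok = fuse:install(dal_api, {{standard, 5, 10000}, {gradual_reset, Ramp}}).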

I'm pretty sure I could build a quickcheck model for this kind of system, since I can mock the RNG and control its outcome, so I can say what the system should do in the different cases. I could also improve the timing mocking for this.

More thoughts:

  • The percentages are floating point numbers between 0.0 and 1.0. This allows one to supply fractions easily: 1/512. And so on.
  • Having a way to put the system into a permanent gradual mode where a fuse is blown sometimes is a very good way of testing that your system supports the circuit breaker in the first place. This would be a really nice configuration option to have on a fuse when defining its melting point. And the patch you propose would make that support way easier.

A rough implementation plan for a QC model:

  • Go to a component based model for handling randomness and timing better.
  • Mock randomness
  • Mock timing, transplant jlouis/dht's timing component to this model.
  • Cluster
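To make the "mock randomness" step above concrete, one option is a small seam module that production code calls instead of rand directly, so the EQC cluster can swap it out and decide every draw deterministically. This is only a sketch under assumptions; the module name fuse_rand is made up here.

    %% Sketch of a randomness seam (the module name is an assumption).
    %% Production code calls fuse_rand:uniform/0; the QuickCheck model can
    %% replace this module and fix the outcome of each draw.
    -module(fuse_rand).
    -export([uniform/0]).

    uniform() ->
        rand:uniform().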

A first implementation should probably support a new type of fuse, {fault_rate, Rate, Intensity, Period}, which on average fault-injects one in every 1/Rate requests. This can verify that the above model is in place and works.
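As a sketch of how installing such a fuse might look (the fault_rate tuple follows the proposal above and is not part of the current API; the fuse name is made up):

    %% Hypothetical: a fuse that fault-injects roughly one request in 512,
    %% on top of the usual intensity/period melt accounting and a 60 s reset.
    Rate = 1 / 512,
    ok = fuse:install(dal_api, {{fault_rate, Rate, 5, 10000}, {reset, 60000}}).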

Once you have support for this, it should be easy to add gradual ramping to the system.

The price to pay is the parallel invocation models: they cannot be handled by such a system, so we would have to keep a separate parallel model around for this change.

Hi,

I am not familiar with QuickCheck, but now I have a good chance to learn about it. As soon as I have a model designed and something to show, I will let you know.

We already have most of the model in test/fuse_eqc.erl, so that is a starting point. It needs to use component based models however, to handle what I'm suggesting above.

Yeah, we've discussed something similar w/ our use of Fuse to handle Solr (and other third-party systems) issues (w/ solr_cores) under load. Being able to gradually pass from blown->ok would be a better model of how we expect our fuse-wrapped operations to eventually resolve. I'd be down for reviewing and/or helping w/ QC if there are questions, too, when I'm back around next week.

One important observation is that a standard fuse with a reset of 60*1000 would be a gradual fuse with [{60*1000, 1.0}], i.e., it goes to the maximal rate in one step. This means we can handle standard fuses as a special case of gradual fuses, which collapses a lot of the code base.

@jlouis yep... that observation makes 100% sense to me :).

Ok, #10 has a new fuse_eqc based on an eqc_component model. This model can handle a fault_injection type fuse and will verify the RNG components needed to support this issue as well. Things needed:

  • fuse:install/2 must be taught that fault_injection fuses are valid, in the model and in the code.
  • A fault_injection fuse does not push ok to the ETS table, but {gradual, Rate}. The fuse:ask/1 command has already been taught to handle this.
  • The code must pass the model. Once it does, there is no other task left, and we have fault_injection type fuses implemented.

The model has been taught about installing and handling fuses of fault_injection type. This completes the model. We just need to handle the code itself.
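A hedged sketch of how the code's ask path could interpret the {gradual, Rate} entry mentioned above (illustrative only, not the actual fuse internals):

    %% Sketch: interpret the per-fuse ETS state. For a fault_injection fuse,
    %% a {gradual, Rate} entry injects a fault for roughly Rate of all asks.
    interpret_state(ok) -> ok;
    interpret_state(blown) -> blown;
    interpret_state({gradual, Rate}) ->
        case rand:uniform() < Rate of
            true  -> blown;   %% injected fault
            false -> ok
        end.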

Looking forward:

  • Use eqc_component to directly run the timing component. Create a cluster with timing.
  • When a timer triggers, keep track of the current gradual scheme in the model. Make sure the correct timer is applied and verify that the system sets the right gradual timing component.
  • Check for 1.0 and replace that with the default "ok" state of the fuse (ok or {gradual, Rate}).
  • There are probably validity rules for gradual schemes which we had better define and check.

We can also take another approach: instead of adding delay to the "ok" state, we can fail even faster if we are in a "gradual interval":

  1. The fuse enters the blown state.
  2. After the reset interval it passes to ok.
  3. If it fails within this short interval of time, we go back to blown; otherwise we keep the ok.
  4. After a gradual interval without a melt, we are 100% operational.

With this approach, if the backend service recovers well, we do not lose requests; but if the service starts to fail again, we have the chance to fail fast/sooner and back off for some short period of time (depending on the failure rate). If the period between failures within the gradual interval is small, we get the chance to back off for more time.

This fuse could be a fail_fast type.

I hope this idea is clear enough :)

I think it would make sense that in a "gradual" setting, we immediately fall back to error if it fails. I also think we can implement this with an update_counter/3 call on the ETS table without accidentally ending up letting too many through. Of course, given the async context, we can't necessarily be totally free of races, but that is okay.
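A minimal sketch of the counter idea, assuming a row keyed {Name, test_slots} was seeded with K when the fuse entered its test state (the table and key layout are made up; the later comments explain why a bare counter is not sufficient on its own):

    %% Atomically hand out at most K test slots. The counter starts at K;
    %% once it drops below zero it is clamped at -1 and later asks are denied.
    case ets:update_counter(fuse_state, {Name, test_slots}, {2, -1, -1, -1}) of
        N when N >= 0 -> {ok, test_probe};   %% one of the K probe requests
        _             -> blown               %% budget exhausted: fail fast
    end.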

Perhaps with a bit more thinking, it is possible to figure out how this fuse type can be added to the system.

The reset policy is a command language. You give commands [C1, C2, C3, ...]. The possible commands are:

  • {delay, Ms} - Delay command processing for Ms milliseconds. After that, proceed to the next command in the sequence.
  • {test, N} - Let N requests through. If they all complete without error, go on in the command sequence. Otherwise, start the command sequence over. This should be implementable with an ETS update_counter/3,4 style call sequence.
  • {gradual, Rate} - If Rate = 0.05 we are letting 5% of all requests through to the service from here on in the command sequence. If they fail, they are subject to the standard Period/Intensity calculations.
  • heal - Heal the fuse completely.

The standard {reset, Ms} is encodable as [{delay, Ms}, heal] in this scheme. Gradual ramping is supported, and @posilva's ideas are supported as well. You can get any mix possible: e.g., [{delay, 60*1000}, {test, 3}, {gradual, 0.25}, {delay, 10*1000}, {gradual, 0.5}, {delay, 10*1000}, heal] would:

  • Wait 60 seconds upon failure
  • Test 3 requests; if all pass:
  • Ramp to 25% of all requests accepted for 10 seconds
  • Ramp to 50% of all requests accepted for 10 seconds
  • Fully heal the fuse and go back to operational.
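Written out as a single term (hypothetical: the current install API only accepts a {reset, Ms} style reset), that policy would be:

    %% Hypothetical reset policy; fuse:install/2 does not accept this today.
    Policy = [{delay, 60*1000}, {test, 3},
              {gradual, 0.25}, {delay, 10*1000},
              {gradual, 0.5},  {delay, 10*1000},
              heal],
    ok = fuse:install(dal_api, {{standard, 5, 10000}, Policy}).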

Hey,

@jlouis with the concept of a reset command sequence, this circuit breaker can deal with any type of backend recovery policy; under pressure we can reinstall/reconfigure and adapt to a specific recovery sequence. And of course this is also a good tool to test "frontend" systems' behaviour when subject to "backend" failures.

Nice suggestion!!!

This reset sequence could be implemented with a gen_fsm?

The way to implement this is to first support a simpler variant, namely a reset policy {test, K, Ms} which will later expand into the notion of [{delay, Ms}, {test, K}] internally. It is a nice stepping stone toward the final solution we want, and we can then test the fuse behavior without having to implement all of the language in the first place.
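A sketch of that internal expansion (the function name here is made up):

    %% Sketch: normalise reset policies into the command language.
    expand_reset({reset, Ms})   -> [{delay, Ms}, heal];
    expand_reset({test, K, Ms}) -> [{delay, Ms}, {test, K}];
    expand_reset(Cmds) when is_list(Cmds) -> Cmds.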

First, we need to update the model. We need to stop tracking the blown state and directly calculate it from the melt history. This is more functional and has fewer moving parts. This allows us to add another way to track that the fuse is in a testing state, which becomes a special state of its own.

But by removing the blown tracking first and using the melt_history we can avoid having to specify a lot of the interaction between the two states. This hopefully simplifies the model and makes it easier to get correct.

The model update is #11 and it vastly simplifies the model.

The next step is to add tracking in the model of a fuse being in the test state:

  • You can install fuses with the {test, K, Ms} keyword.
  • Once such a fuse "heals", it lets K requests through.
  • If any of those requests fails, it immediately fails again.
  • If all those requests succeed, the fuse heals and becomes operational again.

It turns out we cannot use #11, so it is back to the drawing board, probably by accepting the complexity of the model and then adding the {test, K, Ms} setting.

The test command is fundamentally hard to add:

  • You need to make a call into the fuse_server. Otherwise we cannot know when we have given out enough test requests.
  • We cannot rely solely on a decrementing solution, because processes can die (so we need a monitor on them). Otherwise, we can have a deadlock if a process hangs forever.
  • We need to have a kind of "timeout" on a test request. Otherwise, a process can lock the fuse forever, which we don't want.
  • We either need a way for a fuse to say that it has succeeded, or we need an internal timing construction: Hand out K tests. If no melt is heard within Ms milliseconds, heal the fuse. But this requires you know what the typical timeout is, and when you are in that window of a possible timeout, you can't heal the fuse in the meantime.

It is possible to model all of these considerations, but it gets quite nasty, since we will need eqc_component ?BLOCK style features in order to handle this, as well as sessioned ask/1 commands which may be followed by a melt/1. In other words, adding a {test, ...} style check has some quite severe and deep implications for fuse. We need to think a bit more on this. The {gradual, ...} solution is not limited by these considerations at all, since it only cares about normal melt errors.

New idea, inspired by @lehoff in a loose way:

The thing that is hard with a {test, K, Ms} style command is that it starts to encode a lot of policy about what a test is into the fuse_server. This is hard to model, since all of a sudden, the model needs to take care of not only the fuse_server, but also any calling process, as they become part of the correctness of the system as a whole.

But if we supported {barrier, Term} we could make things work out I think:

  • The {barrier, Term} stops the fuse at that point in its processing. It returns {barrier, Term, Token} when asked. This is a cue to go to your own layered process and solve the barrier. Once you have solved the barrier tests, you either melt the fuse so it explodes again, or you call fuse:unlock_barrier(Name, Token) which unlocks the barrier and the fuse continues its processing.

In turn, we can now support any model you can think up outside the scope of the fuse system itself. A {test, K, Ms} model is fairly easy to set up, if you just keep a process around that lets K callers try, then unlocks if all of them are good, or melts the fuse into the blown state if any of them are bad. It keeps the fuse model fairly simple by itself. It punts the hard policy part to a system of your own, and it can support many complex models easily. Most importantly, it is easy to model.
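For example, a {test, K}-style policy layered on top could be resolved roughly like this (fuse:unlock_barrier/2 is the proposed API above, not an existing function; Probe is a caller-supplied fun that runs one trial request):

    %% Sketch: run K probe requests yourself once the fuse is at the barrier.
    %% All pass -> unlock, and the fuse continues its command sequence.
    %% Any fail -> melt, and the fuse blows again.
    resolve_barrier(Name, Token, K, Probe) ->
        case lists:all(fun(_) -> Probe() =:= ok end, lists:seq(1, K)) of
            true  -> fuse:unlock_barrier(Name, Token);
            false -> fuse:melt(Name)
        end.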

With this model, the caller process uses {barrier, Term} to control the flow at a certain point in time, for example after the fuse has blown, and after the reset timeout the state returned when asked would be {barrier, Term, Token}. Or could we configure the reset to automatically become {barrier, Term} after the reset timeout?

Now we have {reset, Ms} for standard fuses, and we could have a controlled_fuse with {reset, Ms, {barrier, Term}}: after the reset timeout the fuse goes to the barrier state, and while it is in that state (until we call fuse:unlock_barrier(Name, Token)) any melt sends it back to the blown state (otherwise we would not fail fast and would have to wait for all the melts again).

An update: I added timing to the EQC model in #12 which has uncovered some bugs in fuse w.r.t. timing. I think @zeeshanlakhani / @lehoff might be interested in these and fixes of them going forward, so just pinging them :)

I'll probably build a point-release, but I'm not sure it will fly on release 16 yet. Backwards compatibility should be fairly easy though since there should be a time-compat module and a rand-compat module for handling the backporting. The rest of the code should be R16 safe, I think.

The problem is somewhat benign: if a fuse is melted too much just as it blows, then more than a single timer is set on the fuse. This can lead to fun situations when the timers clear again, but I don't think it will cause problems in practice. However, that world is somewhat undefined behavior :)

If you want me to track the state explicitly for, say, Basho, just open an issue on this repo.

On this issue though: #12 implements the necessary timing scaffolding which eventually lets us model the proposal in this issue. It is a prerequisite step, since it puts timing under the wings of the model, and we now control time explicitly in an EQC component based cluster.

With #12 implemented, we can start modeling the real code for the system. This comment describes what is needed:

First, we must introduce the notion of a command list in the model. Given a fuse, its reset policy is given as a list of commands, and we have a "next" command which explains the current state of the fuse. If we have, say, [{delay, 30}] and we get a timing event for the delay, we proceed to the state [] and then carry this out by healing the fuse.

When the fuse is blown, we start processing this list. We introduce an internal call to "process commands" which then places the model in the correct state. We can implement this in the model without altering the SUT, and we can make it "backwards compatible". Once this is in place, we have the necessary stepping stone to implement the remainder of commands.

  • {delay, Ms} will set a timer event, and then proceed. If the timer triggers, the head must be {delay, Ms} and it is chopped off the command list and we process the remaining command list.
  • heal is not needed. The empty list encodes heals.
  • {gradual, Level} is implemented by altering the fuse state to a gradual fuse, then proceeding by executing the next command.
  • {barrier, Term} puts the fuse into the barrier state. The fuse stays in this state until an unlock_barrier/2 command is executed against the fuse. Once unlocked, the remaining commands proceed to execute.
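A rough sketch of how that command processing could look (the #fuse{} record, its fields, and the timer message are assumptions about the model, not the real code):

    %% Sketch of command processing; #fuse{} is a made-up record here.
    -record(fuse, {name, state = ok, cmds = []}).

    process_cmds([], Fuse) ->
        Fuse#fuse{state = ok, cmds = []};                 %% the empty list encodes heal
    process_cmds([{delay, Ms} | Rest], Fuse) ->
        erlang:send_after(Ms, self(), {cmd_timer, Fuse#fuse.name}),
        Fuse#fuse{cmds = Rest};                           %% resume when the timer fires
    process_cmds([{gradual, Level} | Rest], Fuse) ->
        process_cmds(Rest, Fuse#fuse{state = {gradual, Level}});
    process_cmds([{barrier, Term} | Rest], Fuse) ->
        Fuse#fuse{state = {barrier, Term}, cmds = Rest}.  %% wait for unlock_barrier/2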

We have the first part of a command processor for this issue. It is implemented in #14 and is going into the model soon.

We have delay implemented, but we still need to figure out the correct thing to do for gradual and barrier in the model.

The way you handle this is to alter lookup_fuse (an internal command in the model) such that it looks at the current fuse state and hands back that state to the system. The fuse state is then encoded as part of a data type:

  • OK
  • BLOWN
  • {GRADUAL, Percentile}
  • {BARRIER, Term}

Command processing changes these states accordingly in the fuse, and lookup_fuse consults these states to figure out what is happening.

New finding:

We need to simplify the model first. To do this, we must introduce a new record, #fuse{}, which tracks the state of a fuse. Then we need to take the current states, disabled and blown, and move them into the fuse record itself.

Once this is complete, it is far easier to handle the above scheme, without going mad trying to do so. Also, the simplification will make it easier to extend the system later on.

Managed to simplify half of the model now. Still need to simplify disabled and melt states into fuses.

Disabled has now been folded into the fuse state. Still need to work on melt.

hi @jlouis , thank you for this very useful library. I was wondering if the half-open state of the circuit breaker eventually got implemented in fuse. I see one or two references to gradual in code such as this. I don't see it documented in the API reference or tutorial, though. Can you please confirm whether the half-open state is supported by fuse? Thanks.

@ahmadferdous unfortunately it's not there, yet. There is some test-code scaffolding in place to make sure it will work, but the code itself doesn't really support this notion as of now. It's one of those things I've been interested in doing at some point, but I got distracted with other stuff for a couple of years, heh.

Necromancy! Hitting this with a Necrobolt of work :)

The model has been brought up-to-date, and we are now processing "standard" fuses as a command list. In particular delay has been implemented. The next part is to implement the gradual command I think.

Great news