Control the service restart after blown #9
Comments
Yes, this is a good idea, which I've also considered implementing. The problem with its implementation is "how are you going to build a quickcheck model for it?". You need to come up with a good way of describing what "gradually become ok" means, and hopefully in a "deterministic" way. One way of doing so is to control the RNG from the model so you can decide what the outcome of each RNG lookup is. The other problem is how you are going to let a few through. The fuse is an ETS table lookup, so if you flip that to 'ok' then the system will almost surely let a few through. So you would need some kind of "{gradual, Pct}" for some percentage, with the RNG controlled by the model. This, and also its cousin of manually being able to disable/reenable fuses, are probably two of the most needed features. If you come up with a better scheme, I can try to figure out if I can build a QC model for that. |
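To make the "RNG controlled by the model" point concrete, here is a minimal Python sketch of a percentage gate whose random source is injected. It is not the fuse API: `GradualGate`, `scripted_rng`, and the interpretation of `Pct` as a fraction between 0.0 and 1.0 are all assumptions made for illustration. The only point is that once the test supplies the draws, the outcome of every `ask` is known in advance.

```python
import random
from typing import Callable, Iterable


class GradualGate:
    """Toy stand-in for a fuse in a hypothetical {gradual, Pct} state."""

    def __init__(self, pct: float, rng: Callable[[], float] = random.random):
        if not 0.0 <= pct <= 1.0:
            raise ValueError("pct must be within [0.0, 1.0]")
        self.pct = pct
        self.rng = rng  # injected so a test or model can decide each draw

    def ask(self) -> str:
        # Let the request through only when the draw falls below pct.
        return "ok" if self.rng() < self.pct else "blown"


def scripted_rng(draws: Iterable[float]) -> Callable[[], float]:
    """Return an RNG replacement that replays a fixed sequence of draws."""
    it = iter(draws)
    return lambda: next(it)


# The "model" decides the draws, so the expected answers are deterministic.
gate = GradualGate(0.25, rng=scripted_rng([0.10, 0.90, 0.24, 0.25]))
results = [gate.ask() for _ in range(4)]
print(results)  # ['ok', 'blown', 'ok', 'blown']
```

In production the default `random.random` would be used; in a property test the scripted source takes its place, which is what makes the expected result computable by a model.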
Ok, this is doable if we just control the RNG in the test cases, which is fairly easy. What do you think the configuration should look like? I think there are a number of things here:
I'm pretty sure I could build a quickcheck model for this kind of system, since I can mock the RNG and control its outcome, so I can say what the system should do in the different cases. I could also improve the timing mocking for this. |
More thoughts:
|
Some implementation plan for a QC model:
A first implementation should probably support a new type of fuse Once you have support for this, it should be easy to add gradual ramping to the system. The price to pay is the parallel invocation models for this change, as they cannot be handled by such a system. So we would have to keep a parallel model around separately for this. |
Hi, I am not familiar with QuickCheck, but now I have a good chance to learn about it. As soon as I have a model designed and something to show, I will let you know. |
We already have most of the model in |
Yeah, we've discussed something similar w/ our use of Fuse to handle Solr (and other third-party systems) issues (w/ solr_cores) under load. Being able to gradually pass from blown->ok would be a better model of how we expect our fuse-wrapped operations to eventually resolve. I'd be down for reviewing and/or helping w/ QC if there are questions too when I'm back around next week. |
One important observation is that a standard fuse with a reset of |
@jlouis yep... that observation makes 100% sense to me :). |
Ok, #10 has a new
|
The model has been taught about installing and handling fuses of Looking forward:
|
We can also take another approach: instead of adding a delay to the "ok" state, we can fail even faster if we are in a "gradual interval":
With this approach, if the backend service recovers well, we do not lose requests, but if the service starts to fail again we have the chance to fail fast and back off for a short period of time (depending on the failure rate). If the period between fails is small in the This fuse could be a I hope this idea is clear enough :) |
I think it would make sense that in a "gradual" setting, we immediately fall back to error if it fails. I also think we can implement this with an Perhaps with a bit more thinking, it is possible to figure out how this fuse type can be added to the system. |
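As a rough illustration of "immediately fall back to error if it fails", the following Python sketch models a fuse with an extra recovery state in which a single melt blows it again. Everything here is hypothetical: `ToyFuse`, `begin_recovery`, and the state names are not part of fuse, and timing is left out entirely.

```python
class ToyFuse:
    """Toy three-state fuse: ok -> blown -> gradual -> ok."""

    def __init__(self, max_melts: int):
        self.max_melts = max_melts  # melts tolerated while in "ok"
        self.melts = 0
        self.state = "ok"

    def melt(self) -> None:
        if self.state == "gradual":
            # During recovery there is no tolerance: one failure blows again.
            self.state = "blown"
            self.melts = 0
        elif self.state == "ok":
            self.melts += 1
            if self.melts > self.max_melts:
                self.state = "blown"
                self.melts = 0
        # Melting an already blown fuse changes nothing in this sketch.

    def begin_recovery(self) -> None:
        # Would be driven by a timer in a real system.
        if self.state == "blown":
            self.state = "gradual"

    def heal(self) -> None:
        # Only a fuse that survived the gradual phase returns to "ok".
        if self.state == "gradual":
            self.state = "ok"


fuse = ToyFuse(max_melts=2)
for _ in range(3):
    fuse.melt()
states = [fuse.state]        # blown after exceeding the tolerance
fuse.begin_recovery()
states.append(fuse.state)    # gradual
fuse.melt()
states.append(fuse.state)    # a single failure blows it again
fuse.begin_recovery()
fuse.heal()
states.append(fuse.state)    # recovered
print(states)  # ['blown', 'gradual', 'blown', 'ok']
```

The asymmetry is the design point: the normal state tolerates a burst of failures, while the recovery state treats the first failure as evidence that the backend is still unhealthy.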
The reset policy is a command language. You give commands
The standard
|
Hey, @jlouis with the concept of having a Nice suggestion!!! This reset sequence could be implemented with a gen_fsm? |
The way to implement this is to first support a simpler variant, namely a reset policy |
First, we need to update the model. We need to stop tracking the But by removing the |
The model update is #11 and it vastly simplifies the model. The next step is to add a tracking in the model of a fuse being in the
|
It turns out we cannot use #11, so it is back to the drawing board, probably by accepting the complexity of the model and then adding the |
The
It is possible to model all of these considerations, but it gets quite nasty since we will need |
New idea, inspired by @lehoff in a loose way: The thing that is hard with a But if we supported
In turn, we can now support any model you can think up outside the scope of the fuse system itself. A |
With this model, the caller process has to call Now we have |
An update: I added timing to the EQC model in #12 which has uncovered some bugs in I'll probably build a point-release, but I'm not sure it will fly on release 16 yet. Backwards compatibility should be fairly easy though since there should be a time-compat module and a rand-compat module for handling the backporting. The rest of the code should be R16 safe, I think. The problem is somewhat benign: if a fuse is melted too much just as it blows, then more than a single timer is set on the fuse. This can lead to fun situations when the timer clears again, but I don't think it will. However, that world is somewhat undefined behavior :) If you want me to track the state explicitly for, say, Basho, just open an issue on this repo. |
On this issue though: #12 implements the necessary timing scaffolding which eventually lets us model the proposal in this issue. It is a prerequisite step since it puts timing under the wings of the model and we now control time explicitly in an EQC component-based cluster. |
With #12 implemented, we can start modeling the real code for the system. This comment describes what is needed: First, we must introduce a notion in the model of a command list. Given a fuse, its reset policy is given as a list of commands, and we have a "next" command which explains the current state of the fuse. If we have, say, When the fuse is blown, we start processing this list. We introduce an internal call to "process commands" which then places the model in the correct state. We can implement this in the model without altering the SUT, and we can make it "backwards compatible". Once this is in place, we have the necessary stepping stone to implement the remainder of the commands.
|
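The command-list idea can be sketched in Python as a small interpreter driven by an explicit, caller-supplied clock. The command vocabulary used here (`("delay", ms)`, `("gradual", pct)`, `("heal",)`) and the names `CommandFuse`, `blow`, and `process_commands` are assumptions for illustration only; the thread does not fix the actual command set in the text that survives.

```python
from collections import deque


class CommandFuse:
    """Toy fuse whose reset policy is a list of commands."""

    def __init__(self, reset_policy):
        self.reset_policy = list(reset_policy)
        self.pending = deque()   # commands still to run; pending[0] is "next"
        self.state = "ok"
        self.wake_at = None      # model time at which the current delay ends

    def blow(self, now: int) -> None:
        # Blowing the fuse starts processing the policy from the beginning.
        self.state = "blown"
        self.pending = deque(self.reset_policy)
        self.wake_at = None
        self.process_commands(now)

    def process_commands(self, now: int) -> None:
        """Run commands until one has to wait or the list is exhausted."""
        if self.wake_at is not None:
            if now < self.wake_at:
                return           # still inside a delay, nothing to do
            self.wake_at = None
        while self.pending:
            cmd = self.pending.popleft()
            if cmd[0] == "delay":
                self.wake_at = now + cmd[1]
                return
            elif cmd[0] == "gradual":
                self.state = ("gradual", cmd[1])
            elif cmd[0] == "heal":
                self.state = "ok"
            else:
                raise ValueError(f"unknown command: {cmd!r}")


policy = [("delay", 5000), ("gradual", 0.5), ("delay", 5000), ("heal",)]
fuse = CommandFuse(policy)
fuse.blow(now=0)
trace = [fuse.state]             # blown, waiting out the first delay
fuse.process_commands(now=4999)
trace.append(fuse.state)         # delay not over yet
fuse.process_commands(now=5000)
trace.append(fuse.state)         # gradual phase entered
fuse.process_commands(now=10000)
trace.append(fuse.state)         # healed
print(trace)  # ['blown', 'blown', ('gradual', 0.5), 'ok']
```

Under this hypothetical vocabulary, a plain timed reset could be written as a delay followed by a heal, which is one way to read the "backwards compatible" remark. Because time is passed in as `now`, a model can step the clock explicitly and predict the state after each call.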
We have the first part of a command processor for this issue. It is implemented in #14 and is going into the model soon. |
We have |
The way you handle this is to alter
Command processing changes these states accordingly in the fuse, and |
New finding: We need to simplify the model first. To do this, we must introduce a new record Once this is complete, it is far easier to handle the above scheme, without going mad trying to do so. Also, the simplification will make it easier to extend the system later on. |
Managed to simplify half of the model now. Still need to simplify |
Disabled has now been folded into the fuse state. Still need to work on |
hi @jlouis , thank you for this very useful library. I was wondering if the half-open state of circuit breaker eventually got implemented in fuse. I see one or two references to |
@ahmadferdous unfortunately it's not there yet. There is some test-code scaffolding in place to make sure it will work, but the code itself doesn't really support this notion as of now. It's one of those things I've been interested in doing at some point, but I got distracted with other stuff for a couple of years, heh. |
Necromancy! Hitting this with a Necrobolt of work :) The model has been brought up-to-date, and we are now processing "standard" fuses as a command list. In particular |
Great news |
Hi,
I am using fuse to control access to a backend service (DAL API). It would be nice to have a good way of passing from blown to ok gradually. Otherwise, if the backend is under load (502/503), for example, all the requests will hit it again at the same time after the "heal" interval, which can cause problems again.
Thank you,
(I will be able to implement a solution if you think that may be useful)
Pedro