Error rate circuit breaker #264

Open
wants to merge 11 commits into
base: main

@RyanAD RyanAD commented Apr 2, 2020

This PR adds a circuit breaker that opens when the error percentage crosses some threshold (e.g., 25% of calls failed in the last 10 seconds), also discussed here: #245. Ideally, it solves the problem mentioned in this comment: #245 (comment). We have unpredictable spikes in traffic, which make tuning an absolute number of errors in a time window difficult.
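
As a rough illustration of the idea (not the code in this PR), an error-percentage check over a sliding time window could be sketched as follows. The class name `ErrorRateWindow`, its keyword arguments, and the choice to open at "greater than or equal to" the threshold are all assumptions made for this example.

```ruby
# Hypothetical sketch of an error-percentage window; names are illustrative,
# not Semian's API.
class ErrorRateWindow
  def initialize(window_seconds:, error_percent_threshold:,
                 time_source: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @window_seconds = window_seconds
    @error_percent_threshold = error_percent_threshold
    @time_source = time_source
    @events = [] # [timestamp, success] pairs, oldest first
  end

  # Record one call outcome (true for success, false for failure).
  def record(success)
    now = @time_source.call
    @events << [now, success]
    prune(now)
  end

  # Fraction of calls inside the window that failed (0.0 when there are none).
  def error_rate
    prune(@time_source.call)
    return 0.0 if @events.empty?

    errors = @events.count { |_time, success| !success }
    errors.to_f / @events.size
  end

  # Assumption: the circuit opens when the rate reaches the threshold.
  def open?
    error_rate >= @error_percent_threshold
  end

  private

  # Drop events older than the window.
  def prune(now)
    cutoff = now - @window_seconds
    @events.shift while @events.any? && @events.first[0] < cutoff
  end
end
```

Because the check is a ratio rather than an absolute count, the same threshold applies whether the window holds ten calls or ten thousand, which is the property the traffic-spike problem calls for.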

There are some methods that technically violate the DRY principle; in particular, the acquire and maybe_with_half_open_resource_timeout methods in lib/semian/error_rate_circuit_breaker.rb and lib/semian/circuit_breaker.rb are the same. I'm open to suggestions on whether this should be fixed, and the best way to do so.

This has not been battle-tested in a high volume production environment yet, and I'm still working on documentation. Also working on using Timecop in the new unit tests.

@ghost ghost added the cla-needed label Apr 2, 2020
@RyanAD RyanAD force-pushed the error-rate-circuit-breaker branch from f17bd19 to bad8a75 April 3, 2020 17:18
@ghost ghost removed the cla-needed label Apr 3, 2020
@damianthe damianthe self-requested a review April 7, 2020 14:26
Contributor

@damianthe damianthe left a comment

I really like the concept of this PR at a high level. It would be amazing to have a more intuitive way to configure the circuit breaker.

However, I don't think that this change actually makes it more intuitive (see comments below). I'm open to alternative approaches or suggestions. Please let me know what you think.


def maybe_with_half_open_resource_timeout(resource, &block)
result =
if half_open? && @half_open_resource_timeout && resource.respond_to?(:with_resource_timeout)
Contributor

nit: use two spaces for indentation on this block (rather than 4)

end

def error_threshold_reached?
return false if @window.empty? or @window.length < @request_volume_threshold
Contributor

request_volume_threshold is an interesting parameter because its desired value varies depending on the volume of requests.

If the volume of requests happening in window_size ever drops below request_volume_threshold, the circuit will never open.
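
To make the concern concrete, here is a minimal sketch that mirrors the guard quoted above. The standalone method, its keyword arguments, and the array-of-booleans window are assumptions for illustration, not the PR's implementation.

```ruby
# Hypothetical helper mirroring the guard in the diff; not the PR's code.
# `outcomes` holds one boolean per request in the window (true = success).
def error_threshold_reached?(outcomes, request_volume_threshold:, error_percent_threshold:)
  # Too few samples in the window: the threshold is never reported as reached.
  return false if outcomes.empty? || outcomes.length < request_volume_threshold

  errors = outcomes.count { |ok| !ok }
  errors.to_f / outcomes.length >= error_percent_threshold
end

quiet_window = [false] * 4               # 4 requests, all failed
busy_window  = [false] * 4 + [true] * 6  # 10 requests, 4 failed

quiet_result = error_threshold_reached?(quiet_window,
                                        request_volume_threshold: 5,
                                        error_percent_threshold: 0.25)
busy_result = error_threshold_reached?(busy_window,
                                       request_volume_threshold: 5,
                                       error_percent_threshold: 0.25)
puts "quiet window: #{quiet_result}, busy window: #{busy_result}"
```

In this sketch the quiet window fails every single request yet stays below the volume threshold, so the check returns false; the busy window has a lower failure percentage but enough volume to trip it.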

Author

Contributor

@damianthe damianthe Apr 16, 2020

Interesting. I can see how this is useful now. If you configure it to be above time_window / (timeout * request_volume_threshold) then it acts as a damper for the problem I was describing in #264 (comment)

Author

RyanAD commented Apr 9, 2020

> I really like the concept of this PR at a high level. It would be amazing to have a more intuitive way to configure the circuit breaker.
>
> However, I don't think that this change actually makes it more intuitive (see comments below). I'm open to alternative approaches or suggestions. Please let me know what you think.

I was talking about this with a colleague, and we both agreed it would be better to explicitly specify which circuit breaker implementation you want instead of determining it from the options passed in. I plan on making that change. I'm working on other things at the moment, but should have time to cycle back on this in the next week.

Author

RyanAD commented Apr 24, 2020

> I really like the concept of this PR at a high level. It would be amazing to have a more intuitive way to configure the circuit breaker.
>
> However, I don't think that this change actually makes it more intuitive (see comments below). I'm open to alternative approaches or suggestions. Please let me know what you think.

Thank you for the suggestions. I just pushed an implementation that calculates the time spent in error vs success. Please let me know your thoughts.
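
The pushed implementation is not shown in this conversation, so the following is only one possible reading of "time spent in error vs success": each call is weighted by its duration instead of counting as one sample. The class `TimeWeightedErrorRate` and its methods are hypothetical.

```ruby
# Hypothetical sketch: weights each call by how long it took, so slow
# failures (e.g. timeouts) count for more than fast successes.
class TimeWeightedErrorRate
  def initialize
    @error_time = 0.0
    @success_time = 0.0
  end

  # Record one call with its duration in seconds and its outcome.
  def record(duration_seconds, success:)
    if success
      @success_time += duration_seconds
    else
      @error_time += duration_seconds
    end
  end

  # Share of total observed time that was spent in failing calls.
  def error_rate
    total = @error_time + @success_time
    return 0.0 if total.zero?

    @error_time / total
  end
end
```

For example, nine successes of 0.5 s each and one failure that took 1.5 s give a count-based error rate of 10%, while the time-weighted rate in this sketch is 1.5 / 6.0 = 25%.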

Contributor

@damianthe damianthe left a comment

Really nice work. I like the direction this is heading. Just a few minor comments

Contributor

@damianthe damianthe left a comment

Looks good to me. Thank you for adding this feature! Feel free to keep me posted on how it works out for you in a live production environment. 👍

Contributor

miry commented Jun 10, 2022

Hi @RyanAD any progress on this one?

Author

RyanAD commented Jun 10, 2022

We've been successfully using this within Instacart for a while now. Let me confirm we don't have any changes that should be incorporated here. I'll follow up by the end of next week.

Author

RyanAD commented Jun 22, 2022

I pushed two commits for code cleanup and a small thread safety fix in TimeSlidingWindow. This code has been running successfully within Instacart for several months. What are next steps for merging?

Contributor

miry commented Jun 23, 2022

@RyanAD next steps:

  • Add changelog line
  • Rebase and squash against master branch

@miry miry added the Semian label Jun 23, 2022
nap commented Dec 5, 2022

@miry is this still valid? What's missing, besides the conflict, to push this out?

end

def disabled?
ENV['SEMIAN_CIRCUIT_BREAKER_DISABLED'] || ENV['SEMIAN_DISABLED']
Contributor

This needs rethinking against the changes in the master branch. The idea is not to use ENV inside business logic, only during the configuration phase.
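
One way to read that suggestion, sketched under assumptions: read the environment once while building the breaker and pass the result in as plain configuration. The class `ConfiguredBreaker` and its `from_env` constructor are hypothetical; only the two variable names come from the diff above.

```ruby
# Hypothetical sketch: ENV is consulted once, at configuration time.
class ConfiguredBreaker
  # `env` defaults to the real environment; any Hash-like object works,
  # which keeps tests independent of process state.
  def self.from_env(env = ENV)
    # Same truthiness as the diff: any set value, even "0", disables.
    disabled = !!(env['SEMIAN_CIRCUIT_BREAKER_DISABLED'] || env['SEMIAN_DISABLED'])
    new(disabled: disabled)
  end

  def initialize(disabled:)
    @disabled = disabled
  end

  # Business logic only sees the stored flag, never ENV.
  def disabled?
    @disabled
  end
end
```

With this shape, changing the environment after construction has no effect on an existing instance, which makes the breaker's behavior easier to reason about in tests.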

success_threshold: 2,
half_open_resource_timeout: nil,
time_source: -> {Time.now.to_f * 1000})
Timecop.travel(-1.1) do
Contributor

Timecop does not work well with monotonic clocks.
Replaced it with a custom-made helper to travel in time.
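
The helper itself is not shown here. As a hedged sketch of the general approach, a test can inject a controllable clock through the `time_source` option visible in the snippet above instead of patching `Time`. The `FakeClock` class below is hypothetical and is not the helper used in the repository.

```ruby
# Hypothetical test helper; not the helper that replaced Timecop in the repo.
class FakeClock
  def initialize(start_ms = 0.0)
    @now_ms = start_ms
  end

  # Returns a lambda with the same shape as the time_source option above:
  # calling it yields the current time in milliseconds.
  def to_proc
    -> { @now_ms }
  end

  # Move the clock by the given number of seconds (negative moves back).
  def travel(seconds)
    @now_ms += seconds * 1000
  end
end

clock = FakeClock.new(1000.0)
time_source = clock.to_proc
before = time_source.call
clock.travel(2.5)
after = time_source.call
puts "clock moved by #{after - before} ms"
```

Timecop works by overriding `Time`, `Date`, and `DateTime`, so code that reads `Process.clock_gettime(Process::CLOCK_MONOTONIC)` directly is unaffected by it; an injected clock sidesteps that mismatch.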
