Task Circuit Breaker
For the latest version: see here!
The TaskCircuitBreaker
is used to provide stability and prevent
cascading failures in distributed systems.
Purpose #
As an example, we have a web application interacting with a remote third party web service. Let’s say the third party has oversold their capacity and their database melts down under load. Assume that the database fails in such a way that it takes a very long time to hand back an error to the third party web service. This in turn makes calls fail after a long period of time. Back to our web application, the users have noticed that their form submissions take much longer seeming to hang. Well the users do what they know to do which is use the refresh button, adding more requests to their already running requests. This eventually causes the failure of the web application due to resource exhaustion. This will affect all users, even those who are not using functionality dependent on this third party web service.
Introducing circuit breakers on the web service call would cause the requests to begin to fail-fast, letting the user know that something is wrong and that they need not refresh their request. This also confines the failure behavior to only those users that are using functionality dependent on the third party, other users are no longer affected as there is no resource exhaustion. Circuit breakers can also allow savvy developers to mark portions of the site that use the functionality unavailable, or perhaps show some cached content as appropriate while the breaker is open.
How it Works #
The circuit breaker models a concurrent state machine that can be in any of these 3 states:
Closed
: During normal operations or when theTaskCircuitBreaker
starts- Exceptions increment the
failures
counter - Successes reset the
failures
counter to zero - When the
failures
counter reaches themaxFailures
threshold, the breaker is tripped into theOpen
state
- Exceptions increment the
Open
: The circuit breaker rejects all tasks- all tasks fail fast with
ExecutionRejectedException
- after the configured
resetTimeout
, the circuit breaker enters aHalfOpen
state, allowing one task to go through for testing the connection
- all tasks fail fast with
HalfOpen
: The circuit breaker has already allowed a task to go through, as a reset attempt, in order to test the connection- The first task when
Open
has expired is allowed through without failing fast, just before the circuit breaker is evolved into theHalfOpen
state - All tasks attempted in
HalfOpen
fail-fast with an exception just as in theOpen
state - If that task attempt succeeds, the breaker is reset back to the
Closed
state, with theresetTimeout
and thefailures
count also reset to initial values - If the task attempt fails, the breaker is tripped again into the
Open
state (theresetTimeout
is multiplied by the exponential backoff factor, up to the configuredmaxResetTimeout
)
- The first task when
(image credits go to Akka’s documentation)
Usage #
import monix.eval._
import scala.concurrent.duration._
val circuitBreaker = TaskCircuitBreaker(
maxFailures = 5,
resetTimeout = 10.seconds
)
//...
val problematic = Task {
val nr = util.Random.nextInt()
if (nr % 2 == 0) nr else
throw new RuntimeException("dummy")
}
val task = circuitBreaker.protect(problematic)
When attempting to close the circuit breaker and resume normal operations, we can also apply an exponential backoff for repeated failed attempts, like so:
val circuitBreaker = TaskCircuitBreaker(
maxFailures = 5,
resetTimeout = 10.seconds,
exponentialBackoffFactor = 2,
maxResetTimeout = 10.minutes
)
In this sample we attempt to reconnect after 10 seconds, then after 20, 40 and so on, a delay that keeps increasing up to a configurable maximum of 10 minutes.
Credits #
This data type was inspired by the availability of Akka’s Circuit Breaker. The implementation and the API are not the same, but the purpose and the state machine it uses is similar.
This documentation also has copy/pasted fragments from Akka. Credit should be given where credit is due 😉