This document aims to provide a convenient overview of all resilience patterns that are supported by Riptide. A complete example configuration can be found here.
Isolate elements of an application into pools so that if one fails, the others will continue to function. This pattern is named Bulkhead because it resembles the sectioned partitions of a ship's hull. If the hull of a ship is compromised, only the damaged section fills with water, which prevents the ship from sinking.
Riptide supports the bulkhead pattern with isolated, fixed-size thread pools and bounded work queues per client.
riptide.clients:
  example:
    thread-pool:
      min-size: 4
      max-size: 16
      keep-alive: 1 minute
      queue-size: 0
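To make the bulkhead semantics concrete, the settings above behave roughly like the following plain `ThreadPoolExecutor` (a sketch for illustration, not Riptide's actual wiring): a `queue-size` of 0 corresponds to a direct hand-off, so new requests are rejected as soon as all threads are busy instead of queuing up.

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// min-size: 4, max-size: 16, keep-alive: 1 minute
// queue-size: 0 → direct hand-off; excess tasks are rejected rather than queued
ThreadPoolExecutor pool = new ThreadPoolExecutor(
        4, 16, 1, TimeUnit.MINUTES, new SynchronousQueue<>());
```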
[..] transient faults, such as slow network connections, timeouts, or the resources being overcommitted or temporarily unavailable [..]
The riptide-faults module provides a set of TransientFaults predicates that detect transient faults:
Http.builder()
        .requestFactory(new HttpComponentsClientHttpRequestFactory())
        .plugin(new FailsafePlugin()
                .withPolicy(new RetryRequestPolicy(
                        RetryPolicy.<ClientHttpResponse>builder()
                                .handleIf(CheckedPredicateConverter.toCheckedPredicate(transientSocketFaults()))
                                .build())));
riptide.clients:
  example:
    transient-fault-detection.enabled: true
Enable an application to handle transient failures when it tries to connect to a service or network resource, by transparently retrying a failed operation. This can improve the stability of the application.
Provided by riptide-failsafe
riptide.clients:
  example:
    retry:
      fixed-delay: 50 milliseconds
      max-retries: 5
      max-duration: 2 seconds
      jitter: 25 milliseconds
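As a quick sanity check on these numbers (plain arithmetic, nothing Riptide-specific): each backoff lasts between fixed-delay − jitter and fixed-delay + jitter, so even with all 5 retries at the maximum delay, the waiting adds up to far less than the configured max-duration.

```java
long fixedDelayMs = 50, jitterMs = 25;
int maxRetries = 5;
// each delay lies in [25 ms, 75 ms]; across 5 retries that is at most:
long worstCaseDelayMs = maxRetries * (fixedDelayMs + jitterMs); // 375 ms < 2 seconds
```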
Handle faults that might take a variable amount of time to recover from, when connecting to a remote service or resource. This can improve the stability and resiliency of an application.
Provided by riptide-failsafe
riptide.clients:
  example:
    circuit-breaker:
      failure-threshold: 3 out of 5
      delay: 30 seconds
      success-threshold: 5 out of 5
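To illustrate what the failure-threshold means, here is a toy breaker (deliberately simplified, not riptide-failsafe's implementation; the delay and the success-threshold/half-open handling are omitted) that opens once 3 of the last 5 recorded calls have failed:

```java
import java.util.ArrayDeque;

class ToyBreaker {
    private final ArrayDeque<Boolean> window = new ArrayDeque<>(); // most recent outcomes
    private boolean open;

    void record(boolean success) {
        window.addLast(success);
        if (window.size() > 5) window.removeFirst(); // keep a sliding window of 5 calls
        long failures = window.stream().filter(s -> !s).count();
        open = failures >= 3; // failure-threshold: 3 out of 5
    }

    boolean allowsRequests() {
        return !open;
    }
}
```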
A simple way to curb latency variability is to issue the same request to multiple replicas and use the results from whichever replica responds first. [..] defer sending a secondary request until the first request has been outstanding for more than the 95th-percentile expected latency for this class of requests
Provided as BackupRequest policy in riptide-failsafe
riptide.clients:
  example:
    backup-request:
      delay: 75 milliseconds
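The idea can be sketched with plain CompletableFutures (an illustration of the pattern, not riptide-failsafe's implementation): start the primary call right away, schedule an identical backup call after the configured delay, and use whichever result arrives first.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

Supplier<String> call = () -> "response"; // stands in for the actual remote call

// the primary request starts immediately
CompletableFuture<String> primary = CompletableFuture.supplyAsync(call::get);
// the backup request only starts once 75 ms have passed
CompletableFuture<String> backup = CompletableFuture.supplyAsync(call::get,
        CompletableFuture.delayedExecutor(75, TimeUnit.MILLISECONDS));
// whichever completes first wins; the slower one is simply ignored
String result = primary.applyToEither(backup, r -> r).join();
```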
Provided by:
- riptide-core
- CompletableFuture.exceptionally(Function)
- faux-pas
Capture<User> capture = Capture.empty();
http.get("/me")
.dispatch(series(),
on(SUCCESSFUL).call(User.class, capture),
on(CLIENT_ERROR).dispatch(status(),
on(UNAUTHORIZED).call(() ->
capture.accept(new Anonymous()))))
.thenApply(capture)
.exceptionally(e -> new Unknown());
Some resilience patterns like retries, queuing, backup requests and fallbacks introduce delays. Configuring connect and socket timeouts is often not enough in those cases: you also need to consider the maximum number of retries, exponential backoff delays, jitter, etc. An easy way out is to set a global timeout that spans everything mentioned before. This feature is provided by riptide-failsafe and Failsafe's Timeout policy.
Given the following sample configuration:
riptide.clients:
  example:
    connect-timeout: 50 milliseconds
    socket-timeout: 25 milliseconds
    retry:
      fixed-delay: 30 milliseconds
      max-retries: 5
      jitter: 15 milliseconds
    backup-request:
      delay: 75 milliseconds
    timeout: 500 milliseconds
Based on the absolute worst-case scenario, one would need to expect:
50ms + 25ms + 75ms + 5 × (50ms + 25ms + 30ms + 15ms) = 750ms
This calculation does not factor in delays caused by queuing, nor the time it takes to process the response, e.g. deserialization.
Let's assume there is a budget of 500 ms that can be spent on this remote call. The calculation above shows that the worst-case scenario extends way beyond that budget. But it's very unlikely that you'll hit the absolute worst case: if a connection can be established, it won't take exactly 50 ms every time, and the jitter varies each retry delay between 15 and 45 ms, 30 ms on average. For example, let's assume you connect within 1 ms, but hit the socket timeout 5 times in a row:
1ms + 25ms + 75ms + 5 × (1ms + 25ms + 30ms) = 381ms
Your budget allows for this, so there is no reason not to use it: setting a timeout lets you make the most of your budget while minimizing the risk of exceeding it.
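The two calculations above, written out as plain arithmetic:

```java
long connect = 50, socket = 25, backupDelay = 75;

// absolute worst case: slowest connect and socket timeout on every attempt,
// plus fixed-delay and maximum jitter between retries
long worstCase = connect + socket + backupDelay
        + 5 * (connect + socket + 30 + 15); // 750 ms, beyond the 500 ms budget

// more likely case: 1 ms connects, socket timeout 5 times, average jitter
long likely = 1 + socket + backupDelay
        + 5 * (1 + socket + 30); // 381 ms, within budget
```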