diff --git a/docs/conf.py b/docs/conf.py
index a3fccb8c5c5..105906d9d4e 100644
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -24,6 +24,12 @@
 smartquotes = False
 nitpicky = True
 
+# -- Markdown configuration -----------------------------------------------
+
+myst_enable_extensions = [
+    "colon_fence"
+]
+
 # -- Options for HTML output ----------------------------------------------
 
 html_theme = "furo"
diff --git a/docs/source-2.0/guides/client-guidance/index.md b/docs/source-2.0/guides/client-guidance/index.md
index 35f8f914fde..d00242125ad 100644
--- a/docs/source-2.0/guides/client-guidance/index.md
+++ b/docs/source-2.0/guides/client-guidance/index.md
@@ -59,4 +59,5 @@ Smithy clients should follow these tenets:
 
 application-protocols/index
 context
+retries
 ```
diff --git a/docs/source-2.0/guides/client-guidance/retries.md b/docs/source-2.0/guides/client-guidance/retries.md
new file mode 100644
index 00000000000..6242943baa8
--- /dev/null
+++ b/docs/source-2.0/guides/client-guidance/retries.md
@@ -0,0 +1,573 @@
+# Retrying Requests
+
+Operation requests might fail for a number of reasons that are unrelated to the
+input parameters, such as a transient network issue or excessive load on the
+service. This guide gives recommendations on how Smithy clients can
+automatically retry failed requests in those cases, and how a robust system for
+pluggable retry strategies can be implemented.
+
+## Why is a retry system recommended?
+
+If transient failures are surfaced to Smithy client users, they will likely
+implement their own retry behavior. This hand-written retry behavior may not
+include important risk mitigations that reduce the impact of outages.
+
+When a service begins to degrade, clients may notice behavior changes, such as
+an elevated number of server errors or connection failures. Users will naturally
+want to retry these failed requests. Unfortunately, these retries have the
+effect of increasing the load on the service, which may cause the service to
+further degrade, resulting in even more retries.
+
+These sorts of events are called **retry storms**, and are often the result of
+poorly managed retry behavior. While there is no perfect retry behavior,
+strategies will inevitably improve over time as the scale of systems grows and
+new cascading failure conditions are observed. The right interface reflecting
+the problem domain can make sure the right extension points are available for
+future expansion.
+
+## Retry interfaces
+
+It is recommended to expose retry interfaces that aren't coupled to a
+particular implementation or protocol: a `RetryStrategy` that is isolated from
+the state of individual requests, alongside `RetryToken`s that capture that
+state and pass it between attempts.
+
+### Retry token
+
+A `RetryToken` is a bundle of state that is created and passed between attempts
+of a single request. It should indicate how long to wait until the next attempt,
+but should allow each implementation to include whatever state it finds
+necessary. This could include the number of attempts that have been made, an
+identifier for the request, or anything else that is necessary for the retry
+strategy.
+
+```java
+public interface RetryToken {
+    /**
+     * @return the duration to wait until the next attempt is made.
+     */
+    Duration delay();
+}
+```
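+
+For illustration only, a minimal token could be a record that carries the
+attempt count and the computed delay. The name `SimpleRetryToken` is
+hypothetical; the example strategy later in this guide uses a token of
+exactly this shape.
+
+```java
+// A minimal illustrative token. Real implementations may carry more state,
+// such as a request identifier or a reference to shared retry capacity.
+public record SimpleRetryToken(int attempts, Duration delay) implements RetryToken {
+}
+```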
+
+### Retry strategy
+
+A `RetryStrategy` is where the logic of computing delays and determining if a
+request should be retried lives. It encapsulates the state of a request in retry
+tokens, which it creates and refreshes.
+
+```java
+public interface RetryStrategy {
+    /**
+     * Invoked before the first request attempt.
+     *
+     * <p>An initial request will be made even if a token cannot be acquired.
+     * It is recommended to always return a token so that the retry strategy
+     * may continue to track each request and be informed when they succeed or
+     * fail.
+     *
+     * @throws TokenAcquisitionFailedException if a token cannot be acquired.
+     */
+    RetryToken acquireInitialToken();
+
+    /**
+     * Invoked before each subsequent (non-first) request attempt.
+     *
+     * @throws IllegalArgumentException if the provided token was not issued by
+     * this strategy or the provided token was already used for a previous
+     * refresh or success call.
+     *
+     * @throws TokenAcquisitionFailedException if a token cannot be acquired.
+     */
+    RetryToken refreshRetryToken(RetryToken token, Throwable failure);
+
+    /**
+     * Invoked after an attempt succeeds.
+     *
+     * @throws IllegalArgumentException if the provided token was not issued by
+     * this strategy or the provided token was already used for a previous
+     * refresh or success call.
+     */
+    void recordSuccess(RetryToken token);
+}
+```
+
+:::{note}
+
+While the state of a request is intended to be included in the retry token, a
+retry strategy may still need to manage some state that is shared across the
+client. Be careful to ensure that access to that state is synchronized in order
+to prevent race conditions.
+:::
+
+#### Using retry strategies
+
+An initial retry token should be acquired at the beginning of a request, before
+the first attempt is made. If an initial token cannot be acquired, the client
+should still make an attempt. This initial attempt is always made because the
+purpose of the retry strategy is to manage retries, not to gate access to a
+service entirely.
+
+If an attempt fails, the retry strategy is passed the retry token for the
+attempt and given the exception raised by the attempt. If the retry strategy
+determines that the failed attempt may be retried, it will return a refreshed
+retry token to use for the next attempt. If the retry strategy determines that
+the failed attempt may not be retried, it throws an exception.
+
+If the request succeeds, the retry strategy is given the token so that it may
+free up any resources it was using.
+
+The following is a simplified example of what it looks like to use the
+`RetryStrategy` interface to implement a retryable request loop.
+
+```java
+/**
+ * A simplified example of what a retryable request loop looks like.
+ *
+ * @param serializedRequest a request that has been fully serialized and is
+ * ready to send.
+ *
+ * @return a successful result.
+ */
+public Result request(SerializedRequest serializedRequest) throws Exception {
+    // First acquire the initial retry token. If a token cannot be acquired,
+    // make only one attempt without retries.
+    RetryToken retryToken;
+    try {
+        retryToken = this.retryStrategy.acquireInitialToken();
+    } catch (TokenAcquisitionFailedException e) {
+        return send(serializedRequest);
+    }
+
+    // Make attempts until the request succeeds or the retry strategy throws
+    // an exception. Notably, each retry strategy is responsible for controlling
+    // the maximum number of attempts.
+    while (true) {
+
+        // Wait for the indicated delay duration. Even the initial token may
+        // include a delay. Thread.sleep(Duration) requires Java 19 or later.
+        if (retryToken.delay() != null) {
+            Thread.sleep(retryToken.delay());
+        }
+
+        Result result = null;
+        try {
+            result = send(serializedRequest);
+        } catch (Exception e) {
+            // If the request fails, attempt to refresh the retry token.
+            try {
+                retryToken = this.retryStrategy.refreshRetryToken(retryToken, e);
+            } catch (TokenAcquisitionFailedException retryError) {
+                // If the token can't be refreshed, the request fails, so the
+                // original exception needs to be propagated. Logging the reason
+                // the retry failed is advisable.
+                throw e;
+            }
+        }
+
+        // If the result was successful, inform the retry strategy. This allows
+        // it to free up any resources if necessary.
+        if (result != null) {
+            this.retryStrategy.recordSuccess(retryToken);
+            return result;
+        }
+    }
+}
+```
+
+### Retryable errors
+
+Request attempts can throw many different types of exceptions, and it is not
+reasonable to expect retry strategy implementations to be aware of them all. For
+example, different HTTP clients may expose response status codes in different
+ways. Beyond that, a Smithy client may not even be using HTTP as a transport. It
+is therefore recommended to use an interface to standardize the information that
+is relevant to retry strategies. Retry strategies can then use that information
+if it is available while still attempting to handle exceptions that don't have
+that information available.
+
+In particular, exceptions should indicate:
+
+* Whether they are safe to retry. For example, a failure to connect to the
+  service or a temporary service error may be retryable, while an error
+  relating to invalid input or an authentication failure is not.
+* Whether they are a throttling error. That is, whether they are an error
+  returned by a service specifically to indicate that too many requests have
+  been made recently. For HTTP protocols, for example, a `429` status code
+  indicates this.
+* Whether they are a timeout error. That is, whether the error is a result of
+  the service not responding within the transport client's defined timeout
+  limit. For HTTP protocols, a `504` status code could also indicate this.
+* A minimum time to wait until the next request, if that information is
+  available. For HTTP protocols, for example, this could be indicated by the
+  [`Retry-After`](https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/Retry-After)
+  header. A retry strategy may choose a delay that is longer than this hint,
+  but should not choose a shorter one, since the hint is information that
+  comes directly from the service.
+
+```java
+/**
+ * Provides retry-specific information about an error.
+ */
+public interface RetryInfo {
+    /**
+     * Get the decision about whether it's safe to retry the encountered error.
+     *
+     * <p>If the decision is {@link RetrySafety#YES}, it does not mean that a
+     * retry will occur, but rather that a retry is allowed to occur.
+     *
+     * @return whether it's safe to retry.
+     */
+    RetrySafety isRetrySafe();
+
+    /**
+     * Check if the error is a throttling error.
+     *
+     * @return true if the error is a throttling error.
+     */
+    default boolean isThrottle() {
+        return false;
+    }
+
+    /**
+     * Check if the error is a timeout error.
+     *
+     * @return true if the error is a timeout error.
+     */
+    default boolean isTimeout() {
+        return false;
+    }
+
+    /**
+     * Get the amount of time to wait before retrying.
+     *
+     * @return the time to wait before retrying, or null if no hint for a
+     * retry-after was detected.
+     */
+    default Duration retryAfter() {
+        return null;
+    }
+
+    /**
+     * Whether it's safe to retry.
+     */
+    enum RetrySafety {
+        /**
+         * Yes, it is safe to retry this error.
+         */
+        YES,
+
+        /**
+         * No, a retry should not be made because it isn't safe to retry.
+         */
+        NO,
+
+        /**
+         * Not enough information is available to determine if a retry is safe.
+         */
+        MAYBE
+    }
+}
+```
+
+Additionally, Smithy's [`error` trait](#error-trait) indicates whether an error
+is the fault of the client or the server. It is highly recommended to include
+this information in code-generated exception classes. It is also recommended to
+allow this information to be provided for other kinds of exceptions. For
+example, a 400-level status code in an HTTP response indicates a client error,
+while a 500-level status code indicates a server error. This is shown below as a
+separate interface since it is relevant to more than just retries.
+
+```java
+public interface ErrorInfo {
+    /**
+     * Indicates if a client or server is at fault for the error, if known.
+     */
+    ErrorFault fault();
+
+    enum ErrorFault {
+        /**
+         * The client is at fault for this error (e.g., it omitted a required
+         * parameter or sent an invalid request).
+         */
+        CLIENT,
+
+        /**
+         * The server is at fault (e.g., it was unable to connect to a
+         * database, or other unexpected errors occurred).
+         */
+        SERVER,
+
+        /**
+         * The fault can't be attributed to either the client or the server.
+         */
+        OTHER
+    }
+}
+```
+
+Errors that are defined in the Smithy model should have all the properties of
+`RetryInfo` statically generated or settable. For example, the following
+demonstrates a modeled error and what it might look like as a generated Java
+exception class.
+
+```smithy
+@httpError(429)
+@error("client")
+@retryable(throttling: true)
+structure ThrottlingError {
+    message: String
+}
+```
+
+```java
+public class RetryAfterException extends RuntimeException implements ErrorInfo, RetryInfo {
+    private final String message;
+
+    private Duration retryAfter = null;
+
+    public RetryAfterException(String message) {
+        super(message);
+        this.message = message;
+    }
+
+    // This value is code generated based on the error trait.
+    public ErrorFault fault() {
+        return ErrorFault.CLIENT;
+    }
+
+    // This value is code generated based on the retryable trait. If that trait
+    // is not present, this should be NO for "client" errors or MAYBE for
+    // "server" errors.
+    public RetrySafety isRetrySafe() {
+        return RetrySafety.YES;
+    }
+
+    // This value is code generated based on the retryable trait.
+    public boolean isThrottle() {
+        return true;
+    }
+
+    public Duration retryAfter() {
+        return this.retryAfter;
+    }
+
+    // This would be called by the deserializer. For HTTP protocols, for
+    // instance, it would be called if a `retry-after` header is present in the
+    // response.
+    public void setRetryAfter(Duration duration) {
+        this.retryAfter = duration;
+    }
+
+    // This is the modeled message property.
+    public String message() {
+        return this.message;
+    }
+}
+```
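+
+As a sketch of how that setter might be used, a hypothetical HTTP
+deserializer could populate the retry-after hint when the corresponding
+header is present. The `HttpResponse` type and its `firstHeader` accessor
+are assumptions for illustration, not part of any particular Smithy client.
+
+```java
+// Hypothetical deserializer logic; HttpResponse and firstHeader() are
+// illustrative stand-ins for whatever HTTP abstraction the client uses.
+static void applyRetryAfter(HttpResponse response, RetryAfterException error) {
+    String retryAfter = response.firstHeader("retry-after");
+    if (retryAfter != null) {
+        try {
+            // Retry-After is most commonly a number of seconds. It may also
+            // be an HTTP date, which this sketch does not handle.
+            error.setRetryAfter(Duration.ofSeconds(Long.parseLong(retryAfter.trim())));
+        } catch (NumberFormatException e) {
+            // Ignore unparseable values rather than failing deserialization.
+        }
+    }
+}
+```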
+
+## Retry behaviors
+
+The most basic retry behavior is a simple loop with no delays between attempts.
+This is the behavior most likely to contribute to retry storms, but a fixed
+delay between attempts can be just as bad because it can result in spikes of
+requests from the same system.
+
+Instead of a fixed delay, using **exponential backoff** to produce delays that
+are exponentially longer each time balances the desire to get a quick success
+with the desire to give the service more time to recover. In particular, making
+the backoff exponential instead of linear (for example, 1->2->4->8 instead of
+1->2->3->4) results in the first few attempts still happening relatively
+quickly. This means that temporary issues with the network won't delay requests
+much. If attempts keep failing, the exponentially increasing backoff gives a
+struggling service more time to recover.
+
+Exponential backoff doesn't solve all problems. If a large number of clients
+make a request at the same time, the service might struggle to respond to any of
+them. If they all retry at the same time, this might make the problem worse.
+Exponential backoff does not prevent this since all the clients will be using
+it, and so the problem will re-occur. To solve this problem, we add a random
+factor, known as **jitter**, to the delay. This results in a smoother load,
+preventing another flood of requests and making it easier for the service to
+recover.
+
+The combination of these strategies is known as **exponential backoff with
+jitter**, and is relatively common.
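+
+The delay computation itself is small. The following sketch shows one common
+variant, "full jitter", where the delay is drawn uniformly between zero and an
+exponentially growing ceiling. The base and cap values are illustrative, not
+prescriptive.
+
+```java
+// A minimal full-jitter backoff sketch. The base delay and cap are
+// illustrative values, not recommendations.
+static Duration backoffWithJitter(int attempts) {
+    double baseSeconds = 1.0;
+    double capSeconds = 20.0;
+
+    // Grow the ceiling exponentially with each attempt, up to the cap.
+    double ceiling = Math.min(capSeconds, baseSeconds * Math.pow(2, attempts));
+
+    // Draw the actual delay uniformly from [0, ceiling) seconds.
+    return Duration.ofMillis((long) (Math.random() * ceiling * 1000));
+}
+```
+
+The example strategy at the end of this guide uses the same calculation.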
+
+More advanced retry behavior may be implemented by using a
+[token bucket](https://en.wikipedia.org/wiki/Token_bucket) on top of exponential
+backoff with jitter to dynamically adjust retry behavior in response to changing
+service conditions.
+
+In short: if an attempt fails, a retry is only performed if a whole token can be
+removed from the bucket. If an attempt succeeds, a fraction of a token is put
+into the bucket. When the bucket is empty, no retries will be performed. This
+means that the total number of requests to a service will drop significantly as
+the service degrades, improving its ability to recover. The bucket fills back up
+as the service failure rate goes down, once again allowing retries to be
+performed. In AWS SDKs, this is how the `standard` retry mode works.
+
+These are only a few possible retry behaviors a client may have, but they
+demonstrate some of the potential needs of a retry system.
+
+### Example retry strategy
+
+The following is an example retry strategy that implements exponential backoff
+with jitter alongside a token bucket. This strategy adds extra cost for timeout
+errors since they may indicate a more degraded service.
+
+Aside from the delay, the retry token also tracks the number of attempts that
+have been made. This is necessary because this strategy imposes a maximum
+attempt count, and also because the delay is calculated in part based on how
+many attempts have been made.
+
+```java
+public record AwsStandardRetryToken(int attempts, Duration delay) implements RetryToken {
+}
+```
+
+```java
+public final class AwsStandardRetryStrategy implements RetryStrategy {
+    // These values are not prescriptive. They are static in this example for
+    // the sake of simplicity, but making them configurable is ideal.
+    private static final int RETRY_COST = 5;
+    private static final int TIMEOUT_COST = 10;
+    private static final int SUCCESS_REFUND = 1;
+
+    private static final int MAX_ATTEMPTS = 5;
+    private static final int MAX_BACKOFF = 20;
+    private static final int MAX_CAPACITY = 500;
+
+    // The token bucket is integrated into this retry strategy in this example,
+    // but in a real client it may be better to have it be its own type so
+    // that it can be shared, and so that managing concurrency is simpler.
+    private int tokens = MAX_CAPACITY;
+
+    // Be careful to consider concurrency when designing retry strategies.
+    // When there are multiple threads accessing the token bucket, proper
+    // synchronization is essential to prevent race conditions.
+    private final Object tokensLock = new Object();
+
+    @Override
+    public RetryToken acquireInitialToken() {
+        // This returns successfully even if the token bucket is empty. This is
+        // because an initial attempt will always be performed anyway, and
+        // returning successfully here will ensure that the retry strategy is
+        // checked if that initial attempt fails. By that point, the token
+        // bucket may no longer be empty.
+        return new AwsStandardRetryToken(0, null);
+    }
+
+    @Override
+    public RetryToken refreshRetryToken(RetryToken token, Throwable failure) {
+        // First, ensure that the provided token is of the correct type.
+        if (!(token instanceof AwsStandardRetryToken standardToken)) {
+            throw new IllegalArgumentException("Invalid token provided for refresh.");
+        }
+
+        // Next, check to see if the maximum number of attempts has already
+        // been exceeded.
+        if (standardToken.attempts() >= MAX_ATTEMPTS) {
+            throw new TokenAcquisitionFailedException("Max attempts exhausted.");
+        }
+
+        // Examine the exception thrown by the operation to determine an
+        // appropriate delay, if any. (Pattern matching for switch requires
+        // Java 21.)
+        return switch (failure) {
+            // If the exception thrown by the operation includes retryability
+            // information, use that to inform retry behavior.
+            case RetryInfo retryInfo when retryInfo.isRetrySafe() != RetrySafety.NO -> {
+                // Attempt to consume tokens from the token bucket to "pay"
+                // for the retry.
+                consumeTokens(retryInfo.isTimeout());
+                yield backoff(standardToken, retryInfo.retryAfter());
+            }
+
+            // If the exception does not have retry info, but does have more
+            // general error info, that can also be used. This assumes that
+            // a server error is likely retryable and that a client error
+            // likely is not.
+            case ErrorInfo errorInfo when errorInfo.fault() == ErrorFault.SERVER -> {
+                consumeTokens(false);
+                yield backoff(standardToken);
+            }
+            default -> throw new TokenAcquisitionFailedException("Exception not retryable.");
+        };
+    }
+
+    /**
+     * Consumes tokens to "pay" for a retry.
+     *
+     * @param isTimeout whether the retry is in response to a timeout error,
+     * which will require more tokens.
+     *
+     * @throws TokenAcquisitionFailedException if there are not enough tokens
+     * in the bucket to pay for the retry.
+     */
+    private void consumeTokens(boolean isTimeout) {
+        synchronized (tokensLock) {
+            int cost = isTimeout ? TIMEOUT_COST : RETRY_COST;
+
+            if (this.tokens < cost) {
+                throw new TokenAcquisitionFailedException("Token bucket exhausted.");
+            }
+
+            this.tokens -= cost;
+        }
+    }
+ */ + private AwsStandardRetryToken backoff(AwsStandardRetryToken token) { + return new AwsStandardRetryToken(token.attempts + 1, computeDelay(token.attempts)); + } + + /** + * Computes a backoff with exponential backoff and jitter, capped at 20 seconds. + * + * @param token the previous token. + * @param suggested the delay suggested by the service, which will serve as + * the minimum delay. + */ + private AwsStandardRetryToken backoff(AwsStandardRetryToken token, Duration suggested) { + // Compute the backoff as normal. If it is longer than the suggested + // backoff from the service, use it. Otherwise, use the suggested + // backoff. + Duration computedDelay = computeDelay(token.attempts); + Duration finalDelay = computedDelay.toMillis() < suggested.toMillis() ? suggested : computedDelay; + return new AwsStandardRetryToken(token.attempts + 1, finalDelay); + } + + /** + * Computes the delay with exponential backoff and jitter, capped at 20 seconds. + * + * @param attempts the number of attempts made so far. + * @return the computed delay duration. + */ + private Duration computeDelay(int attempts) { + // First compute the exponential backoff. + double backoff = Math.pow(2, attempts); + + // Next, cap it at 20 seconds. + backoff = Math.min(backoff, MAX_BACKOFF); + + // Finally, add jitter and expand to milliseconds. + double backoffMillis = Math.random() * backoff * 1000; + return Duration.ofMilliseconds((long) backoffMillis); + } + + @Override + public void recordSuccess(RetryToken token) { + synchronized (tokensLock) { + // When a successful request is made, refill the token bucket unless it + // is already at maximum capacity. + if (this.tokens <= MAX_CAPACITY - SUCCESS_REFUND) { + this.tokens += SUCCESS_REFUND; + } + } + } +} +```