From b23a9aa2c185e6b51954939433395a1cbc1979a2 Mon Sep 17 00:00:00 2001 From: JordonPhillips Date: Mon, 2 Feb 2026 20:28:49 +0100 Subject: [PATCH 1/3] Add client retry guidance --- .../guides/client-guidance/retries.md | 353 ++++++++++++++++++ 1 file changed, 353 insertions(+) create mode 100644 docs/source-2.0/guides/client-guidance/retries.md diff --git a/docs/source-2.0/guides/client-guidance/retries.md b/docs/source-2.0/guides/client-guidance/retries.md new file mode 100644 index 00000000000..cf2f6449aaf --- /dev/null +++ b/docs/source-2.0/guides/client-guidance/retries.md @@ -0,0 +1,353 @@ +# Retrying Requests + +Operation requests might fail for a number of reasons that are unrelated to the +input parameters, such as a transient network issue or excessive load on the +service. This guide gives recommendations on how Smithy clients can +automatically retry failed requests in those cases, and how a robust system for +pluggable retry strategies can be implemented. + +## Why is a retry system recommended? + +If transient failures are surfaced to Smithy client users, they will likely +implement their own retry behavior. This hand-written retry behavior may not +include important risk mitigations that reduce the impact of outages. + +When a service begins to degrade, clients may notice behavior changes, such as +an elevated number of server errors or connection failures. Users will naturally +want to retry these failed requests. Unfortunately, these retries have the +effect of increasing the load on the service, which may cause the service to +further degrade, resulting in even more retries. + +These sorts of events are called **retry storms**, and are often the result of +poorly managed retry behavior. While there is no perfect retry behavior, +strategies will inevitably improve over time as the scale of systems grows and +new cascading failure conditions are observed. The right interface reflecting +the problem domain can make sure the right extension points are available for +future expansion. + +## Retry behaviors + +The most basic retry behavior is a simple loop with no delays between attempts. +This is the most likely behavior to contribute to retry storms, but a simple +delay between attempts can be just as bad because it can result in spikes of +requests from the same system. + +Instead of a fixed delay, using **exponential backoff** to produce delays that +are longer each time balances the desire to get a quick success with the desire +to give the service more time to recover. Adding some randomness to that delay +(known as **jitter**) can result in a smoother request load. This strategy, +called **exponential backoff with jitter**, is relatively common. In AWS SDKs +this is the `standard` retry mode. + +More advanced retry behavior may be implemented by using a +[token bucket](https://en.wikipedia.org/wiki/Token_bucket) on top of exponential +backoff with jitter to dynamically adjust retry behavior in response to changing +service conditions. In short: if an attempt succeeds, some fraction of a token +is dropped in the bucket. If an attempt fails, a retry is only performed if a +whole token can be removed from the bucket. This results in the total number of +requests to a service dropping as the service degrades, improving its ability to +recover. When the service recovers, the token bucket fills back up and load +returns to normal. In AWS SDKs, this is the `adaptive` retry mode. + +These are only a few possible retry behaviors a client may have, but they +demonstrate some of the potential needs of the retry system. + +## Retry interfaces + +It is recommended to implement retry behavior in a `RetryStrategy` that produces +`RetryToken`s to pass state between attempts. Passing state through tokens in +this way allows the `RetryStrategy` implementation to be isolated from the state +of an individual request. + +### Retry token + +The retry token itself should indicate a delay to wait before the next attempt +is made, but it may otherwise contain any state that is necessary for the retry +strategy. + +```java +public interface RetryToken { + /** + * @return the duration to wait until the next attempt is made. + */ + public Duration delay(); +} +``` + +### Retry strategy + +The `RetryStrategy` creates and refreshes retry tokens. When an attempt fails, +the retry strategy is passed the retry token for the attempt and given the +exception raised by the attempt. If a failed attempt may be retried, the retry +strategy will return a refreshed retry token to use for the next attempt. If a +failed attempt may not be retried, it throws an exception. If a request +succeeds, the retry strategy is given the token so that it may free up any +resources it was using. + +```java +public interface RetryStrategy { + /** + * Invoked before the first request attempt. + * + * @throws TokenAcquisitionFailedException if a token cannot be acquired. + */ + RetryToken acquireInitialToken(); + + /** + * Invoked before each subsequent (non-first) request attempt. + * + * @throws IllegalArgumentException if the provided token was not issued by + * this strategy or the provided token was already used for a previous + * refresh or success call. + * + * @throws TokenAcquisitionFailedException if a token cannot be acquired. + */ + RetryToken refreshRetryToken(RetryToken token, Throwable failure); + + /** + * Invoked after an attempt succeeds. + * + * @throws IllegalArgumentException if the provided token was not issued by + * this strategy or the provided token was already used for a previous + * refresh or success call. + */ + void recordSuccess(RetryToken token); +} +``` + +### Retryable errors + +Request attempts can throw many different types of exceptions and it is not +reasonable to expect retry strategy implementations to be aware of them all. For +example, different HTTP clients may expose response status codes in different +ways. Beyond that, a Smithy client may not even be using HTTP as a transport. It +is therefore recommended to use an interface to standardize the information that +is relevant to retry strategies. Retry strategies can then use that information +if it is available while still attempting to handle exceptions that don't have +that information available. + +In particular, exceptions should indicate: + +* Whether they are safe to retry. For example, a failure to connect to the + service or a temporary service error may be retryable while an error + relating to invalid input or authentication failure is not retryable. +* Whether they are a throttling error. That is, whether they are an error + returned by a service specifically to indicate that too many requests have + been made recently. For HTTP protocols, for example, a `429` status code + indicates this. +* A minimum time to wait until the next request, if that information is + available. For HTTP protocols, for example, this could be indicated by the + [`Retry-After`](https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/Retry-After) + header. A retry strategy may choose a delay that is longer than this, but + should not choose a delay that is shorter than this since it is information + directly from the service. + +```java +/** + * Provides retry-specific information about an error. + */ +public interface RetryInfo { + /** + * Get the decision about whether it's safe to retry the encountered error. + * + *

If the decision is {@link RetrySafety#YES}, it does not mean that a + * retry will occur, but rather that a retry is allowed to occur. + * + * @return whether it's safe to retry. + */ + RetrySafety isRetrySafe(); + + /** + * Check if the error is a throttling error. + * + * @return the error type. + */ + boolean isThrottle(); + + /** + * Get the amount of time to wait before retrying. + * + * @return the time to wait before retrying, or null if no hint for a + * retry-after was detected. + */ + Duration retryAfter(); + + /** + * Whether it's safe to retry. + */ + public enum RetrySafety { + /** + * Yes, it is safe to retry this error. + */ + YES, + + /** + * No, a retry should not be made because it isn't safe to retry. + */ + NO, + + /** + * Not enough information is available to determine if a retry is safe. + */ + MAYBE + } +} +``` + +Additionally, Smithy's [`error` trait](#error-trait) indicates whether an error +is the fault of the client or the server. It is highly recommended to include +this information in code-generated exception classes. It is also recommended to +allow this information to be provided for other kinds of exceptions. For +example, a 400-level status code in an HTTP response indicates a client error, +while a 500-level status code indicates a server error. This is shown below as a +separate interface since it is relevant to more than just retries. + +```java +public interface ErrorInfo { + /** + * Indicates if a client or server is at fault for the error, if known. + */ + ErrorFault fault(); + + enum ErrorFault { + /** + * The client is at fault for this error (e.g., it omitted a required + * parameter or sent an invalid request). + */ + CLIENT, + + /** + * The server is at fault (e.g., it was unable to connect to a + * database, or other unexpected errors occurred). + */ + SERVER, + + /** + * The fault isn't necessarily client or server. + */ + OTHER; + } +} +``` + +Errors that are defined in the Smithy model should have all the properties of +`RetryInfo` statically generated or settable. For example, the following +demonstrates a modeled error and what it might look like as a generated Java +exception class. + +```smithy +@httpError(429) +@error("client") +@retryable(throttling: true) +structure ThrottlingError { + message: String +} +``` + +```java +public class RetryAfterException extends RuntimeException implements ErrorInfo, RetryInfo { + private final String message; + + private Duration retryAfter = null; + + public RetryAfterException(String message) { + super(message) + this.message = message; + } + + // This value is code generated based on the error trait. + public ErrorFault fault() { + return ErrorFault.CLIENT; + } + + // This value is code generated based on the retryable trait. If that trait + // is not present, this should be NO for "client" errors or MAYBE for + // "server" errors. + public RetrySafety isRetrySafe() { + return RetrySafety.YES; + } + + // This value is code generated based on the retryable trait. + public boolean isThrottle() { + return true; + } + + public Duration retryAfter() { + return this.retryAfter; + } + + // This would be called by the deserializer. For HTTP protocols, for + // instance, it would be called if a `retry-after` header is present in the + // response. + public void setRetryAfter(Duration duration) { + this.retryAfter = duration; + } + + // This is the modeled message property. + public String message() { + return this.message; + } +} +``` + +## Example request loop + +The following is a simplified example of what it looks like to use a +`RetryStrategy` to implement a retryable request loop. + +```java +/** + * A simplified example of what a retryable request loop looks like. + * + * @param serializedRequest a request that has been fully serialized and is + * ready to send. + * + * @return a successful result. + */ +public Result request(SerializedRequest serializedRequest) { + // First acquire the initial retry token. If a token cannot be acquired, + // make only one attempt without retries. + RetryToken retryToken; + try { + retryToken = this.retryStrategy.acquireInitialToken(); + } catch (TokenAcquisitionFailedException e) { + return send(serializedRequest); + } + + // Make attempts until the request succeeds or the retry strategy throws + // an exception. Notably, each retry strategy is responsible for controlling + // the maximum number of attempts. + while (true) { + + // Wait for the indicated delay duration. Even the initial token may + // include a delay. + Thread.sleep(retryToken.delay()); + + Result result = null; + try { + result = send(serializedRequest); + } catch (Exception e) { + // Otherwise attempt to refresh the token. + try { + retryToken = this.retryStrategy.refreshRetryToken(retryToken, e); + } catch (TokenAcquisitionFailedException retryError) { + // If the token can't be acquired, the request fails, so the + // original exception needs to be propagated. Logging the reason + // the retry failed is advisable. + throw e; + } + } + + // If the result was successful, inform the retry strategy. This allows + // it to free up any resources if necessary. + if (result != null) { + this.retryStrategy.recordSuccess(retryToken); + return result; + } + } +} +``` + +Note that this code does not attempt to inspect the exceptions. It instead +passes them directly to the retry strategy, which then handles any information +in the exception that is relevant to it. From d291ad04371dbd6a1bf961708bcb6e6f0a973d14 Mon Sep 17 00:00:00 2001 From: JordonPhillips Date: Fri, 13 Feb 2026 11:47:03 +0100 Subject: [PATCH 2/3] Add example RetryStrategy implementation --- .../guides/client-guidance/index.md | 1 + .../guides/client-guidance/retries.md | 395 ++++++++++++++---- 2 files changed, 305 insertions(+), 91 deletions(-) diff --git a/docs/source-2.0/guides/client-guidance/index.md b/docs/source-2.0/guides/client-guidance/index.md index 35f8f914fde..d00242125ad 100644 --- a/docs/source-2.0/guides/client-guidance/index.md +++ b/docs/source-2.0/guides/client-guidance/index.md @@ -59,4 +59,5 @@ Smithy clients should follow these tenets: application-protocols/index context +retries ``` diff --git a/docs/source-2.0/guides/client-guidance/retries.md b/docs/source-2.0/guides/client-guidance/retries.md index cf2f6449aaf..e85b33d2498 100644 --- a/docs/source-2.0/guides/client-guidance/retries.md +++ b/docs/source-2.0/guides/client-guidance/retries.md @@ -25,44 +25,20 @@ new cascading failure conditions are observed. The right interface reflecting the problem domain can make sure the right extension points are available for future expansion. -## Retry behaviors - -The most basic retry behavior is a simple loop with no delays between attempts. -This is the most likely behavior to contribute to retry storms, but a simple -delay between attempts can be just as bad because it can result in spikes of -requests from the same system. - -Instead of a fixed delay, using **exponential backoff** to produce delays that -are longer each time balances the desire to get a quick success with the desire -to give the service more time to recover. Adding some randomness to that delay -(known as **jitter**) can result in a smoother request load. This strategy, -called **exponential backoff with jitter**, is relatively common. In AWS SDKs -this is the `standard` retry mode. - -More advanced retry behavior may be implemented by using a -[token bucket](https://en.wikipedia.org/wiki/Token_bucket) on top of exponential -backoff with jitter to dynamically adjust retry behavior in response to changing -service conditions. In short: if an attempt succeeds, some fraction of a token -is dropped in the bucket. If an attempt fails, a retry is only performed if a -whole token can be removed from the bucket. This results in the total number of -requests to a service dropping as the service degrades, improving its ability to -recover. When the service recovers, the token bucket fills back up and load -returns to normal. In AWS SDKs, this is the `adaptive` retry mode. - -These are only a few possible retry behaviors a client may have, but they -demonstrate some of the potential needs of the retry system. - ## Retry interfaces -It is recommended to implement retry behavior in a `RetryStrategy` that produces -`RetryToken`s to pass state between attempts. Passing state through tokens in -this way allows the `RetryStrategy` implementation to be isolated from the state -of an individual request. +It is recommended to expose retry interfaces that aren't coupled to a particular +implementation or protocol. It is recommended to have a `RetryStrategy` that is +isolated from the state of individual requests alongside `RetryToken`s to +capture that state and pass it between attempts. ### Retry token -The retry token itself should indicate a delay to wait before the next attempt -is made, but it may otherwise contain any state that is necessary for the retry +A `RetryToken` is a bundle of state that is created and passed between attempts +of a single request. It should indicate how long to wait until the next attempt, +but should allow each implementation to include whatever state they find +necessary. This could include the number of attempts that have been made, an +identifier for the request, or anything else that is necessary for the retry strategy. ```java @@ -70,19 +46,15 @@ public interface RetryToken { /** * @return the duration to wait until the next attempt is made. */ - public Duration delay(); + Duration delay(); } ``` ### Retry strategy -The `RetryStrategy` creates and refreshes retry tokens. When an attempt fails, -the retry strategy is passed the retry token for the attempt and given the -exception raised by the attempt. If a failed attempt may be retried, the retry -strategy will return a refreshed retry token to use for the next attempt. If a -failed attempt may not be retried, it throws an exception. If a request -succeeds, the retry strategy is given the token so that it may free up any -resources it was using. +A `RetryStrategy` is where the logic of computing delays and determining if a +request should be retried lives. It encapsulates the state of a request in retry +tokens, which it creates and refreshes. ```java public interface RetryStrategy { @@ -115,6 +87,87 @@ public interface RetryStrategy { } ``` +:::{note} + +While the state of a request is intended to be included in the retry token, a +retry strategy may still need to manage some state that is shared across the +client. Be careful to ensure that access to that state is synchronized in order +to prevent race conditions. +::: + +#### Using retry strategies + +An initial retry token should be acquired at the beginning of a request, before +the first attempt is made. If an initial token cannot be acquired, the client +should still make an attempt. + +If an attempt fails, the retry strategy is passed the retry token for the +attempt and given the exception raised by the attempt. If the retry strategy +determines that the failed attempt may be retried, it will return a refreshed +retry token to use for the next attempt. If the retry strategy determines that +the failed attempt may not be retried, it throws an exception. + +If the request succeeds, the retry strategy is given the token so that it may +free up any resources it was using. + +The following is a simplified example of what it looks like to use the +`RetryStrategy` interface to implement a retryable request loop. + +```java +/** + * A simplified example of what a retryable request loop looks like. + * + * @param serializedRequest a request that has been fully serialized and is + * ready to send. + * + * @return a successful result. + */ +public Result request(SerializedRequest serializedRequest) { + // First acquire the initial retry token. If a token cannot be acquired, + // make only one attempt without retries. + RetryToken retryToken; + try { + retryToken = this.retryStrategy.acquireInitialToken(); + } catch (TokenAcquisitionFailedException e) { + return send(serializedRequest); + } + + // Make attempts until the request succeeds or the retry strategy throws + // an exception. Notably, each retry strategy is responsible for controlling + // the maximum number of attempts. + while (true) { + + // Wait for the indicated delay duration. Even the initial token may + // include a delay. + if (retryToken.delay() != null) { + Thread.sleep(retryToken.delay()); + } + + Result result = null; + try { + result = send(serializedRequest); + } catch (Exception e) { + // If the request fails, attempt to refresh the retry token. + try { + retryToken = this.retryStrategy.refreshRetryToken(retryToken, e); + } catch (TokenAcquisitionFailedException retryError) { + // If the token can't be refreshed, the request fails, so the + // original exception needs to be propagated. Logging the reason + // the retry failed is advisable. + throw e; + } + } + + // If the result was successful, inform the retry strategy. This allows + // it to free up any resources if necessary. + if (result != null) { + this.retryStrategy.recordSuccess(retryToken); + return result; + } + } +} +``` + ### Retryable errors Request attempts can throw many different types of exceptions and it is not @@ -135,6 +188,9 @@ In particular, exceptions should indicate: returned by a service specifically to indicate that too many requests have been made recently. For HTTP protocols, for example, a `429` status code indicates this. +* Whether they are a timeout error. That is, whether the error is a result of + the service not responding within the transport client's defined timeout + limit. For HTTP protocols, a `504` status code could also indicate this. * A minimum time to wait until the next request, if that information is available. For HTTP protocols, for example, this could be indicated by the [`Retry-After`](https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/Retry-After) @@ -162,7 +218,18 @@ public interface RetryInfo { * * @return the error type. */ - boolean isThrottle(); + default boolean isThrottle() { + return false; + } + + /** + * Check if the error is a timeout error. + * + * @return the error type. + */ + default boolean isTimeout() { + return false; + } /** * Get the amount of time to wait before retrying. @@ -170,7 +237,9 @@ public interface RetryInfo { * @return the time to wait before retrying, or null if no hint for a * retry-after was detected. */ - Duration retryAfter(); + default Duration retryAfter() { + return null; + } /** * Whether it's safe to retry. @@ -251,7 +320,7 @@ public class RetryAfterException extends RuntimeException implements ErrorInfo, private Duration retryAfter = null; public RetryAfterException(String message) { - super(message) + super(message); this.message = message; } @@ -290,64 +359,208 @@ public class RetryAfterException extends RuntimeException implements ErrorInfo, } ``` -## Example request loop +## Retry behaviors + +The most basic retry behavior is a simple loop with no delays between attempts. +This is the most likely behavior to contribute to retry storms, but a simple +delay between attempts can be just as bad because it can result in spikes of +requests from the same system. + +Instead of a fixed delay, using **exponential backoff** to produce delays that +are exponentially longer each time balances the desire to get a quick success +with the desire to give the service more time to recover. In particular, making +the backoff exponential instead of linear (for example, 1->2->4->8 instead of +1->2->3->4) results in the first few attempts still happening relatively +quickly. This means that temporary issues with the network won't delay requests +much. If attempts keep failing, the exponentially increasing backoff gives a +struggling service more time to recover. + +Exponential backoff doesn't solve all problems. If a large number of clients +make a request at the same time, the service might struggle to respond to any of +them. If they all retry at the same time, this might make the problem worse. +Exponential backoff does not prevent this since all the clients will be using +it, and so the problem will re-occur. To solve this problem, we add a random +factor to the delay known as **jitter**. This results in a smoother load, +preventing another flood of requests and making it easier for the service to +recover. + +The combination of these strategies is known as **exponential backoff with +jitter**, and is relatively common. + +More advanced retry behavior may be implemented by using a +[token bucket](https://en.wikipedia.org/wiki/Token_bucket) on top of exponential +backoff with jitter to dynamically adjust retry behavior in response to changing +service conditions. + +In short: if an attempt fails, a retry is only performed if a whole token can be +removed from the bucket. If an attempt succeeds, a fraction of a token is put +into the bucket. When the bucket is empty, no retries will be performed. This +means that the total number of requests to a service will drop significantly as +the service degrades, improving its ability to recover. The bucket fills back up +as the service failure rate goes down, once again allowing retries to be +performed. In AWS SDKs, this is how the `standard` retry mode works. + +These are only a few possible retry behaviors a client may have, but they +demonstrate some of the potential needs of a retry system. + +### Example retry strategy -The following is a simplified example of what it looks like to use a -`RetryStrategy` to implement a retryable request loop. +The following is an example retry strategy that implements exponential backoff +with jitter alongside a token bucket. This strategy adds extra cost for timeout +errors since they may indicate a more highly degraded service. + +Aside from delay, the retry token also tracks the number of attempts that have +been made. This is necessary because this strategy imposes a maximum attempt +count, and also because the delay is calculated in part based on how many +attempts have been made. ```java -/** - * A simplified example of what a retryable request loop looks like. - * - * @param serializedRequest a request that has been fully serialized and is - * ready to send. - * - * @return a successful result. - */ -public Result request(SerializedRequest serializedRequest) { - // First acquire the initial retry token. If a token cannot be acquired, - // make only one attempt without retries. - RetryToken retryToken; - try { - retryToken = this.retryStrategy.acquireInitialToken(); - } catch (TokenAcquisitionFailedException e) { - return send(serializedRequest); +public record AwsStandardRetryToken(int attempts, Duration delay) implements RetryToken { +} +``` + +```java +public final class AwsStandardRetryStrategy implements RetryStrategy { + // These values are not prescriptive. They are static in this example for the + // sake of simplicity, but making them configurable is ideal. + private static final int RETRY_COST = 5; + private static final int TIMEOUT_COST = 10; + private static final int SUCCESS_REFUND = 1; + + private static final int MAX_ATTEMPTS = 5; + private static final int MAX_BACKOFF = 20; + private static final int MAX_CAPACITY = 500; + + // The token bucket is integrated into this retry strategy in this example, + // but in a real client it may be better to have it be its own type so + // that it can be shared, and so that managing concurrency is simpler. + private int tokens = MAX_CAPACITY; + + // Be careful to consider concurrency when designing retry strategies. + // When there are multiple threads accessing the token bucket, proper + // synchronization is essential to prevent race conditions. + private final Object tokensLock = new Object(); + + @Override + public RetryToken acquireInitialToken() { + // This returns successfully even if the token bucket is empty. This is + // because an initial attempt will always be performed anyway, and + // returning successfully here will ensure that the retry strategy is + // checked if that initial attempt fails. By that point, the token bucket + // may no longer be empty. + return new AwsStandardRetryToken(0, null); } - // Make attempts until the request succeeds or the retry strategy throws - // an exception. Notably, each retry strategy is responsible for controlling - // the maximum number of attempts. - while (true) { + @Override + public RetryToken refreshRetryToken(RetryToken token, Throwable failure) { + // First, ensure that the provided token is of the correct type. + if (!(token instanceof AwsStandardRetryToken standardToken)) { + throw new IllegalArgumentException("Invalid token provided for refresh."); + } - // Wait for the indicated delay duration. Even the initial token may - // include a delay. - Thread.sleep(retryToken.delay()); + // Next, check to see if the maximum number of attempts has already + // been exceeded. + if (standardToken.attempts >= MAX_ATTEMPTS) { + throw new TokenAcquisitionFailedException("Max attempts exhausted."); + } - Result result = null; - try { - result = send(serializedRequest); - } catch (Exception e) { - // Otherwise attempt to refresh the token. - try { - retryToken = this.retryStrategy.refreshRetryToken(retryToken, e); - } catch (TokenAcquisitionFailedException retryError) { - // If the token can't be acquired, the request fails, so the - // original exception needs to be propagated. Logging the reason - // the retry failed is advisable. - throw e; + // Examine the exception thrown by the operation to determine an + // appropriate delay, if any. + return switch (failure) { + // If the exception thrown by the operation includes retryability + // information, use that to inform retry behavior. + case RetryInfo retryInfo when retryInfo.isRetrySafe() != RetrySafety.NO -> { + // Attempt to consume tokens from the token bucket to "pay" + // for the retry. + consumeTokens(retryInfo.isTimeout()); + yield backoff(standardToken, retryInfo.retryAfter()); } + + // If the exception does not have retry info, but does have more + // general error info, that can also be used. This assumes that + // a server error is likely retryable and that a client error + // likely is not. + case ErrorInfo errorInfo when errorInfo.fault() == ErrorFault.SERVER -> { + consumeTokens(false); + yield backoff(standardToken); + } + default -> throw new TokenAcquisitionFailedException("Exception not retryable."); + }; + } + + /** + * Consumes tokens to "pay" for a retry. + * + * @param isTimeout whether the retry is in response to a timeout error, + * which will require more tokens. + * + * @throws TokenAcquisitionFailedException if there are not enough tokens + * in the bucket to pay for the retry. + */ + private void consumeTokens(boolean isTimeout) { + synchronized (tokensLock) { + int cost = isTimeout ? TIMEOUT_COST : RETRY_COST; + + if (this.tokens < cost) { + throw new TokenAcquisitionFailedException("Token bucket exhausted."); + } + + this.tokens -= cost; } + } - // If the result was successful, inform the retry strategy. This allows - // it to free up any resources if necessary. - if (result != null) { - this.retryStrategy.recordSuccess(retryToken); - return result; + /** + * Computes a backoff with exponential backoff and jitter, capped at 20 seconds. + * + * @param token the previous token. + */ + private AwsStandardRetryToken backoff(AwsStandardRetryToken token) { + return new AwsStandardRetryToken(token.attempts + 1, computeDelay(token.attempts)); + } + + /** + * Computes a backoff with exponential backoff and jitter, capped at 20 seconds. + * + * @param token the previous token. + * @param suggested the delay suggested by the service, which will serve as + * the minimum delay. + */ + private AwsStandardRetryToken backoff(AwsStandardRetryToken token, Duration suggested) { + // Compute the backoff as normal. If it is longer than the suggested + // backoff from the service, use it. Otherwise, use the suggested + // backoff. + Duration computedDelay = computeDelay(token.attempts); + Duration finalDelay = computedDelay.toMillis() < suggested.toMillis() ? suggested : computedDelay; + return new AwsStandardRetryToken(token.attempts + 1, finalDelay); + } + + /** + * Computes the delay with exponential backoff and jitter, capped at 20 seconds. + * + * @param attempts the number of attempts made so far. + * @return the computed delay duration. + */ + private Duration computeDelay(int attempts) { + // First compute the exponential backoff. + double backoff = Math.pow(2, attempts); + + // Next, cap it at 20 seconds. + backoff = Math.min(backoff, MAX_BACKOFF); + + // Finally, add jitter and expand to milliseconds. + double backoffMillis = Math.random() * backoff * 1000; + return Duration.ofMilliseconds((long) backoffMillis); + } + + @Override + public void recordSuccess(RetryToken token) { + synchronized (tokensLock) { + // When a successful request is made, refill the token bucket unless it + // is already at maximum capacity. + if (this.tokens < MAX_CAPACITY) { + this.tokens += SUCCESS_REFUND; + } } } } ``` - -Note that this code does not attempt to inspect the exceptions. It instead -passes them directly to the retry strategy, which then handles any information -in the exception that is relevant to it. From 6a02b98cc259baf7ff60f0da660c780a620aa6c6 Mon Sep 17 00:00:00 2001 From: JordonPhillips Date: Tue, 17 Feb 2026 14:33:47 +0100 Subject: [PATCH 3/3] Add explanation for why an attempt is always made --- docs/conf.py | 6 ++++++ docs/source-2.0/guides/client-guidance/retries.md | 13 ++++++++++--- 2 files changed, 16 insertions(+), 3 deletions(-) diff --git a/docs/conf.py b/docs/conf.py index a3fccb8c5c5..105906d9d4e 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -24,6 +24,12 @@ smartquotes = False nitpicky = True +# -- Markdown configuration ----------------------------------------------- + +myst_enable_extensions = [ + "colon_fence" +] + # -- Options for HTML output ---------------------------------------------- html_theme = "furo" diff --git a/docs/source-2.0/guides/client-guidance/retries.md b/docs/source-2.0/guides/client-guidance/retries.md index e85b33d2498..6242943baa8 100644 --- a/docs/source-2.0/guides/client-guidance/retries.md +++ b/docs/source-2.0/guides/client-guidance/retries.md @@ -61,6 +61,11 @@ public interface RetryStrategy { /** * Invoked before the first request attempt. * + *

An initial request will be made even if a token cannot be acquired. + * It is recommended to always return a token so that the retry strategy + * may continue to track each request and be informed when they succeed or + * fail. + * * @throws TokenAcquisitionFailedException if a token cannot be acquired. */ RetryToken acquireInitialToken(); @@ -99,7 +104,9 @@ to prevent race conditions. An initial retry token should be acquired at the beginning of a request, before the first attempt is made. If an initial token cannot be acquired, the client -should still make an attempt. +should still make an attempt. This initial attempt is always made because the +purpose of the retry strategy is to manage retries, not to gate access to a +service entirely. If an attempt fails, the retry strategy is passed the retry token for the attempt and given the exception raised by the attempt. If the retry strategy @@ -407,7 +414,7 @@ demonstrate some of the potential needs of a retry system. The following is an example retry strategy that implements exponential backoff with jitter alongside a token bucket. This strategy adds extra cost for timeout -errors since they may indicate a more highly degraded service. +errors since they may indicate a more degraded service. Aside from delay, the retry token also tracks the number of attempts that have been made. This is necessary because this strategy imposes a maximum attempt @@ -557,7 +564,7 @@ public final class AwsStandardRetryStrategy implements RetryStrategy { synchronized (tokensLock) { // When a successful request is made, refill the token bucket unless it // is already at maximum capacity. - if (this.tokens < MAX_CAPACITY) { + if (this.tokens <= MAX_CAPACITY - SUCCESS_REFUND) { this.tokens += SUCCESS_REFUND; } }