
HTTP/2 GOAWAY under certain kinds of load returned to user as error--not retried #1402

@bddap

Describe the bug

Useful context:

The issue

When concurrently calling a lambda through aws-sdk-lambda, DispatchFailure errors are returned to user code at seemingly regular intervals.

HTTP/2 GOAWAY errors not retried, causing DispatchFailure under sustained load

Summary

When making sustained high-concurrency requests to AWS APIs using the AWS SDK for Rust, the SDK intermittently fails with DispatchFailure errors caused by HTTP/2 GOAWAY frames.
The SDK does not retry these errors automatically; instead, the failures propagate to application code.

DispatchFailure(DispatchFailure { source: ConnectorError { kind: Io,
  source: hyper_util::client::legacy::Error(SendRequest,
    hyper::Error(Http2, Error { kind: GoAway(b"", NO_ERROR, Remote) })),
  connection: Unknown } })
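
To tell this particular failure apart from other dispatch failures, one option is to walk the error's `source()` chain and look for the GOAWAY marker. The sketch below is a heuristic that uses only the standard library and matches on the `Debug` text seen in the error above; `chain_mentions_goaway` and the `Layer` stand-in type are hypothetical names for illustration, not part of the SDK.

```rust
use std::error::Error;
use std::fmt;

/// Heuristic: true if any error in the `source()` chain has a Debug
/// representation containing "GoAway". String matching is brittle and is
/// an assumption of this sketch, not an SDK-provided check.
fn chain_mentions_goaway(err: &(dyn Error + 'static)) -> bool {
    let mut current: Option<&(dyn Error + 'static)> = Some(err);
    while let Some(e) = current {
        if format!("{e:?}").contains("GoAway") {
            return true;
        }
        current = e.source();
    }
    false
}

/// Hypothetical stand-in for a layered transport error, used only to
/// exercise the helper. Its Debug output deliberately omits the inner error.
struct Layer {
    label: &'static str,
    inner: Option<Box<Layer>>,
}

impl fmt::Debug for Layer {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.label)
    }
}

impl fmt::Display for Layer {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.label)
    }
}

impl Error for Layer {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        self.inner.as_deref().map(|e| e as &(dyn Error + 'static))
    }
}
```

In application code the same helper could presumably be handed a reference to the `SdkError` itself, provided its type parameters satisfy the bounds needed for it to implement `std::error::Error`.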

To Reproduce

  • aws-sdk-lambda: 1.113.0
  • aws-config: 1.8.12
  • aws-smithy-runtime: 1.9.8
  • aws-smithy-http-client: 1.1.6, so this occurs even after #4145
  • Rust: 1.85 (2024 edition)
  • Platform: Linux x86_64 and Firecracker VM (when making calls from AWS Lambda)

The quickest way to encounter the error seems to be to repeatedly send batches of concurrent requests.

use anyhow::Result;
use aws_config::BehaviorVersion;
use aws_sdk_lambda::Client;
use aws_smithy_runtime_api::client::result::SdkError;
use aws_smithy_types::Blob;
use futures_util::future::join_all;

const FUNCTION_ARN: &str = "YOUR_LAMBDA_ARN";
const PAYLOAD: &str = "[]";
const BATCH_SIZE: usize = 100;
const MAX_BATCHES: usize = 1000;

/// Check if an SDK error is a dispatch failure (e.g., HTTP/2 GOAWAY).
fn is_dispatch_failure<E, R>(err: &SdkError<E, R>) -> bool {
    matches!(err, SdkError::DispatchFailure(_))
}

#[tokio::main]
async fn main() -> Result<()> {
    let config = aws_config::load_defaults(BehaviorVersion::latest()).await;
    let client = Client::new(&config);

    for batch in 0..MAX_BATCHES {
        let mut tasks = Vec::with_capacity(BATCH_SIZE);

        for _ in 0..BATCH_SIZE {
            let client = client.clone();
            tasks.push(tokio::spawn(async move {
                client
                    .invoke()
                    .function_name(FUNCTION_ARN)
                    .payload(Blob::new(PAYLOAD))
                    .send()
                    .await
            }));
        }

        let results = join_all(tasks).await;
        for res in results {
            match res.unwrap() {
                Ok(_) => {}
                Err(e) => {
                    eprintln!("failure at batch {}: {:?}", batch, e);
                    assert!(is_dispatch_failure(&e));
                    return Err(e.into());
                }
            }
        }

        if batch % 10 == 0 {
            eprintln!("Completed batch {}", batch);
        }
    }

    Ok(())
}

Clues

When batching, we pretty consistently hit the error at batch index 99. What a suspicious number!
If the batch size is 100 and we hit the error in the 100th batch, we are hitting the error around request 10,000.
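
The arithmetic can be spelled out as a small sketch. Both helper names are hypothetical. The stream ID calculation additionally assumes that every request travels over a single HTTP/2 connection and that the client numbers its streams with consecutive odd IDs starting at 1, which the reproduction does not verify.

```rust
/// 1-based request numbers (first, last) covered by a 0-based batch index.
fn batch_request_range(batch_index: usize, batch_size: usize) -> (usize, usize) {
    let first = batch_index * batch_size + 1;
    let last = (batch_index + 1) * batch_size;
    (first, last)
}

/// Stream ID of the n-th request (1-based) on one connection, assuming
/// consecutive odd client-initiated stream IDs: 1, 3, 5, ...
fn client_stream_id(n: u32) -> u32 {
    assert!(n >= 1, "request numbers are 1-based");
    2 * n - 1
}
```

With a batch size of 100, `batch_request_range(99, 100)` gives requests 9,901 through 10,000, and under the assumptions above request 10,000 would use stream ID 19,999.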

Speculation

If I had to guess, I'd say these API requests are passing through a load balancer configured to kill the connection
after 10,000 requests.

Potential Fixes (assuming speculation is correct)

  1. Remove the stream ID cap on the load balancer serving the API (allow more than 10,000 requests per connection).
  2. Or, account for the limitation within the AWS client libraries: either handle the error gracefully, or explicitly
    create a new connection on request number 10,001.

Option 1 seems ideal, but I imagine it would involve some bureaucratic challenges.
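
As a rough illustration of the second option, here is a minimal sketch of a wrapper that rebuilds its inner value once a request budget is spent. `Rotating` is a hypothetical type, not an SDK facility, and what "a new connection" maps to inside the SDK (for example, a freshly built HTTP client) is an assumption left open here. The sketch is also single-threaded; concurrent callers would need a lock around it.

```rust
/// Hypothetical wrapper that replaces its inner value after `limit` uses.
struct Rotating<T, F: FnMut() -> T> {
    make: F,
    current: T,
    used: usize,
    limit: usize,
    rotations: usize,
}

impl<T, F: FnMut() -> T> Rotating<T, F> {
    /// Builds the first inner value eagerly. `limit` must be at least 1.
    fn new(mut make: F, limit: usize) -> Self {
        assert!(limit > 0, "limit must be at least 1");
        let current = make();
        Self { make, current, used: 0, limit, rotations: 0 }
    }

    /// Returns the value to use for the next request, rebuilding it first
    /// if the current one has already served `limit` requests.
    fn next(&mut self) -> &T {
        if self.used == self.limit {
            self.current = (self.make)();
            self.used = 0;
            self.rotations += 1;
        }
        self.used += 1;
        &self.current
    }
}
```

With a limit of 10,000, the first 10,000 calls to `next()` would share one inner value and call 10,001 would trigger a rebuild, mirroring the "new connection on request number 10,001" idea.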

Workaround

In the meantime, if you are another user encountering this error, you might want to retry:

use aws_smithy_runtime_api::client::result::SdkError;

/// Decide whether a failed call is worth retrying.
fn do_retry<E, R>(err: &SdkError<E, R>) -> bool {
    match err {
        SdkError::DispatchFailure(d) => d.is_io() || d.is_user(),
        _ => false,
    }
}

// `d.is_io()` covers the initial error.
// `d.is_user()` covers the fallout: other requests on the same connection suffer when the connection drops.

I find that this error can be retried immediately, with no delay, and the request succeeds the second time.
However, a retry here might invoke the same Lambda more than once.
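
The retry itself can be sketched as a bounded loop around the call. This is a synchronous, standard-library-only illustration of the control flow; `retry_if` is a hypothetical helper, and with the SDK the operation would be awaited and the predicate would be something like `do_retry`.

```rust
/// Runs `op` until it succeeds, the error is not retryable, or
/// `max_attempts` attempts have been made. At least one attempt is always
/// made, and retries happen immediately with no delay.
fn retry_if<T, E>(
    max_attempts: usize,
    mut op: impl FnMut() -> Result<T, E>,
    should_retry: impl Fn(&E) -> bool,
) -> Result<T, E> {
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Retry only while attempts remain and the predicate agrees.
            Err(e) if attempt < max_attempts && should_retry(&e) => attempt += 1,
            Err(e) => return Err(e),
        }
    }
}
```

Capping the attempts at 2 matches the observation that the second try succeeds, and keeps the number of possible duplicate invocations small.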

Side-note

If the load balancer is killing the connection after the HTTP/2 GOAWAY,
as is the default on AWS load balancers,
then we might be wasting paid customer Lambda requests. I wonder whether in-progress
requests are failing to be reported back to the user. Normally, already-established h2 streams
are expected to stay alive after GOAWAY; GOAWAY just prevents the creation of new streams.

Version

├── aws-config v1.8.12
│   ├── aws-credential-types v1.2.11
│   │   ├── aws-smithy-async v1.2.8
│   │   ├── aws-smithy-runtime-api v1.11.0
│   │   │   ├── aws-smithy-async v1.2.8 (*)
│   │   │   ├── aws-smithy-types v1.4.0
│   │   ├── aws-smithy-types v1.4.0 (*)
│   ├── aws-runtime v1.5.18
│   │   ├── aws-credential-types v1.2.11 (*)
│   │   ├── aws-sigv4 v1.3.7
│   │   │   ├── aws-credential-types v1.2.11 (*)
│   │   │   ├── aws-smithy-eventstream v0.60.14
│   │   │   │   ├── aws-smithy-types v1.4.0 (*)
│   │   │   ├── aws-smithy-http v0.62.6
│   │   │   │   ├── aws-smithy-eventstream v0.60.14 (*)
│   │   │   │   ├── aws-smithy-runtime-api v1.11.0 (*)
│   │   │   │   ├── aws-smithy-types v1.4.0 (*)
│   │   │   ├── aws-smithy-runtime-api v1.11.0 (*)
│   │   │   ├── aws-smithy-types v1.4.0 (*)
│   │   ├── aws-smithy-async v1.2.8 (*)
│   │   ├── aws-smithy-eventstream v0.60.14 (*)
│   │   ├── aws-smithy-http v0.62.6 (*)
│   │   ├── aws-smithy-runtime v1.9.8
│   │   │   ├── aws-smithy-async v1.2.8 (*)
│   │   │   ├── aws-smithy-http v0.62.6 (*)
│   │   │   ├── aws-smithy-http-client v1.1.6
│   │   │   │   ├── aws-smithy-async v1.2.8 (*)
│   │   │   │   ├── aws-smithy-runtime-api v1.11.0 (*)
│   │   │   │   ├── aws-smithy-types v1.4.0 (*)
│   │   │   │   │   │   ├── aws-lc-rs v1.15.3
│   │   │   │   │   │   │   ├── aws-lc-sys v0.36.0
│   │   │   │   │   │   │   ├── aws-lc-rs v1.15.3 (*)
│   │   │   ├── aws-smithy-observability v0.2.0
│   │   │   │   └── aws-smithy-runtime-api v1.11.0 (*)
│   │   │   ├── aws-smithy-runtime-api v1.11.0 (*)
│   │   │   ├── aws-smithy-types v1.4.0 (*)
│   │   ├── aws-smithy-runtime-api v1.11.0 (*)
│   │   ├── aws-smithy-types v1.4.0 (*)
│   │   ├── aws-types v1.3.11
│   │   │   ├── aws-credential-types v1.2.11 (*)
│   │   │   ├── aws-smithy-async v1.2.8 (*)
│   │   │   ├── aws-smithy-runtime-api v1.11.0 (*)
│   │   │   ├── aws-smithy-types v1.4.0 (*)
│   ├── aws-sdk-sso v1.92.0
│   │   ├── aws-credential-types v1.2.11 (*)
│   │   ├── aws-runtime v1.5.18 (*)
│   │   ├── aws-smithy-async v1.2.8 (*)
│   │   ├── aws-smithy-http v0.62.6 (*)
│   │   ├── aws-smithy-json v0.61.9
│   │   │   └── aws-smithy-types v1.4.0 (*)
│   │   ├── aws-smithy-observability v0.2.0 (*)
│   │   ├── aws-smithy-runtime v1.9.8 (*)
│   │   ├── aws-smithy-runtime-api v1.11.0 (*)
│   │   ├── aws-smithy-types v1.4.0 (*)
│   │   ├── aws-types v1.3.11 (*)
│   ├── aws-sdk-ssooidc v1.94.0
│   │   ├── aws-credential-types v1.2.11 (*)
│   │   ├── aws-runtime v1.5.18 (*)
│   │   ├── aws-smithy-async v1.2.8 (*)
│   │   ├── aws-smithy-http v0.62.6 (*)
│   │   ├── aws-smithy-json v0.61.9 (*)
│   │   ├── aws-smithy-observability v0.2.0 (*)
│   │   ├── aws-smithy-runtime v1.9.8 (*)
│   │   ├── aws-smithy-runtime-api v1.11.0 (*)
│   │   ├── aws-smithy-types v1.4.0 (*)
│   │   ├── aws-types v1.3.11 (*)
│   ├── aws-sdk-sts v1.96.0
│   │   ├── aws-credential-types v1.2.11 (*)
│   │   ├── aws-runtime v1.5.18 (*)
│   │   ├── aws-smithy-async v1.2.8 (*)
│   │   ├── aws-smithy-http v0.62.6 (*)
│   │   ├── aws-smithy-json v0.61.9 (*)
│   │   ├── aws-smithy-observability v0.2.0 (*)
│   │   ├── aws-smithy-query v0.60.9
│   │   │   ├── aws-smithy-types v1.4.0 (*)
│   │   ├── aws-smithy-runtime v1.9.8 (*)
│   │   ├── aws-smithy-runtime-api v1.11.0 (*)
│   │   ├── aws-smithy-types v1.4.0 (*)
│   │   ├── aws-smithy-xml v0.60.13
│   │   ├── aws-types v1.3.11 (*)
│   ├── aws-smithy-async v1.2.8 (*)
│   ├── aws-smithy-http v0.62.6 (*)
│   ├── aws-smithy-json v0.61.9 (*)
│   ├── aws-smithy-runtime v1.9.8 (*)
│   ├── aws-smithy-runtime-api v1.11.0 (*)
│   ├── aws-smithy-types v1.4.0 (*)
│   ├── aws-types v1.3.11 (*)
├── aws-credential-types v1.2.11 (*)
├── aws-sdk-lambda v1.113.0
│   ├── aws-credential-types v1.2.11 (*)
│   ├── aws-runtime v1.5.18 (*)
│   ├── aws-smithy-async v1.2.8 (*)
│   ├── aws-smithy-eventstream v0.60.14 (*)
│   ├── aws-smithy-http v0.62.6 (*)
│   ├── aws-smithy-json v0.61.9 (*)
│   ├── aws-smithy-observability v0.2.0 (*)
│   ├── aws-smithy-runtime v1.9.8 (*)
│   ├── aws-smithy-runtime-api v1.11.0 (*)
│   ├── aws-smithy-types v1.4.0 (*)
│   ├── aws-types v1.3.11 (*)
├── aws-sdk-sts v1.96.0 (*)
├── aws-smithy-http-client v1.1.6 (*)
├── aws-smithy-runtime v1.9.8 (*)
├── aws-smithy-runtime-api v1.11.0 (*)
├── aws-smithy-types v1.4.0 (*)

Environment details (OS name and version, etc.)

Firecracker VM and a recent Linux x86_64

Labels

bug: This issue is a bug.
p2: This is a standard priority issue.
