awslabs/aws-sdk-rust

Minimum throughput detection incorrectly raises an error

GeorgeHahn opened this issue · 6 comments

Describe the bug

We observe frequent low-throughput errors when reading a streamed S3 body in a Lambda function. Our theory is that the decompression and business logic processing the stream leave the async runtime CPU-constrained, which appears to break keepalive on the underlying stream. This looks like a worst-case scenario for the throughput estimation logic: the attached reproduction returns an error even though reads separated by 120 ms still receive data.

Regression Issue

  • Select this option if this issue appears to be a regression.

Expected Behavior

Minimum throughput errors should not be raised unless the connection is unable to provide data.

Current Behavior

A minimum throughput error is incorrectly raised.

Reproduction Steps

Cargo.toml

[package]
name = "s3-throughput-min-repro"
version = "0.1.0"
edition = "2021"

[dependencies]
anyhow = "1.0"
aws-config = { version = "1.5", default-features = false, features = ["client-hyper", "rt-tokio", "rustls"] }
aws-sdk-s3 = { version = "1.42", default-features = false, features = ["rt-tokio", "rustls"] }
clap = { version = "4.5", default-features = false, features = ["std", "derive"]}
tokio = { version = "1.23", default-features = false, features = ["full"] }
tracing = { version = "0.1", default-features = false }
tracing-subscriber = { version = "0.3", default-features = false, features = ["env-filter", "fmt", "std"] }

main.rs

use aws_config::BehaviorVersion;
use aws_sdk_s3::operation::RequestIdExt;
use clap::Parser;
use tokio::io::{AsyncRead, AsyncReadExt};
use tracing::info;

/// Read a file from S3. Reproduces a bug where the AWS SDK returns a low throughput error.
#[derive(Parser, Debug)]
#[command(author, version, about, long_about = None)]
struct Cli {
    /// S3 bucket to read from
    #[arg(long)]
    bucket: String,
    /// S3 path to read
    #[arg(long)]
    key: String,
}

pub async fn remote_object<'a>(
    s3_client: &'a aws_sdk_s3::Client,
    bucket: &str,
    key: &str,
) -> Result<impl AsyncRead, anyhow::Error> {
    let object = s3_client
        .get_object()
        .bucket(bucket)
        .key(key)
        .send()
        .await?;
    if let Some(id) = object.extended_request_id() {
        info!("Request ID: {id}");
    }
    Ok(object.body.into_async_read())
}

async fn repro(opts: Cli) -> Result<(), anyhow::Error> {
    let config = aws_config::defaults(BehaviorVersion::latest())
        .region("us-west-2")
        .load()
        .await;
    let s3_client = aws_sdk_s3::Client::new(&config);

    let mut object = remote_object(&s3_client, &opts.bucket, &opts.key).await?;
    let mut buffer = Vec::with_capacity(5_000_000);

    // Low throughput detection uses a 10-bin[1] window[2] to track requests
    // over the last second[3,4]. There is a 20 second grace period before a
    // throughput error is raised, implemented as a delay future[5,6]. Errors
    // are suppressed if the stream is not being polled, detected by half of the
    // bins being empty[7].
    //
    // 1: https://github.com/awslabs/aws-sdk-rust/blob/main/sdk/aws-smithy-runtime/src/client/http/body/minimum_throughput/throughput.rs#L273
    // 2: https://github.com/awslabs/aws-sdk-rust/blob/main/sdk/aws-smithy-runtime/src/client/http/body/minimum_throughput/throughput.rs#L334-L339
    // 3: https://github.com/awslabs/aws-sdk-rust/blob/main/sdk/aws-smithy-runtime/src/client/http/body/minimum_throughput/options.rs#L79
    // 4: https://github.com/awslabs/aws-sdk-rust/blob/main/sdk/aws-smithy-runtime/src/client/http/body/minimum_throughput/throughput.rs#L309
    // 5: https://github.com/awslabs/aws-sdk-rust/blob/main/sdk/aws-smithy-runtime-api/src/client/stalled_stream_protection.rs#L16-L20
    // 6: https://github.com/awslabs/aws-sdk-rust/blob/bbf69c2279fa9cdd1af5327ba58c7dad0cb62633/sdk/aws-smithy-runtime/src/client/http/body/minimum_throughput/http_body_0_4_x.rs#L111-L127
    // 7: https://github.com/awslabs/aws-sdk-rust/blob/main/sdk/aws-smithy-runtime/src/client/http/body/minimum_throughput/throughput.rs#L401

    // This repro works by filling the bins with pending requests, which are
    // counted as no data in the throughput calculation. The incoming bytes are
    // logged into bins, but the pending request label is sticky and takes
    // precedence.
    //
    // Full runtime pauses are used to simulate running alongside a CPU-bound
    // workload. (In our case, a lambda that's busy decompressing incoming
    // data.)
    //
    // This is effectively the worst-case scenario for slow readers: the
    // throughput estimation breaks down when most polls return Pending
    // before eventually becoming ready.
    //

    // Space reads just above the 100 ms bin resolution so that pending
    // polls fill the throughput bins
    loop {
        let n = object.read_buf(&mut buffer).await?;
        buffer.clear();

        // Deliberately block the whole (current-thread) runtime between
        // reads, simulating CPU-bound work such as decompression
        std::thread::sleep(std::time::Duration::from_millis(120));

        // Good-enough EOF check
        if n == 0 {
            break Ok(());
        }
    }
}

#[tokio::main(flavor = "current_thread")]
async fn main() {
    if std::env::var("RUST_LOG").is_err() {
        std::env::set_var("RUST_LOG", "trace");
    }
    tracing_subscriber::fmt::init();
    let opts = Cli::parse();
    let res = repro(opts).await;
    if let Err(e) = res {
        eprintln!("{:?}", e);
    }
}
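To make the failure mode concrete, here is a toy model of the sticky-pending bin accounting described in the comments above. This is not the SDK's actual code; the types and names are invented for illustration. It shows how data can genuinely arrive within the one-second window, yet the effective throughput estimate collapses to zero because every bin also saw a pending poll.

```rust
/// Toy model of the 10-bin throughput window (hypothetical names, not the
/// SDK's implementation).
#[derive(Default, Clone, Copy)]
struct Bin {
    /// Bytes that actually arrived during this ~100 ms bin.
    bytes: u64,
    /// Sticky flag: a pending poll marks the bin as "no data", taking
    /// precedence over any bytes logged into it.
    saw_pending: bool,
}

/// Effective throughput as the model computes it: bins that ever saw a
/// pending poll contribute zero bytes, even if data landed in them.
fn effective_throughput(bins: &[Bin], window_secs: f64) -> f64 {
    let counted: u64 = bins
        .iter()
        .map(|b| if b.saw_pending { 0 } else { b.bytes })
        .sum();
    counted as f64 / window_secs
}

fn main() {
    // 10 bins covering the last second. Reads spaced ~120 ms apart mean
    // every bin is polled at least once while no data is ready yet...
    let mut bins = [Bin::default(); 10];
    for b in bins.iter_mut() {
        b.saw_pending = true;
    }
    // ...even though reads inside the window really delivered data.
    bins[0].bytes = 64 * 1024;
    bins[5].bytes = 64 * 1024;

    let actual: u64 = bins.iter().map(|b| b.bytes).sum();
    assert_eq!(actual, 128 * 1024); // data was flowing...
    assert_eq!(effective_throughput(&bins, 1.0), 0.0); // ...but the estimate is zero
    println!("estimated throughput: {} B/s", effective_throughput(&bins, 1.0));
}
```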

Possible Solution

No response

Additional Information/Context

No response

Version

├── aws-config v1.5.7
│   ├── aws-credential-types v1.2.1
│   │   ├── aws-smithy-async v1.2.1
│   │   ├── aws-smithy-runtime-api v1.7.2
│   │   │   ├── aws-smithy-async v1.2.1 (*)
│   │   │   ├── aws-smithy-types v1.2.7
│   │   ├── aws-smithy-types v1.2.7 (*)
│   ├── aws-runtime v1.4.3
│   │   ├── aws-credential-types v1.2.1 (*)
│   │   ├── aws-sigv4 v1.2.4
│   │   │   ├── aws-credential-types v1.2.1 (*)
│   │   │   ├── aws-smithy-eventstream v0.60.5
│   │   │   │   ├── aws-smithy-types v1.2.7 (*)
│   │   │   ├── aws-smithy-http v0.60.11
│   │   │   │   ├── aws-smithy-eventstream v0.60.5 (*)
│   │   │   │   ├── aws-smithy-runtime-api v1.7.2 (*)
│   │   │   │   ├── aws-smithy-types v1.2.7 (*)
│   │   │   ├── aws-smithy-runtime-api v1.7.2 (*)
│   │   │   ├── aws-smithy-types v1.2.7 (*)
│   │   ├── aws-smithy-async v1.2.1 (*)
│   │   ├── aws-smithy-eventstream v0.60.5 (*)
│   │   ├── aws-smithy-http v0.60.11 (*)
│   │   ├── aws-smithy-runtime v1.7.1
│   │   │   ├── aws-smithy-async v1.2.1 (*)
│   │   │   ├── aws-smithy-http v0.60.11 (*)
│   │   │   ├── aws-smithy-runtime-api v1.7.2 (*)
│   │   │   ├── aws-smithy-types v1.2.7 (*)
│   │   ├── aws-smithy-runtime-api v1.7.2 (*)
│   │   ├── aws-smithy-types v1.2.7 (*)
│   │   ├── aws-types v1.3.3
│   │   │   ├── aws-credential-types v1.2.1 (*)
│   │   │   ├── aws-smithy-async v1.2.1 (*)
│   │   │   ├── aws-smithy-runtime-api v1.7.2 (*)
│   │   │   ├── aws-smithy-types v1.2.7 (*)
│   ├── aws-sdk-sts v1.44.0
│   │   ├── aws-credential-types v1.2.1 (*)
│   │   ├── aws-runtime v1.4.3 (*)
│   │   ├── aws-smithy-async v1.2.1 (*)
│   │   ├── aws-smithy-http v0.60.11 (*)
│   │   ├── aws-smithy-json v0.60.7
│   │   │   └── aws-smithy-types v1.2.7 (*)
│   │   ├── aws-smithy-query v0.60.7
│   │   │   ├── aws-smithy-types v1.2.7 (*)
│   │   ├── aws-smithy-runtime v1.7.1 (*)
│   │   ├── aws-smithy-runtime-api v1.7.2 (*)
│   │   ├── aws-smithy-types v1.2.7 (*)
│   │   ├── aws-smithy-xml v0.60.9
│   │   ├── aws-types v1.3.3 (*)
│   ├── aws-smithy-async v1.2.1 (*)
│   ├── aws-smithy-http v0.60.11 (*)
│   ├── aws-smithy-json v0.60.7 (*)
│   ├── aws-smithy-runtime v1.7.1 (*)
│   ├── aws-smithy-runtime-api v1.7.2 (*)
│   ├── aws-smithy-types v1.2.7 (*)
│   ├── aws-types v1.3.3 (*)
├── aws-sdk-s3 v1.52.0
│   ├── aws-credential-types v1.2.1 (*)
│   ├── aws-runtime v1.4.3 (*)
│   ├── aws-sigv4 v1.2.4 (*)
│   ├── aws-smithy-async v1.2.1 (*)
│   ├── aws-smithy-checksums v0.60.12
│   │   ├── aws-smithy-http v0.60.11 (*)
│   │   ├── aws-smithy-types v1.2.7 (*)
│   ├── aws-smithy-eventstream v0.60.5 (*)
│   ├── aws-smithy-http v0.60.11 (*)
│   ├── aws-smithy-json v0.60.7 (*)
│   ├── aws-smithy-runtime v1.7.1 (*)
│   ├── aws-smithy-runtime-api v1.7.2 (*)
│   ├── aws-smithy-types v1.2.7 (*)
│   ├── aws-smithy-xml v0.60.9 (*)
│   ├── aws-types v1.3.3 (*)


Environment details (OS name and version, etc.)

Repro tested on macOS. We're hitting this in production on x86_64 Lambdas running AL2023 with 375 MiB of memory.

Logs

No response

Hi @GeorgeHahn, thank you for bringing this to our attention and for providing detailed reproduction steps. We will investigate to see if we can first replicate the issue on our end.

Still trying to reproduce it on our end. In the meantime, could you provide us with the following information?

  • The size of an object stored in S3 that tends to trigger a stalled stream protection error, given the above reproduction code
  • A snippet of a trace log, if available, leading up to a stalled stream protection error you're observing (please be careful not to include sensitive information in the log)

The objects in this case are typically 4-10 MiB.

Log attached. I removed the config and setup log lines for simplicity. Let me know if there are any details you'd like to see from that part.
repro.log

Thank you, we've reproduced the issue on our end. Will further investigate it.

This should be fixed in today's release of the aws-sdk-s3 crate (1.55.0). Thank you for the reproduction; we've used it to add an integration test so we can detect if this regresses in the future!

Fix: smithy-lang/smithy-rs#3871
Integration test based on your repro: smithy-lang/smithy-rs#3874
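For anyone pinned to an affected version, stalled stream protection can also be disabled on the client as a mitigation. A minimal sketch, assuming the aws-config / aws-sdk-s3 APIs current at the time of this issue (not network-tested here):

```rust
use aws_config::BehaviorVersion;
use aws_sdk_s3::config::StalledStreamProtectionConfig;

#[tokio::main]
async fn main() {
    let config = aws_config::defaults(BehaviorVersion::latest())
        .region("us-west-2")
        // Mitigation for affected versions: turn off minimum-throughput
        // (stalled stream) detection entirely for clients built from this
        // config.
        .stalled_stream_protection(StalledStreamProtectionConfig::disabled())
        .load()
        .await;
    let _s3_client = aws_sdk_s3::Client::new(&config);
}
```

Note that this disables the protection the feature is meant to provide (abandoned uploads/downloads holding connections open), so upgrading to the fixed release is preferable.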

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.