awslabs/aws-sdk-rust

S3 `complete_multipart_upload` are not retried automatically on internal error

sticnarf opened this issue · 6 comments

Describe the bug

When I call complete_multipart_upload to complete an S3 multipart upload, occasionally I get an internal error thrown from S3 service.

And I set a quite large limit for the auto retry:

let retry_config = RetryConfig::standard()
            .with_max_attempts(128)
            .with_initial_backoff(Duration::from_millis(200))
            .with_max_backoff(Duration::from_millis(5000));

Expected Behavior

The complete_multipart_upload request always succeeds.

Current Behavior

Sometimes, I get errors like this. The error shows up quickly after complete_multipart_upload is called, obviously not exceeding the retry limit.

ServiceError(ServiceError { source: Unhandled(Unhandled { source: ErrorMetadata { code: Some("InternalError"), message: Some("We encountered an internal error. Please try again."), extras: Some({"s3_extended_request_id": "SgvXS40iED78LIUYyXRInmIANhnrYzdJCEI7/5cJnCkz4zKSoreguNI3UERBucyDzzIPrnnBaNkBxFGuPY0nOw==", "aws_request_id": "RBHQNEJ7EJB02AP6"}) }, meta: ErrorMetadata { code: Some("InternalError"), message: Some("We encountered an internal error. Please try again."), extras: Some({"s3_extended_request_id": "SgvXS40iED78LIUYyXRInmIANhnrYzdJCEI7/5cJnCkz4zKSoreguNI3UERBucyDzzIPrnnBaNkBxFGuPY0nOw==", "aws_request_id": "RBHQNEJ7EJB02AP6"}) } }), raw: Response { status: StatusCode(200), headers: Headers { headers: {"x-amz-id-2": HeaderValue { _private: H0("SgvXS40iED78LIUYyXRInmIANhnrYzdJCEI7/5cJnCkz4zKSoreguNI3UERBucyDzzIPrnnBaNkBxFGuPY0nOw==") }, "x-amz-request-id": HeaderValue { _private: H0("RBHQNEJ7EJB02AP6") }, "date": HeaderValue { _private: H0("Tue, 11 Jun 2024 03:48:43 GMT") }, "x-amz-server-side-encryption": HeaderValue { _private: H0("AES256") }, "content-type": HeaderValue { _private: H0("application/xml") }, "transfer-encoding": HeaderValue { _private: H0("chunked") }, "server": HeaderValue { _private: H0("AmazonS3") }} }, body: SdkBody { inner: Once(Some(b"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<Error><Code>InternalError</Code><Message>We encountered an internal error. Please try again.</Message><RequestId>RBHQNEJ7EJB02AP6</RequestId><HostId>SgvXS40iED78LIUYyXRInmIANhnrYzdJCEI7/5cJnCkz4zKSoreguNI3UERBucyDzzIPrnnBaNkBxFGuPY0nOw==</HostId></Error>")), retryable: true }, extensions: Extensions { extensions_02x: Extensions, extensions_1x: Extensions } } })

I'm not sure if the 200 HTTP status makes the SDK not retry automatically.

Reproduction Steps

The code generally looks like this. It's almost the same as the example code.

const UPLOAD_TTL_SECS: i64 = 24 * 60 * 60; // 1 day
let expires = DateTime::from_secs(
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap()
        .as_secs() as i64
        + UPLOAD_TTL_SECS,
);
let res = self
    .client
    .create_multipart_upload()
    .bucket(&self.bucket)
    .key(key.clone())
    .expires(expires)
    .send()
    .await?;
let upload_id = res.upload_id.unwrap();
const CHUNK_SIZE: u64 = 64 * 1024 * 1024; // 64 MiB
const UPLOAD_BUF_SIZE: usize = 128 * 1024; // 128 KiB
let file_size = tokio::fs::metadata(local_path).await?.len();
let mut upload_sets = JoinSet::new();
let mut part_num = 1;
loop {
    let offset = (part_num - 1) * CHUNK_SIZE;
    if offset >= file_size {
        break;
    }
    let len = CHUNK_SIZE.min(file_size - offset);
    let fs_builder = ByteStream::read_from()
        .path(local_path)
        .offset(offset)
        .length(Length::Exact(len))
        .buffer_size(UPLOAD_BUF_SIZE);
    let upload = self
        .client
        .upload_part()
        .bucket(&self.bucket)
        .key(key.clone())
        .part_number(part_num as i32)
        .upload_id(upload_id.clone());
    upload_sets.spawn(async move {
        let data = fs_builder.build().await?;
        let output = upload.body(data).send().await?;
        let complete_part = CompletedPart::builder()
            .set_e_tag(output.e_tag)
            .part_number(part_num as i32)
            .build();
        Ok::<_, DataError>(complete_part)
    });
    part_num += 1;
}
let mut parts = Vec::new();
while let Some(res) = upload_sets.join_next().await {
    parts.push(res??);
}
parts.sort_by_key(|p| p.part_number());
let _output = self
    .client
    .complete_multipart_upload()
    .bucket(&self.bucket)
    .key(key)
    .upload_id(upload_id)
    .multipart_upload(
        CompletedMultipartUpload::builder()
            .set_parts(Some(parts))
            .build(),
    )
    .send()
    .await?;

Possible Solution

No response

Additional Information/Context

No response

Version

│   │   ├── aws-config v1.5.1
│   │   │   ├── aws-credential-types v1.2.0
│   │   │   │   ├── aws-smithy-async v1.2.1
│   │   │   │   ├── aws-smithy-runtime-api v1.6.3
│   │   │   │   │   ├── aws-smithy-async v1.2.1 (*)
│   │   │   │   │   ├── aws-smithy-types v1.2.0
│   │   │   │   ├── aws-smithy-types v1.2.0 (*)
│   │   │   ├── aws-runtime v1.2.2
│   │   │   │   ├── aws-credential-types v1.2.0 (*)
│   │   │   │   ├── aws-sigv4 v1.2.1
│   │   │   │   │   ├── aws-credential-types v1.2.0 (*)
│   │   │   │   │   ├── aws-smithy-eventstream v0.60.4
│   │   │   │   │   │   ├── aws-smithy-types v1.2.0 (*)
│   │   │   │   │   ├── aws-smithy-http v0.60.8
│   │   │   │   │   │   ├── aws-smithy-eventstream v0.60.4 (*)
│   │   │   │   │   │   ├── aws-smithy-runtime-api v1.6.3 (*)
│   │   │   │   │   │   ├── aws-smithy-types v1.2.0 (*)
│   │   │   │   │   ├── aws-smithy-runtime-api v1.6.3 (*)
│   │   │   │   │   ├── aws-smithy-types v1.2.0 (*)
│   │   │   │   ├── aws-smithy-async v1.2.1 (*)
│   │   │   │   ├── aws-smithy-eventstream v0.60.4 (*)
│   │   │   │   ├── aws-smithy-http v0.60.8 (*)
│   │   │   │   ├── aws-smithy-runtime-api v1.6.3 (*)
│   │   │   │   ├── aws-smithy-types v1.2.0 (*)
│   │   │   │   ├── aws-types v1.3.1
│   │   │   │   │   ├── aws-credential-types v1.2.0 (*)
│   │   │   │   │   ├── aws-smithy-async v1.2.1 (*)
│   │   │   │   │   ├── aws-smithy-runtime-api v1.6.3 (*)
│   │   │   │   │   ├── aws-smithy-types v1.2.0 (*)
│   │   │   ├── aws-sdk-sso v1.29.0
│   │   │   │   ├── aws-credential-types v1.2.0 (*)
│   │   │   │   ├── aws-runtime v1.2.2 (*)
│   │   │   │   ├── aws-smithy-async v1.2.1 (*)
│   │   │   │   ├── aws-smithy-http v0.60.8 (*)
│   │   │   │   ├── aws-smithy-json v0.60.7
│   │   │   │   │   └── aws-smithy-types v1.2.0 (*)
│   │   │   │   ├── aws-smithy-runtime v1.5.7
│   │   │   │   │   ├── aws-smithy-async v1.2.1 (*)
│   │   │   │   │   ├── aws-smithy-http v0.60.8 (*)
│   │   │   │   │   ├── aws-smithy-runtime-api v1.6.3 (*)
│   │   │   │   │   ├── aws-smithy-types v1.2.0 (*)
│   │   │   │   ├── aws-smithy-runtime-api v1.6.3 (*)
│   │   │   │   ├── aws-smithy-types v1.2.0 (*)
│   │   │   │   ├── aws-types v1.3.1 (*)
│   │   │   ├── aws-sdk-ssooidc v1.30.0
│   │   │   │   ├── aws-credential-types v1.2.0 (*)
│   │   │   │   ├── aws-runtime v1.2.2 (*)
│   │   │   │   ├── aws-smithy-async v1.2.1 (*)
│   │   │   │   ├── aws-smithy-http v0.60.8 (*)
│   │   │   │   ├── aws-smithy-json v0.60.7 (*)
│   │   │   │   ├── aws-smithy-runtime v1.5.7 (*)
│   │   │   │   ├── aws-smithy-runtime-api v1.6.3 (*)
│   │   │   │   ├── aws-smithy-types v1.2.0 (*)
│   │   │   │   ├── aws-types v1.3.1 (*)
│   │   │   ├── aws-sdk-sts v1.29.0
│   │   │   │   ├── aws-credential-types v1.2.0 (*)
│   │   │   │   ├── aws-runtime v1.2.2 (*)
│   │   │   │   ├── aws-smithy-async v1.2.1 (*)
│   │   │   │   ├── aws-smithy-http v0.60.8 (*)
│   │   │   │   ├── aws-smithy-json v0.60.7 (*)
│   │   │   │   ├── aws-smithy-query v0.60.7
│   │   │   │   │   ├── aws-smithy-types v1.2.0 (*)
│   │   │   │   ├── aws-smithy-runtime v1.5.7 (*)
│   │   │   │   ├── aws-smithy-runtime-api v1.6.3 (*)
│   │   │   │   ├── aws-smithy-types v1.2.0 (*)
│   │   │   │   ├── aws-smithy-xml v0.60.8
│   │   │   │   ├── aws-types v1.3.1 (*)
│   │   │   ├── aws-smithy-async v1.2.1 (*)
│   │   │   ├── aws-smithy-http v0.60.8 (*)
│   │   │   ├── aws-smithy-json v0.60.7 (*)
│   │   │   ├── aws-smithy-runtime v1.5.7 (*)
│   │   │   ├── aws-smithy-runtime-api v1.6.3 (*)
│   │   │   ├── aws-smithy-types v1.2.0 (*)
│   │   │   ├── aws-types v1.3.1 (*)
│   │   ├── aws-sdk-s3 v1.34.0
│   │   │   ├── aws-credential-types v1.2.0 (*)
│   │   │   ├── aws-runtime v1.2.2 (*)
│   │   │   ├── aws-sigv4 v1.2.1 (*)
│   │   │   ├── aws-smithy-async v1.2.1 (*)
│   │   │   ├── aws-smithy-checksums v0.60.9
│   │   │   │   ├── aws-smithy-http v0.60.8 (*)
│   │   │   │   ├── aws-smithy-types v1.2.0 (*)
│   │   │   ├── aws-smithy-eventstream v0.60.4 (*)
│   │   │   ├── aws-smithy-http v0.60.8 (*)
│   │   │   ├── aws-smithy-json v0.60.7 (*)
│   │   │   ├── aws-smithy-runtime v1.5.7 (*)
│   │   │   ├── aws-smithy-runtime-api v1.6.3 (*)
│   │   │   ├── aws-smithy-types v1.2.0 (*)
│   │   │   ├── aws-smithy-xml v0.60.8 (*)
│   │   │   ├── aws-types v1.3.1 (*)
│   │   ├── aws-smithy-types v1.2.0 (*)

Environment details (OS name and version, etc.)

Container (Ubuntu 20.04) on Amazon Linux 2023

Logs

No response

Hi @sticnarf, since I have not been able to reproduce the issue on my end, can you turn on tracing_subscriber like the following (DEBUG level specified by default)

    use tracing_subscriber::{layer::SubscriberExt, util::SubscriberInitExt};
 
    tracing_subscriber::registry()
    .with(tracing_subscriber::EnvFilter::new(
        std::env::var("RUST_LOG").unwrap_or_else(|_| "debug".into()),
    ))
    .with(tracing_subscriber::fmt::layer())
    .init();

and share with us logs that may possibly explain why retries did not take place (when sharing logs, please be careful and exclude your sensitive information such as security token from the logs)?

I've enabled debug log but fail reproduce the issue so far. I guess it only happens in rare cases like S3 partition splits (because when I encountered the issue the bucket is nearly empty).

Instead, I turn to looking at the SDK code. I look through all implementations of ClassifyRetry, it seems to me that none of them treats the 200 response with InternalError as RetryIndicated:

  • AwsErrorCodeClassifier: neither THROTTLING_ERRORS nor TRANSIENT_ERRORS includes InternalError
  • ModeledAsRetryableClassifier: CompleteMultipartUploadError does not provide retryable_error_kind
  • HttpStatusCodeClassifier: TRANSIENT_ERROR_STATUS_CODES does not include 200
  • TransientErrorClassifier: it's a service error, not handled by this classifier

But https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html#API_CompleteMultipartUpload_Example_4 gives an example of an error response with 200 status code.

The documentation also mentions such errors are all retryable:

Note that if CompleteMultipartUpload fails, applications should be prepared to retry any failed requests (including 500 error responses). For more information, see Amazon S3 Error Best Practices.

In aws-sdk-go-v2, there is a middleware that transforms the 200 status to 500 if there is any error: https://github.com/aws/aws-sdk-go-v2/blob/177d1a4d275716764f8bfb9b020b4b6fbf157da9/service/s3/internal/customizations/handle_200_error.go

So, the way to handle these errors will be similar to a 500 error (similar retry strategies).

However, I cannot find a similar processing in aws-sdk-rust. smithy-lang/smithy-rs#2958 only supports converting such a response to an error response, but it doesn't seem to make it well-handled in the retry mechanism.

However, I cannot find a similar processing in aws-sdk-rust. smithy-lang/smithy-rs#2958 only supports converting such a response to an error response, but it doesn't seem to make it well-handled in the retry mechanism.

I think your observation is correct. The de-serializer for CompleteMultiUpload alters the execution flow from a success to an error for a response with both the 200 status code and the Error root element, but the knowledge of the Error root element will be lost by the time a retry classifier kicks in (HttpStatusCodeClassifier by default for an S3 client).

The fix has been released yesterday. Although we added a test on our end, since we couldn't reproduce it on our end, let us know if the fix doesn't improve the behavior you're observing.

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.