S3 `complete_multipart_upload` is not retried automatically on internal error
sticnarf opened this issue · 6 comments
Describe the bug
When I call `complete_multipart_upload` to complete an S3 multipart upload, I occasionally get an internal error from the S3 service, even though I set a fairly large limit for automatic retries:
let retry_config = RetryConfig::standard()
.with_max_attempts(128)
.with_initial_backoff(Duration::from_millis(200))
.with_max_backoff(Duration::from_millis(5000));
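For reference, this config is applied when the client is built, roughly like this (a simplified sketch of one way to wire it in; the variable names are illustrative):
use aws_config::BehaviorVersion;

// Load the shared AWS config with the retry_config defined above,
// then build the S3 client from it.
let shared_config = aws_config::defaults(BehaviorVersion::latest())
    .retry_config(retry_config)
    .load()
    .await;
let client = aws_sdk_s3::Client::new(&shared_config);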
Expected Behavior
The `complete_multipart_upload` request succeeds, with transient internal errors being retried automatically.
Current Behavior
Sometimes I get errors like the one below. The error shows up almost immediately after `complete_multipart_upload` is called, far too quickly for the retry limit to have been exhausted.
ServiceError(ServiceError { source: Unhandled(Unhandled { source: ErrorMetadata { code: Some("InternalError"), message: Some("We encountered an internal error. Please try again."), extras: Some({"s3_extended_request_id": "SgvXS40iED78LIUYyXRInmIANhnrYzdJCEI7/5cJnCkz4zKSoreguNI3UERBucyDzzIPrnnBaNkBxFGuPY0nOw==", "aws_request_id": "RBHQNEJ7EJB02AP6"}) }, meta: ErrorMetadata { code: Some("InternalError"), message: Some("We encountered an internal error. Please try again."), extras: Some({"s3_extended_request_id": "SgvXS40iED78LIUYyXRInmIANhnrYzdJCEI7/5cJnCkz4zKSoreguNI3UERBucyDzzIPrnnBaNkBxFGuPY0nOw==", "aws_request_id": "RBHQNEJ7EJB02AP6"}) } }), raw: Response { status: StatusCode(200), headers: Headers { headers: {"x-amz-id-2": HeaderValue { _private: H0("SgvXS40iED78LIUYyXRInmIANhnrYzdJCEI7/5cJnCkz4zKSoreguNI3UERBucyDzzIPrnnBaNkBxFGuPY0nOw==") }, "x-amz-request-id": HeaderValue { _private: H0("RBHQNEJ7EJB02AP6") }, "date": HeaderValue { _private: H0("Tue, 11 Jun 2024 03:48:43 GMT") }, "x-amz-server-side-encryption": HeaderValue { _private: H0("AES256") }, "content-type": HeaderValue { _private: H0("application/xml") }, "transfer-encoding": HeaderValue { _private: H0("chunked") }, "server": HeaderValue { _private: H0("AmazonS3") }} }, body: SdkBody { inner: Once(Some(b"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<Error><Code>InternalError</Code><Message>We encountered an internal error. Please try again.</Message><RequestId>RBHQNEJ7EJB02AP6</RequestId><HostId>SgvXS40iED78LIUYyXRInmIANhnrYzdJCEI7/5cJnCkz4zKSoreguNI3UERBucyDzzIPrnnBaNkBxFGuPY0nOw==</HostId></Error>")), retryable: true }, extensions: Extensions { extensions_02x: Extensions, extensions_1x: Extensions } } })
I'm not sure whether the 200 HTTP status is what prevents the SDK from retrying automatically.
Reproduction Steps
The code generally looks like this. It's almost the same as the example code.
const UPLOAD_TTL_SECS: i64 = 24 * 60 * 60; // 1 day
let expires = DateTime::from_secs(
SystemTime::now()
.duration_since(UNIX_EPOCH)
.unwrap()
.as_secs() as i64
+ UPLOAD_TTL_SECS,
);
let res = self
.client
.create_multipart_upload()
.bucket(&self.bucket)
.key(key.clone())
.expires(expires)
.send()
.await?;
let upload_id = res.upload_id.unwrap();
const CHUNK_SIZE: u64 = 64 * 1024 * 1024; // 64 MiB
const UPLOAD_BUF_SIZE: usize = 128 * 1024; // 128 KiB
let file_size = tokio::fs::metadata(local_path).await?.len();
let mut upload_sets = JoinSet::new();
let mut part_num = 1;
loop {
let offset = (part_num - 1) * CHUNK_SIZE;
if offset >= file_size {
break;
}
let len = CHUNK_SIZE.min(file_size - offset);
let fs_builder = ByteStream::read_from()
.path(local_path)
.offset(offset)
.length(Length::Exact(len))
.buffer_size(UPLOAD_BUF_SIZE);
let upload = self
.client
.upload_part()
.bucket(&self.bucket)
.key(key.clone())
.part_number(part_num as i32)
.upload_id(upload_id.clone());
upload_sets.spawn(async move {
let data = fs_builder.build().await?;
let output = upload.body(data).send().await?;
let complete_part = CompletedPart::builder()
.set_e_tag(output.e_tag)
.part_number(part_num as i32)
.build();
Ok::<_, DataError>(complete_part)
});
part_num += 1;
}
let mut parts = Vec::new();
while let Some(res) = upload_sets.join_next().await {
parts.push(res??);
}
parts.sort_by_key(|p| p.part_number());
let _output = self
.client
.complete_multipart_upload()
.bucket(&self.bucket)
.key(key)
.upload_id(upload_id)
.multipart_upload(
CompletedMultipartUpload::builder()
.set_parts(Some(parts))
.build(),
)
.send()
.await?;
Possible Solution
No response
Additional Information/Context
No response
Version
│ │ ├── aws-config v1.5.1
│ │ │ ├── aws-credential-types v1.2.0
│ │ │ │ ├── aws-smithy-async v1.2.1
│ │ │ │ ├── aws-smithy-runtime-api v1.6.3
│ │ │ │ │ ├── aws-smithy-async v1.2.1 (*)
│ │ │ │ │ ├── aws-smithy-types v1.2.0
│ │ │ │ ├── aws-smithy-types v1.2.0 (*)
│ │ │ ├── aws-runtime v1.2.2
│ │ │ │ ├── aws-credential-types v1.2.0 (*)
│ │ │ │ ├── aws-sigv4 v1.2.1
│ │ │ │ │ ├── aws-credential-types v1.2.0 (*)
│ │ │ │ │ ├── aws-smithy-eventstream v0.60.4
│ │ │ │ │ │ ├── aws-smithy-types v1.2.0 (*)
│ │ │ │ │ ├── aws-smithy-http v0.60.8
│ │ │ │ │ │ ├── aws-smithy-eventstream v0.60.4 (*)
│ │ │ │ │ │ ├── aws-smithy-runtime-api v1.6.3 (*)
│ │ │ │ │ │ ├── aws-smithy-types v1.2.0 (*)
│ │ │ │ │ ├── aws-smithy-runtime-api v1.6.3 (*)
│ │ │ │ │ ├── aws-smithy-types v1.2.0 (*)
│ │ │ │ ├── aws-smithy-async v1.2.1 (*)
│ │ │ │ ├── aws-smithy-eventstream v0.60.4 (*)
│ │ │ │ ├── aws-smithy-http v0.60.8 (*)
│ │ │ │ ├── aws-smithy-runtime-api v1.6.3 (*)
│ │ │ │ ├── aws-smithy-types v1.2.0 (*)
│ │ │ │ ├── aws-types v1.3.1
│ │ │ │ │ ├── aws-credential-types v1.2.0 (*)
│ │ │ │ │ ├── aws-smithy-async v1.2.1 (*)
│ │ │ │ │ ├── aws-smithy-runtime-api v1.6.3 (*)
│ │ │ │ │ ├── aws-smithy-types v1.2.0 (*)
│ │ │ ├── aws-sdk-sso v1.29.0
│ │ │ │ ├── aws-credential-types v1.2.0 (*)
│ │ │ │ ├── aws-runtime v1.2.2 (*)
│ │ │ │ ├── aws-smithy-async v1.2.1 (*)
│ │ │ │ ├── aws-smithy-http v0.60.8 (*)
│ │ │ │ ├── aws-smithy-json v0.60.7
│ │ │ │ │ └── aws-smithy-types v1.2.0 (*)
│ │ │ │ ├── aws-smithy-runtime v1.5.7
│ │ │ │ │ ├── aws-smithy-async v1.2.1 (*)
│ │ │ │ │ ├── aws-smithy-http v0.60.8 (*)
│ │ │ │ │ ├── aws-smithy-runtime-api v1.6.3 (*)
│ │ │ │ │ ├── aws-smithy-types v1.2.0 (*)
│ │ │ │ ├── aws-smithy-runtime-api v1.6.3 (*)
│ │ │ │ ├── aws-smithy-types v1.2.0 (*)
│ │ │ │ ├── aws-types v1.3.1 (*)
│ │ │ ├── aws-sdk-ssooidc v1.30.0
│ │ │ │ ├── aws-credential-types v1.2.0 (*)
│ │ │ │ ├── aws-runtime v1.2.2 (*)
│ │ │ │ ├── aws-smithy-async v1.2.1 (*)
│ │ │ │ ├── aws-smithy-http v0.60.8 (*)
│ │ │ │ ├── aws-smithy-json v0.60.7 (*)
│ │ │ │ ├── aws-smithy-runtime v1.5.7 (*)
│ │ │ │ ├── aws-smithy-runtime-api v1.6.3 (*)
│ │ │ │ ├── aws-smithy-types v1.2.0 (*)
│ │ │ │ ├── aws-types v1.3.1 (*)
│ │ │ ├── aws-sdk-sts v1.29.0
│ │ │ │ ├── aws-credential-types v1.2.0 (*)
│ │ │ │ ├── aws-runtime v1.2.2 (*)
│ │ │ │ ├── aws-smithy-async v1.2.1 (*)
│ │ │ │ ├── aws-smithy-http v0.60.8 (*)
│ │ │ │ ├── aws-smithy-json v0.60.7 (*)
│ │ │ │ ├── aws-smithy-query v0.60.7
│ │ │ │ │ ├── aws-smithy-types v1.2.0 (*)
│ │ │ │ ├── aws-smithy-runtime v1.5.7 (*)
│ │ │ │ ├── aws-smithy-runtime-api v1.6.3 (*)
│ │ │ │ ├── aws-smithy-types v1.2.0 (*)
│ │ │ │ ├── aws-smithy-xml v0.60.8
│ │ │ │ ├── aws-types v1.3.1 (*)
│ │ │ ├── aws-smithy-async v1.2.1 (*)
│ │ │ ├── aws-smithy-http v0.60.8 (*)
│ │ │ ├── aws-smithy-json v0.60.7 (*)
│ │ │ ├── aws-smithy-runtime v1.5.7 (*)
│ │ │ ├── aws-smithy-runtime-api v1.6.3 (*)
│ │ │ ├── aws-smithy-types v1.2.0 (*)
│ │ │ ├── aws-types v1.3.1 (*)
│ │ ├── aws-sdk-s3 v1.34.0
│ │ │ ├── aws-credential-types v1.2.0 (*)
│ │ │ ├── aws-runtime v1.2.2 (*)
│ │ │ ├── aws-sigv4 v1.2.1 (*)
│ │ │ ├── aws-smithy-async v1.2.1 (*)
│ │ │ ├── aws-smithy-checksums v0.60.9
│ │ │ │ ├── aws-smithy-http v0.60.8 (*)
│ │ │ │ ├── aws-smithy-types v1.2.0 (*)
│ │ │ ├── aws-smithy-eventstream v0.60.4 (*)
│ │ │ ├── aws-smithy-http v0.60.8 (*)
│ │ │ ├── aws-smithy-json v0.60.7 (*)
│ │ │ ├── aws-smithy-runtime v1.5.7 (*)
│ │ │ ├── aws-smithy-runtime-api v1.6.3 (*)
│ │ │ ├── aws-smithy-types v1.2.0 (*)
│ │ │ ├── aws-smithy-xml v0.60.8 (*)
│ │ │ ├── aws-types v1.3.1 (*)
│ │ ├── aws-smithy-types v1.2.0 (*)
Environment details (OS name and version, etc.)
Container (Ubuntu 20.04) on Amazon Linux 2023
Logs
No response
Hi @sticnarf, since I have not been able to reproduce the issue on my end, could you turn on `tracing_subscriber` like the following (DEBUG level by default)
use tracing_subscriber::{layer::SubscriberExt, util::SubscriberInitExt};
tracing_subscriber::registry()
.with(tracing_subscriber::EnvFilter::new(
std::env::var("RUST_LOG").unwrap_or_else(|_| "debug".into()),
))
.with(tracing_subscriber::fmt::layer())
.init();
and share logs that might explain why retries did not take place? (When sharing logs, please be careful to exclude sensitive information such as security tokens.)
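As a side note, `EnvFilter` needs the `env-filter` feature of the `tracing-subscriber` crate. If global `debug` output turns out to be too noisy, the filter can also be narrowed to the crates that emit the retry/orchestrator traces, for example (the exact targets below are just a suggestion):
use tracing_subscriber::{layer::SubscriberExt, util::SubscriberInitExt};

// Same initialization as above, but with explicit per-crate directives instead
// of a global "debug" level; adjust the targets as needed.
tracing_subscriber::registry()
    .with(tracing_subscriber::EnvFilter::new(
        "aws_smithy_runtime=debug,aws_sdk_s3=debug",
    ))
    .with(tracing_subscriber::fmt::layer())
    .init();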
I've enabled debug logging but have failed to reproduce the issue so far. I guess it only happens in rare cases such as S3 partition splits (when I encountered the issue, the bucket was nearly empty).
Instead, I turned to the SDK code. I looked through all implementations of `ClassifyRetry`, and it seems to me that none of them treats a 200 response carrying `InternalError` as `RetryIndicated`:
- `AwsErrorCodeClassifier`: neither `THROTTLING_ERRORS` nor `TRANSIENT_ERRORS` includes `InternalError`
- `ModeledAsRetryableClassifier`: `CompleteMultipartUploadError` does not provide `retryable_error_kind`
- `HttpStatusCodeClassifier`: `TRANSIENT_ERROR_STATUS_CODES` does not include `200`
- `TransientErrorClassifier`: it's a service error, so it is not handled by this classifier
But https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html#API_CompleteMultipartUpload_Example_4 gives an example of an error response with a 200 status code, and the documentation says such errors should be retried:
> Note that if CompleteMultipartUpload fails, applications should be prepared to retry any failed requests (including 500 error responses). For more information, see Amazon S3 Error Best Practices.
In aws-sdk-go-v2, there is a middleware that transforms the 200 status into a 500 when the response body contains an error: https://github.com/aws/aws-sdk-go-v2/blob/177d1a4d275716764f8bfb9b020b4b6fbf157da9/service/s3/internal/customizations/handle_200_error.go
This way, such errors are handled like a 500 error, including the retry strategy.
However, I cannot find similar processing in aws-sdk-rust. smithy-lang/smithy-rs#2958 only converts such a response into an error response, but that error doesn't seem to be handled well by the retry mechanism.
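In the meantime, a possible application-level workaround is to retry `complete_multipart_upload` manually when this specific error code comes back. A rough sketch (the helper name, attempt count, and backoff values are arbitrary choices for illustration, not part of the SDK):
use std::time::Duration;
use aws_sdk_s3::error::ProvideErrorMetadata;
use aws_sdk_s3::types::{CompletedMultipartUpload, CompletedPart};

// Hypothetical helper: retry CompleteMultipartUpload at the application level,
// since the built-in classifiers do not retry the 200 + InternalError case.
async fn complete_with_retry(
    client: &aws_sdk_s3::Client,
    bucket: &str,
    key: &str,
    upload_id: &str,
    parts: Vec<CompletedPart>,
) -> Result<(), aws_sdk_s3::Error> {
    const MAX_ATTEMPTS: u32 = 8;
    for attempt in 1..=MAX_ATTEMPTS {
        let result = client
            .complete_multipart_upload()
            .bucket(bucket)
            .key(key)
            .upload_id(upload_id)
            .multipart_upload(
                CompletedMultipartUpload::builder()
                    .set_parts(Some(parts.clone()))
                    .build(),
            )
            .send()
            .await;
        match result {
            Ok(_) => return Ok(()),
            // Retry only the InternalError case, with a simple linear backoff.
            Err(err)
                if attempt < MAX_ATTEMPTS
                    && err.as_service_error().and_then(|e| e.code())
                        == Some("InternalError") =>
            {
                tokio::time::sleep(Duration::from_millis(200 * u64::from(attempt))).await;
            }
            Err(err) => return Err(err.into()),
        }
    }
    unreachable!("the last attempt either succeeds or returns the error")
}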
> However, I cannot find similar processing in aws-sdk-rust. smithy-lang/smithy-rs#2958 only converts such a response into an error response, but that error doesn't seem to be handled well by the retry mechanism.
I think your observation is correct. The deserializer for `CompleteMultipartUpload` changes the execution flow from success to error for a response with both a 200 status code and an `Error` root element, but the knowledge of the `Error` root element is lost by the time a retry classifier kicks in (`HttpStatusCodeClassifier` by default for an S3 client).
The fix was released yesterday. Although we added a test on our end, we couldn't reproduce the issue ourselves, so let us know if the fix doesn't improve the behavior you're observing.
Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue, feel free to do so.