aws/aws-sdk-cpp

There is no possibility to omit double file reading caused by checksum calculation

ThatEmbeddedGuy opened this issue · 4 comments

Describe the bug

Even if
Aws::S3::Model::ChecksumAlgorithm::NOT_SET

object_request.WithBucket(bucket).WithKey(s3key).WithChecksumAlgorithm(Aws::S3::Model::ChecksumAlgorithm::NOT_SET);
              object_request.SetBody(stdstream);

SDK rewrites it to MD5

Aws::String PutObjectRequest::GetChecksumAlgorithmName() const
{
  if (m_checksumAlgorithm == ChecksumAlgorithm::NOT_SET)
  {
    return "md5";
  }
  else
  {
    return ChecksumAlgorithmMapper::GetNameForChecksumAlgorithm(m_checksumAlgorithm);
  }
} 

https://github.com/aws/aws-sdk-cpp/blob/main/generated/src/aws-cpp-sdk-s3/source/model/PutObjectRequest.cpp#L329

It leads to checksum calculation which is passed in the header

content-md5: aOEJ8PQMpyoV4FzCJ4b45g==

full log from curl :

[DEBUG] CURL: (HeaderOut) PUT /XXXXXXXX HTTP/1.1
Host: XXXXX.s3.ap-southeast-1.amazonaws.com
Accept: /
amz-sdk-invocation-id: E01AF2D1-E5F1-4469-B02F-76304CFB417F
amz-sdk-request: attempt=1
authorization: AWS4-HMAC-SHA256 Credential=XXX/XXX/ap-southeast-1/s3/aws4_request, SignedHeaders=amz-sdk-invocation-id;amz-sdk-request;content->length;content-md5;content-type;host;x-amz-content-sha256;x-amz-date, Signature=35038b4005c0f8ec52b6084c2cd84461408ae0ed1056e4b3788543e7009e5360
content-length: 10
content-md5: aOEJ8PQMpyoV4FzCJ4b45g==
content-type: binary/octet-stream
user-agent: aws-sdk-cpp/1.9.310 Linux/5.15.0-202.135.2.el9uek.x86_64 x86_64 GCC/11.4.1
x-amz-content-sha256: UNSIGNED-PAYLOAD
x-amz-date: 20240425T125503Z
Expect: 100-continue

Headers are calculated before factual sending of file, therefore data has been read 2 times. One time for calculating checksumm and one for factual sending

Related issue was considered as bug there
#711

Expected Behavior

I should have possibility to disable twice data readings caused by checksum calculations

Current Behavior

Checksum calculation is passed in the header and is calculated before request

content-md5: aOEJ8PQMpyoV4FzCJ4b45g==

full log from curl :

[DEBUG] CURL: (HeaderOut) PUT /XXXXXXXX HTTP/1.1
Host: XXXXX.s3.ap-southeast-1.amazonaws.com
Accept: /
amz-sdk-invocation-id: E01AF2D1-E5F1-4469-B02F-76304CFB417F
amz-sdk-request: attempt=1
authorization: AWS4-HMAC-SHA256 Credential=XXX/XXX/ap-southeast-1/s3/aws4_request, SignedHeaders=amz-sdk-invocation-id;amz-sdk-request;content->length;content-md5;content-type;host;x-amz-content-sha256;x-amz-date, Signature=35038b4005c0f8ec52b6084c2cd84461408ae0ed1056e4b3788543e7009e5360
content-length: 10
content-md5: aOEJ8PQMpyoV4FzCJ4b45g==
content-type: binary/octet-stream
user-agent: aws-sdk-cpp/1.9.310 Linux/5.15.0-202.135.2.el9uek.x86_64 x86_64 GCC/11.4.1
x-amz-content-sha256: UNSIGNED-PAYLOAD
x-amz-date: 20240425T125503Z
Expect: 100-continue

Headers are calculated before factual sending of file, therefore data has been read 2 times. One for calculating checksumm, one for factual sending

Reproduction Steps

Standard example from aws sdk with ChecksumAlgorithm::NOT_SET

bool AwsDoc::S3::PutObject(const Aws::String &bucketName,
                           const Aws::String &fileName,
                           const Aws::Client::ClientConfiguration &clientConfig) {
    Aws::S3::S3Client s3_client(clientConfig);

object_request.WithBucket(bucket).WithKey(s3key).WithChecksumAlgorithm(Aws::S3::Model::ChecksumAlgorithm::NOT_SET);

    std::shared_ptr<Aws::IOStream> inputData =
            Aws::MakeShared<Aws::FStream>("SampleAllocationTag",
                                          fileName.c_str(),
                                          std::ios_base::in | std::ios_base::binary);

    if (!*inputData) {
        std::cerr << "Error unable to read file " << fileName << std::endl;
        return false;
    }

    request.SetBody(inputData);

    Aws::S3::Model::PutObjectOutcome outcome =
            s3_client.PutObject(request);

    if (!outcome.IsSuccess()) {
        std::cerr << "Error: PutObject: " <<
                  outcome.GetError().GetMessage() << std::endl;
    }
    else {
        std::cout << "Added object '" << fileName << "' to bucket '"
                  << bucketName << "'.";
    }

    return outcome.IsSuccess();
}

Possible Solution

Doesn't rewrite checksum algorythm as MD5 when is passed as ChecksumAlgorithm::NOT_SET

Additional Information/Context

No response

AWS CPP SDK version used

1.9.310

Compiler and Version used

GCC/11.4.1

Operating System and version

Oracle Linux 9

I suppose original issue is caused by misinterpretation of NOT_SET value. It can be interpreted as explicitly not set and not set just in case, therefore we need to set default value

I propose to add something like ChecksumAlgorithm::DISABLED
which is explicitly disables checksum calculation

Yes, this is a known issue and something we are trying to figure out how to address. Related ticket is #2933.
In short, with the release of additional checksums for S3 in 1.9.x, checksums were made required by default on put, so either old md5 checksum or the new additional checksum must be present. ChecksumAlgorithm::NOT_SET adds some confusion, as it does not mean that no checksum should be used, it means that additional checksum is not set and SDK should fallback to md5.
With that said, all additional checksums are calculated as body is streamed and are sent in the trailer, so the body will only be read once.

Closing as duplicate of this feature request: #2933

This issue is now closed. Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.