Azure/blobporter

Content-MD5 .... not quite there...

udf2457 opened this issue · 6 comments

re: #51

Sorry to flag this one up again, but "-m" does not appear to be working as advertised.

$ blobporter -q -m -c test -f a -n test
BlobPorter 
Copyright (c) Microsoft Corporation. 
Version: 0.5.02
---------------
Transfer Task: file-blockblob
Files to Transfer:
Source: a Size:2 

The process took 124.574686ms to run.
Throughput: 0.00 MB/s (0.00 Mb/s) 
Configuration: R=24, W=36, DataSize=2KiB (2), Blocks=1
Cumulative Writes Duration: Total=28.659118ms, Avg Per Worker=796.086µs
Retries: Avg=0 Total=0

Yields:

<?xml version="1.0" encoding="utf-8"?>
<EnumerationResults ServiceEndpoint="https://my***account.blob.core.windows.net/" ContainerName="test">
<Blobs>
    <Blob>
        <Name>test</Name>
        <Properties>
            <Last-Modified>Sat, 13 May 2017 09:10:33 GMT</Last-Modified>
            <Etag>0x8D499DFE8F46B23</Etag>
            <Content-Length>2</Content-Length>
            <Content-Type>application/octet-stream</Content-Type>
            <Content-Encoding/>
            <Content-Language/>
            <Content-MD5/>
            <Cache-Control/>
            <Content-Disposition/>
            <BlobType>BlockBlob</BlobType>
            <LeaseStatus>unlocked</LeaseStatus>
            <LeaseState>available</LeaseState>
        </Properties>
    </Blob>
</Blobs>
<NextMarker/>
</EnumerationResults>

As you can see, the Content-MD5 element is empty, which would not be the case if you were really sending MD5s.

I suspect what might be happening is:

  • You are (hopefully) sending Content-MD5 when sending blocks via Put Block
  • You are (hopefully) sending Content-MD5 when sending Put Block List
  • You are (probably) forgetting x-ms-blob-content-md5 when sending Put Block List
  • The same applies for Put Blob
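To make the distinction in the list above concrete, here is a minimal sketch of the two headers on a Put Block List request. The helper names (`md5_b64`, `put_block_list_headers`) are hypothetical and are not taken from BlobPorter's code; the sketch only assumes that both headers carry the base64 encoding of the raw MD5 digest.

```python
import base64
import hashlib


def md5_b64(data: bytes) -> str:
    """Return the base64 encoding of the raw 16-byte MD5 digest."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")


def put_block_list_headers(request_body: bytes, blob_content: bytes) -> dict:
    """Build the two MD5-related headers (hypothetical helper, for illustration)."""
    return {
        # Hash of the request body (the XML block list); covers this request in transit.
        "Content-MD5": md5_b64(request_body),
        # Hash of the whole blob; per the docs it is stored but not validated,
        # and it is what later shows up as the blob's Content-MD5 property.
        "x-ms-blob-content-md5": md5_b64(blob_content),
    }
```

If only the first header is sent, the request itself is protected, but the blob's Content-MD5 property ends up empty, which matches the listing above.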

From the docs (https://docs.microsoft.com/en-us/rest/api/storageservices/put-block-list):

x-ms-blob-content-md5: Optional. An MD5 hash of the blob content. Note that this hash is not validated, as the hashes for the individual blocks were validated when each was uploaded.

The Get Blob operation returns the value of this header in the Content-MD5 response header.

If this property is not specified with the request, then it is cleared for the blob if the request is successful.

Thanks for the follow up. BlobPorter uses blocks for all transfers; this is what allows the high level of concurrency and maximizes throughput. The current implementation calculates a block-level MD5 (option -m), which addresses the concern of data integrity during transfer. This value is validated by the storage backend.
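As a rough illustration of why block-level hashing fits a concurrent transfer, the sketch below hashes each block independently, so the work can be spread across workers. It is not BlobPorter's implementation; the function names, the block size, and the worker count are assumptions made for the example.

```python
import base64
import hashlib
from concurrent.futures import ThreadPoolExecutor
from typing import List


def split_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut data into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def block_md5(block: bytes) -> str:
    """Base64-encoded MD5 of a single block, the form a Content-MD5 header uses."""
    return base64.b64encode(hashlib.md5(block).digest()).decode("ascii")


def block_md5s(data: bytes, block_size: int, workers: int = 4) -> List[str]:
    """Hash every block; each hash depends only on its own block."""
    blocks = split_blocks(data, block_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so result i belongs to block i.
        return list(pool.map(block_md5, blocks))
```

A whole-blob MD5 cannot be assembled from these per-block values; it requires one pass over all of the bytes in order.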

As you've pointed out, the Content-MD5 for the whole blob is not validated by Azure Storage. Given this, and the fact that computing a blob-wide MD5 would be a resource-intensive pre-processing step (it must be done sequentially and prior to the transfer), it would provide little value while increasing the overall transfer time.

As an alternative, we are considering an approach where you pre-calculate the MD5 with the tool of your choice prior to the transfer and then pass it to BlobPorter. In effect, this treats the value as a metadata item, which is what it technically becomes when it is not validated by the backend.

So basically keep -m and add an additional metadata parameter for the whole-blob version? Sounds fair enough.

Correct, where the whole-blob version will be calculated outside BlobPorter, e.g. $ md5sum file.
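One detail worth keeping in mind when pre-calculating the value: md5sum prints the digest as hex, while the Content-MD5 property holds the base64 encoding of the raw digest. The following is a minimal sketch of both routes, with hypothetical helper names and an arbitrary chunk size; how the result would be passed to BlobPorter is not covered here.

```python
import base64
import hashlib


def file_md5_b64(path: str, chunk_size: int = 4 * 1024 * 1024) -> str:
    """Stream a file through MD5 sequentially and return the base64 digest."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large files never have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return base64.b64encode(md5.digest()).decode("ascii")


def md5_hex_to_b64(hex_digest: str) -> str:
    """Convert a hex digest (as printed by md5sum) to its base64 form."""
    return base64.b64encode(bytes.fromhex(hex_digest)).decode("ascii")
```

Either function gives the same string for the same file, so the hex output of an existing md5sum run can be converted without rereading the data.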

Updating this issue to point to a project that addresses the gap of not being able to calculate the MD5 hash for multi-block blobs: https://github.com/giventocode/azure-blob-md5

ankku commented

Just to clarify here: I am using a Logic App and have called the action get metadata for blob, and I get these properties:

{
"Id": "Jtestakajsdjas==",
"Name": "test.json",
"DisplayName": "test.json",
"Path": "/resistor-v3/test.json",
"LastModified": "test",
"Size": 2480,
"MediaType": "application/octet-stream",
"IsFolder": false,
"ETag": """",
"FileLocator": "testskjhska",
"LastModifiedBy": null
}

I don't see a Content-MD5 property in here, though I can see it when I go to my blob and right-click to open its properties. Is this the default behavior of Logic Apps? How can I get the ContentMD5 property?