opencontainers/distribution-spec

performance: consider relaxing chunks should be in order for patched uploads

rchincha opened this issue · 9 comments

The claim is that upload performance is improved when blob chunks are uploaded in parallel.

https://github.com/opencontainers/distribution-spec/blob/main/spec.md#pushing-a-blob-in-chunks

Therefore, the following language must be amended/relaxed.

"Chunks MUST be uploaded in order, with the first byte of a chunk being the last chunk's plus one. If a chunk is uploaded out of order, the registry MUST respond with a 416 Requested Range Not Satisfiable code. A GET request may be used to retrieve the current valid offset and upload location."

Do you have a proposed change to the wording? Even better if there's an example implementation.

I'm worried about how this would impact a streaming push where the source is unable to seek to recover previously sent data. If the client doesn't know a chunk wasn't received correctly, it may discard a chunk from the source after the send attempt and only discover the failure after sending the rest of the blob.

IMO, nothing can be guaranteed on the registry side until the final PUT (?digest=).
The registry may choose to compute the digest internally and then match it against this client-specified digest.

In which case the language could change from ...

"Chunks MUST be uploaded in order, with the first byte of a chunk being the last chunk's plus one. If a chunk is uploaded out of order, the registry MUST respond with a 416 Requested Range Not Satisfiable code. A GET request may be used to retrieve the current valid offset and upload location."

to completely remove this language, or ...

"Chunks MAY be uploaded in any order and the registry MUST respond with a 202 code. A GET request may be used to retrieve the current valid offset and upload location."

  1. Client POST gets a session id, so the upload session is private to that client
  2. Client PATCHes chunks in parallel (so they can go out of order) against the session id from 1. (sketched below)
  3. Client GET will likely return the latest/largest valid offset the registry has, so either the end of the chunk that a given thread uploaded or more
  4. Client PUT with the digest allows the registry to now validate the content
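
Below is a rough, hypothetical client-side sketch of that flow in Go. It assumes a registry that accepts out-of-order PATCHes (a spec-conformant registry today would answer 416); `pushBlobParallel`, the localhost registry URL, and the repo name are illustrative placeholders, and most error handling is trimmed.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"net/http"
	"sync"
)

// Hypothetical out-of-order chunked push. Session URL handling is simplified
// (the Location header may be relative or already carry query parameters).
func pushBlobParallel(registry, repo string, blob []byte, chunkSize int) error {
	// 1. POST opens an upload session; the Location header is the session URL.
	resp, err := http.Post(fmt.Sprintf("%s/v2/%s/blobs/uploads/", registry, repo), "application/octet-stream", nil)
	if err != nil {
		return err
	}
	resp.Body.Close()
	session := resp.Header.Get("Location")

	// 2. PATCH chunks concurrently, each with its own Content-Range,
	//    so they can arrive at the registry out of order.
	var wg sync.WaitGroup
	for start := 0; start < len(blob); start += chunkSize {
		end := start + chunkSize
		if end > len(blob) {
			end = len(blob)
		}
		wg.Add(1)
		go func(start, end int) {
			defer wg.Done()
			req, _ := http.NewRequest(http.MethodPatch, session, bytes.NewReader(blob[start:end]))
			req.Header.Set("Content-Type", "application/octet-stream")
			req.Header.Set("Content-Range", fmt.Sprintf("%d-%d", start, end-1))
			if r, err := http.DefaultClient.Do(req); err == nil {
				r.Body.Close() // a spec-conformant registry today answers 416 here
			}
		}(start, end)
	}
	wg.Wait()

	// 3. A GET on the session could fetch the current valid offset; skipped here.

	// 4. Final PUT with ?digest= lets the registry validate the assembled blob.
	digest := fmt.Sprintf("sha256:%x", sha256.Sum256(blob))
	req, _ := http.NewRequest(http.MethodPut, fmt.Sprintf("%s?digest=%s", session, digest), nil)
	r, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	r.Body.Close()
	return nil
}

func main() {
	blob := bytes.Repeat([]byte("x"), 10<<20) // 10 MiB of dummy data
	_ = pushBlobParallel("http://localhost:5000", "test/demo", blob, 1<<20)
}
```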

This still feels like it's going to break a lot of things to me. I'd like to see some implementations showing both the feasibility and a measured performance improvement.

It seems like "If a chunk is uploaded out of order, the registry MUST respond with a 416 Requested Range Not Satisfiable code." was intended to allow this to be supported at some point (or at least make it really clear that it isn't), so any new wording would have to include something like registries MAY respond with 416 if out of order uploads are not supported (anything else can't really work with existing implementations, especially as they're currently mandated by the spec to explicitly deny this use case).

I would imagine hashing the end result is a lot more annoying if the chunks are out of order, since with them in order the server can compute the digest incrementally as chunks are received (and that's likely the real reason for the current very strong language).
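
To make that concrete, here is a minimal Go sketch (illustrative only, not taken from any registry implementation) of why in-order delivery is convenient for the server:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// With in-order chunks the server can fold each one into a running digest
// as it arrives and never has to hold more than the current chunk.
func digestInOrder(chunks <-chan []byte) string {
	h := sha256.New()
	for c := range chunks { // chunks delivered in offset order
		h.Write(c) // hash.Hash.Write never returns an error
	}
	return fmt.Sprintf("sha256:%x", h.Sum(nil))
}

// With out-of-order chunks the server must either park non-contiguous
// chunks (in memory or on disk) until the gap is filled, or re-read the
// assembled blob from storage at finalize time to compute the digest.

func main() {
	chunks := make(chan []byte, 2)
	chunks <- []byte("hello ")
	chunks <- []byte("world")
	close(chunks)
	fmt.Println(digestInOrder(chunks))
}
```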

The challenges depend on the scenarios:

  • old client, old registry: this is the current state, no change, no impact.
  • new client, old registry: the client needs to know not to try to push concurrent chunks out of order, or handle the error from the registry gracefully.
  • old client, new registry: if chunks are dropped, we need to give the old client expected errors that allow it to recover.
  • new client, new registry: the client needs to be aware that it's talking to a registry that supports concurrent out of order chunks so that it can leverage the functionality.

So the language can be relaxed, but I believe there needs to be communication between the parties to avoid breaking one or more scenarios.

In a DC setting, from benchmarks I have seen, even for large buffers on fairly modern Intel Xeons, per-core sha256 throughput is well below 1 GiB/s (roughly 10 Gbps), and SSD and network speeds will exceed this easily. So since sha256 is computed serially over the stream, hashing can become a bottleneck.
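
For anyone who wants to reproduce that per-core number on their own hardware, a quick-and-dirty Go check (results vary widely with CPU generation and with whether the SHA hardware extensions are in use):

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"time"
)

// Rough single-core SHA-256 throughput check.
func main() {
	buf := make([]byte, 64<<20) // hash a 64 MiB buffer repeatedly for ~2s
	h := sha256.New()
	start := time.Now()
	var total int64
	for time.Since(start) < 2*time.Second {
		h.Write(buf)
		total += int64(len(buf))
	}
	gib := float64(total) / float64(1<<30)
	fmt.Printf("single-core sha256: %.2f GiB/s\n", gib/time.Since(start).Seconds())
}
```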

Note that this may not qualify as the "average" case, but for the large-artifact use case it does make sense to consider it.

Also note that if we want streamed hashing while taking advantage of pipelined chunk uploads, hashing needs to be that much faster to keep up.

For some chunk C_k, k in [1, N] ...

Today,

client pushes a single C_k -> registry hashes C_k -> client waits for the response for C_k before pushing the next chunk, while the registry is also waiting -> lather, rinse, repeat -> client finalizes and registry verifies

Pipeline parallelism,

client pushes many/all C_k in parallel -> registry hashes whichever C_k it has, in sequence, and may stall -> client waits for responses for many C_k while pushing many other C_k, and the registry is kept busy hashing -> lather, rinse, repeat -> client finalizes and registry verifies

If extremely lucky, all C_k arrive at the registry in exact sequence. In the pathological case they arrive in reverse order, so hashing stalls until the very end, which is worse than the serial case.

The registry side has to choose whether to hold these chunks in memory buffers or write them out to storage.
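
A minimal sketch of the in-memory buffering option (the `reassembler` type is hypothetical; staging chunks to storage instead would look similar): out-of-order chunks get parked until the gap before them is filled, and the running hash only advances over contiguous data.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"hash"
)

// Out-of-order chunks are parked in memory (they could just as well be
// staged to storage) and the running hash only advances once the next
// contiguous byte range has arrived.
type reassembler struct {
	h       hash.Hash
	next    int64            // next offset the hasher can consume
	pending map[int64][]byte // out-of-order chunks keyed by start offset
}

func newReassembler() *reassembler {
	return &reassembler{h: sha256.New(), pending: make(map[int64][]byte)}
}

func (r *reassembler) add(offset int64, chunk []byte) {
	r.pending[offset] = chunk
	// Drain whatever is now contiguous; if the next chunk has not arrived
	// we stall here holding buffers, which is the pathological reverse-order case.
	for {
		c, ok := r.pending[r.next]
		if !ok {
			return
		}
		delete(r.pending, r.next)
		r.h.Write(c)
		r.next += int64(len(c))
	}
}

func main() {
	r := newReassembler()
	r.add(6, []byte("world"))  // arrives first, has to be parked
	r.add(0, []byte("hello ")) // fills the gap, both chunks get hashed
	fmt.Printf("sha256:%x\n", r.h.Sum(nil))
}
```

In the reverse-order arrival case, every chunk except the last ends up parked, so memory (or staging storage) grows to nearly the full blob size before any hashing happens.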

The client side also computes the hash for the finalize step, but that is an easier problem since the client has more control.

From the above statements, the bottleneck is the hashing speed. If we had a hash algorithm whose speed is faster than the SSD and the network, sequential uploading (the current method) and parallel uploading (the proposed out-of-order chunk upload) should have the same performance.
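
As a back-of-the-envelope way to see that, treating the upload as a pipeline bounded by its slowest stage (all rates here are hypothetical):

```latex
t_{\text{upload}} \approx \frac{S_{\text{blob}}}{\min(r_{\text{hash}},\ r_{\text{network}},\ r_{\text{storage}})}
```

For example, with made-up numbers, a 10 GiB blob with r_hash = 0.5 GiB/s and r_network = 3 GiB/s is hash-bound at roughly 20 s; if r_hash were 5 GiB/s the same push would be network-bound at roughly 3.3 s, and chunk ordering would no longer change the bound.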

From the above statements, the bottleneck is the hashing speed

^ but that is assuming the registry wants to hash. IIRC, some registries just accept whatever is sent their way, so that is a secondary concern. So the question now is: can parallel range uploads be faster somehow?