google/go-containerregistry

ContentLength mismatch in blob upload

deitch opened this issue · 13 comments

I am having an intermittent issue with ContentLength mismatch. It may be registry-side related, I cannot reproduce it reliably, but if I call a lot of pushes, it does show up with some regularity.

I ran with debug logging enabled, here is what I got, when pushing an image from linuxkit, filtered some of the irrelevant data out:

INFO[0555] index.docker.io/linuxkit/containerd@sha256:5dd3ad98ff8da74de6ceb9c609bb0b7302f4b305663507c38efc56b243b34c6f: digest: sha256:5dd3ad98ff8da74de6ceb9c609bb0b7302f4b305663507c38efc56b243b34c6f size: 1311
DEBU[0555] --> PUT https://index.docker.io/v2/linuxkit/containerd/manifests/cc02c2af9c928c2faeccbe4edc78bd297ad91866
DEBU[0555] PUT /v2/linuxkit/containerd/manifests/cc02c2af9c928c2faeccbe4edc78bd297ad91866 HTTP/1.1
DEBU[0555] Host: index.docker.io
DEBU[0555] User-Agent: go-containerregistry/v0.4.1-0.20210208222243-cbafe638a7a9
DEBU[0555] Content-Length: 1821
DEBU[0555] Authorization: <redacted>
DEBU[0555] Content-Type: application/vnd.oci.image.index.v1+json
DEBU[0555] Accept-Encoding: gzip
DEBU[0555]
DEBU[0555] ... lots of json  ...
DEBU[0555] <-- 401 https://index.docker.io/v2/linuxkit/containerd/manifests/cc02c2af9c928c2faeccbe4edc78bd297ad91866 (156.197508ms)
DEBU[0555] HTTP/1.1 401 Unauthorized
DEBU[0555] Connection: close
DEBU[0555] Content-Length: 240
DEBU[0555] Content-Type: application/json
DEBU[0555] Date: Wed, 28 Apr 2021 09:15:23 GMT
DEBU[0555] Docker-Distribution-Api-Version: registry/2.0
DEBU[0555] Www-Authenticate: Bearer realm="https://auth.docker.io/token",service="registry.docker.io",scope="repository:linuxkit/containerd:pull,push",error="invalid_token"
DEBU[0555]
DEBU[0555] {"errors":[{"code":"UNAUTHORIZED","message":"authentication required","detail":[{"Type":"repository","Class":"","Name":"linuxkit/containerd","Action":"pull"},{"Type":"repository","Class":"","Name":"linuxkit/containerd","Action":"push"}]}]}
DEBU[0555]
DEBU[0555] --> GET https://auth.docker.io/token?scope=repository%3Alinuxkit%2Fcontainerd%3Apush%2Cpull&scope=repository%3Alinuxkit%2Fcontainerd%3Apull%2Cpush&service=registry.docker.io [body redacted: basic token response contains credentials]
DEBU[0555] GET /token?scope=repository%3Alinuxkit%2Fcontainerd%3Apush%2Cpull&scope=repository%3Alinuxkit%2Fcontainerd%3Apull%2Cpush&service=registry.docker.io HTTP/1.1
DEBU[0555] Host: auth.docker.io
DEBU[0555] User-Agent: go-containerregistry/v0.4.1-0.20210208222243-cbafe638a7a9
DEBU[0555] Authorization: <redacted>
DEBU[0555] Accept-Encoding: gzip
DEBU[0555]
DEBU[0555]
DEBU[0556] <-- 200 https://auth.docker.io/token?scope=repository%3Alinuxkit%2Fcontainerd%3Apush%2Cpull&scope=repository%3Alinuxkit%2Fcontainerd%3Apull%2Cpush&service=registry.docker.io (695.513841ms) [body redacted: basic token response contains credentials]
DEBU[0556] HTTP/1.1 200 OK
DEBU[0556] Transfer-Encoding: chunked
DEBU[0556] Content-Type: application/json
DEBU[0556] Date: Wed, 28 Apr 2021 09:15:24 GMT
DEBU[0556] Strict-Transport-Security: max-age=31536000
DEBU[0556]
DEBU[0556]
DEBU[0556] --> PUT https://index.docker.io/v2/linuxkit/containerd/manifests/cc02c2af9c928c2faeccbe4edc78bd297ad91866
DEBU[0556] Failed to dump request PUT https://index.docker.io/v2/linuxkit/containerd/manifests/cc02c2af9c928c2faeccbe4edc78bd297ad91866: http: ContentLength=1821 with Body length 0
DEBU[0556] <-- http: ContentLength=1821 with Body length 0 PUT https://index.docker.io/v2/linuxkit/containerd/manifests/cc02c2af9c928c2faeccbe4edc78bd297ad91866 (553.890807ms)
Put "https://index.docker.io/v2/linuxkit/containerd/manifests/cc02c2af9c928c2faeccbe4edc78bd297ad91866": http: ContentLength=1821 with Body length 0

It tries to put a manifest, gets a 401, goes to authorize, goes back with it, and gets an error. It doesn't look like it is an http error, but I cannot tell from the logs.

Interestingly, that error line looks like it comes from here, but based on that code, it should not error out. Yet in this case it definitely does with a non-zero exit code (which I wish I had thought to capture).

When I reran, it went just fine:

DEBU[0023] --> PUT https://index.docker.io/v2/linuxkit/containerd/manifests/cc02c2af9c928c2faeccbe4edc78bd297ad91866
DEBU[0023] PUT /v2/linuxkit/containerd/manifests/cc02c2af9c928c2faeccbe4edc78bd297ad91866 HTTP/1.1
DEBU[0023] Host: index.docker.io
DEBU[0023] User-Agent: go-containerregistry/v0.4.1-0.20210208222243-cbafe638a7a9
DEBU[0023] Content-Length: 1821
DEBU[0023] Authorization: <redacted>
DEBU[0023] Content-Type: application/vnd.oci.image.index.v1+json
DEBU[0023] Accept-Encoding: gzip
DEBU[0023]
DEBU[0023] ... lots of json content ...
DEBU[0023] <-- 201 https://index.docker.io/v2/linuxkit/containerd/manifests/cc02c2af9c928c2faeccbe4edc78bd297ad91866 (316.634923ms)
DEBU[0023] HTTP/1.1 201 Created
DEBU[0023] Content-Length: 0
DEBU[0023] Date: Wed, 28 Apr 2021 09:29:33 GMT
DEBU[0023] Docker-Content-Digest: sha256:ebd295492eff4e096bfeaa4f09cc4013c629a72c41056afee26a49419b47a75d
DEBU[0023] Docker-Distribution-Api-Version: registry/2.0
DEBU[0023] Location: https://index.docker.io/v2/linuxkit/containerd/manifests/sha256:ebd295492eff4e096bfeaa4f09cc4013c629a72c41056afee26a49419b47a75d
DEBU[0023] Strict-Transport-Security: max-age=31536000
DEBU[0023]
INFO[0023] linuxkit/containerd:cc02c2af9c928c2faeccbe4edc78bd297ad91866: digest: sha256:ebd295492eff4e096bfeaa4f09cc4013c629a72c41056afee26a49419b47a75d size: 1821

Hmm, I notice the failure to PUT the index occurred after pushing lots of other sub-manifests. It seems to have carried over the auth from one to the next for the others, but not for this one, hence the 401 and auth requirement.

In the retry, it first does an index at GET https://index.docker.io/v2/, which forces an auth, then does a bunch of HEAD for the sub-manifests, and then PUT for the index. None of the steps after the index in this case forced a re-auth. Related? Red herring? 🤷‍♂️

The reauth theory is pretty good actually.

If we hit an auth challenge, we perform the token exchange and retry the request in here.

I'm guessing if there's a request body, the body gets consumed or closed during the token handshake, and the subsequent retry fails. We should probably stash the body somewhere before retrying -- not sure how to do this most idiomatically.

I can't remember if I even tried to fix this. Do you still hit this ever?

This issue is stale because it has been open for 90 days with no
activity. It will automatically close after 30 more days of
inactivity. Keep fresh with the 'lifecycle/frozen' label.

Coming back to this, as I keep coming across it. I often will retry, and then it goes fine, or sometimes not.

Can this be a retry issue? I looked at the Moby source code, which just does 5 retries on errors, see here. I am not quite sure how our retries work?

I believe the problem is that we often retry failed requests without considering whether the body of the request has been consumed.

For blob uploads, we actually wrap the whole blob upload in some retry logic, so we probably mask that issue when we just restart the upload from scratch.

For manifest uploads, this was added somewhat recently: #1041

What version are you using for this?

It may also be the case that whatever error we're hitting because of the consumed request body doesn't pass our test of "shouldRetry"...

My theory is:

  1. Original request gets 401 and has to reauth, here.
  2. After we reauth, we retry that request, here.
  3. The body of the request in has already been consumed, so the second request fails.

Looking at http godoc, there are some interesting things to tug on: https://pkg.go.dev/net/http#Transport

Transport only retries a request upon encountering a network error if the request is idempotent and either has no body or has its Request.GetBody defined. HTTP requests are considered idempotent if they have HTTP methods GET, HEAD, OPTIONS, or TRACE; or if their Header map contains an "Idempotency-Key" or "X-Idempotency-Key" entry. If the idempotency key value is a zero-length slice, the request is treated as idempotent but the header is not sent on the wire.

We could use similar logic to the stdlib and start setting the GetBody method on our outgoing requests, as well as idempotency key headers, see here.

(This would have the bonus effect of getting the http stdlib transport to help with retrying requests)

We can do something similar to rewindBody after we auth: https://github.com/golang/go/blob/dbd3cf884986c88f5b3350709c0f51fa02330805/src/net/http/transport.go#L655-L676

For manifest uploads

Actually, the majority of the errors I see (anecdotally) are around manifests/indexes, not blobs. Everything is wrapped up in remote.Write() or remote.WriteIndex(), so I am working completely from memory and from log messages, but I am fairly confident that is the majority of what I am seeing.

this was added somewhat recently
What version are you using?

My go.mod shows v0.4.1-0.20210208222243-cbafe638a7a9 so clearly earlier than the above. I will try an update in any case to see if it helps.

My theory is:

I like that theory. Those messages when it fails (not all of the time, but most of it), with:

ContentLength=1821 with Body length 0

fits pretty well with the theory.

I spent more time than I probably should have digging into https://github.com/golang/go/blob/6178d25fc0b28724b1b5aec2b1b74fc06d9294c7/src/net/http/transfer.go#L388-L391

I cannot actually reproduce this failure to confirm my theory in any meaningful way with a unit test, but I did learn that golang will automagically set GetBody for certain request body types (which we fall into) so that the body can be rewound: https://github.com/golang/go/blob/6178d25fc0b28724b1b5aec2b1b74fc06d9294c7/src/net/http/request.go#L890-L912

Unfortunately, it won't replay requests that aren't "safe" unless you set a magic header: https://github.com/golang/go/blob/4d8db00641cc9ff4f44de7df9b8c4f4a4f9416ee/src/net/http/request.go#L1415-L1429

I'm going to do this for now for our PUT requests, which might help a bit?

Hey, @jonjohnsonjr did you end up implementing what you suggested above? (Adding the header to the PUT request?)
In imgpkg we are experiencing this issue when we are copying bundles that contain multiple large images to the registry. My assumption is that in the middle of the requests we do get a reauth and that is causing the same error reported here.

I did not! I think wrapping the the PUT in a retry solved this ~mostly:

tryUpload := func() error {

It is probably worth adding the header, if you're still seeing those failures. Do you have any way to repro or have logs?

Hey,
This is the amount of logs I have right now

2023/02/08 19:43:47 <-- net/http: timeout awaiting response headers PUT https://my.registry.io/v2/my-package/manifests/sha256-423d64a60197ecf4bab1283df883a902b5fea8efff08e4c1dd2cb8e1151dc195.imgpkg (30.445465629s)
2023/02/08 19:43:48 --> PUT https://my.registry.io/v2/my-package/manifests/sha256-423d64a60197ecf4bab1283df883a902b5fea8efff08e4c1dd2cb8e1151dc195.imgpkg
2023/02/08 19:43:48 Failed to dump request PUT https://my.registry.io/v2/my-package/manifests/sha256-423d64a60197ecf4bab1283df883a902b5fea8efff08e4c1dd2cb8e1151dc195.imgpkg: http: ContentLength=915 with Body length 0
2023/02/08 19:43:48 <-- http: ContentLength=915 with Body length 0 PUT https://my.registry.io/v2/my-package/manifests/sha256-423d64a60197ecf4bab1283df883a902b5fea8efff08e4c1dd2cb8e1151dc195.imgpkg (499.610204ms)
2023/02/08 19:43:48 <-- context canceled PUT https://my.registry.io/v2/my-package/manifests/sha256-db1299f35bf28ac1d90a4dd2c8398d398be7e744e879b4df870eddf4a0b065f6.imgpkg (25.670816509s)
2023/02/08 19:43:48 <-- context canceled PUT https://my.registry.io/v2/my-package/manifests/sha256-dbf0dedb6848ad8a7704c1c19465a1ddae9039b0e63c1dd0df3e2ed9cbda6093.imgpkg (884.515058ms)
2023/02/08 19:43:48 <-- context canceled PUT https://my.registry.io/v2/my-package/manifests/sha256-b93ccb81309e767f9850ad90dac5986c25a86822dda0c8faee48bf0ab7c8f353.imgpkg (10.506871586s)
2023/02/08 19:43:48 <-- context canceled PUT https://my.registry.io/v2/my-package/manifests/sha256-19d16a033e9605444f17e6066652ffbf4b10cef84e6e5846ecef78730cf4f2a0.imgpkg (20.406599514s)
copy |
copy | done uploading images

Thanks -- what version of go-containerregistry is that with?

Sorry I dropped the ball on this. We were fairly up to date so I think it was 0.13.0