[Bug]: Can't pull large image through the on demand sync
AlbanBedel opened this issue · 5 comments
zot version
v2.0.4
Describe the bug
When pulling a large image index (aka a multi-arch image), a timeout is hit after 50 seconds and the sync is cancelled. In older versions of zot the download would continue in the background, so one would eventually get the image. Now the background sync is also cancelled and it is impossible to get the image.
To reproduce
- Configure the sync extension with one registry with onDemand: true (see the config sketch after this list)
- Try to inspect or pull an image with skopeo that takes more than 50 seconds to download
- skopeo aborts with:
received unexpected HTTP status: 504 Gateway Time-out
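A minimal sync config sketch for this kind of setup (the remote URL matches the one in the zot log below; the remaining fields and values are assumptions, not a copy of the actual config):
{
  "extensions": {
    "sync": {
      "enable": true,
      "registries": [
        {
          "urls": ["https://git.example.com:4567"],
          "onDemand": true,
          "tlsVerify": true
        }
      ]
    }
  }
}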
Expected behavior
It should be possible to download large images in pass-through mode. Ideally it should also be possible to get the image manifest without having to wait for the full image to be downloaded.
Screenshots
No response
Additional context
I can download the image directly from the original registry with skopeo without any issue, so I don't think the timeout is in skopeo or in the GitLab registry.
Skopeo log (domain replaced with example.com):
DEBU[0000] GET https://registry.example.com/v2/
DEBU[0000] Ping https://registry.example.com/v2/ status 200
DEBU[0000] GET https://registry.example.com/v2/embedded-sw/meta-lros/testing/manifests/2024.7.6
DEBU[0050] Content-Type from manifest GET is "text/html"
DEBU[0050] Accessing "registry.example.com/embedded-sw/meta-lros/testing:2024.7.6" failed: reading manifest 2024.7.6 in registry.example.com/embedded-sw/meta-lros/testing: received unexpected HTTP status: 504 Gateway Time-out
FATA[0050] Error parsing image name "docker://registry.example.com/embedded-sw/meta-lros/testing:2024.7.6": reading manifest 2024.7.6 in registry.example.com/embedded-sw/meta-lros/testing: received unexpected HTTP status: 504 Gateway Time-out
Zot log:
{"level":"info","module":"http","component":"session","clientIP":"10.42.0.96:33652","method":"GET","path":"/v2/","statusCode":200,"latency":"0s","bodySize":0,"headers":{"Accept-Encoding":["gzip"],"Docker-Distribution-Api-Version":["registry/2.0"],"User-Agent":["skopeo/1.13.2-dev"],"X-Forwarded-For":["10.42.0.1"],"X-Forwarded-Proto":["https"]},"goroutine":58,"caller":"zotregistry.dev/zot/pkg/api/session.go:132","time":"2024-05-15T15:20:47.245754953Z","message":"HTTP API"}
{"level":"info","repository":"embedded-sw/meta-lros/testing","reference":"2024.7.6","goroutine":58,"caller":"zotregistry.dev/zot/pkg/api/routes.go:1913","time":"2024-05-15T15:20:47.246517473Z","message":"trying to get updated image by syncing on demand"}
{"level":"info","goroutine":69,"caller":"zotregistry.dev/zot/pkg/extensions/sync/service.go:481","time":"2024-05-15T15:20:47.246594127Z","message":"getting available client"}
{"level":"info","remote":"https://git.example.com:4567","repository":"embedded-sw/meta-lros/testing","reference":"2024.7.6","goroutine":69,"caller":"zotregistry.dev/zot/pkg/extensions/sync/service.go:283","time":"2024-05-15T15:20:47.25352728Z","message":"syncing image"}
{"level":"info","remote image":"git.example.com:4567/embedded-sw/meta-lros/testing:2024.7.6","local image":"embedded-sw/meta-lros/testing:2024.7.6","goroutine":69,"caller":"zotregistry.dev/zot/pkg/extensions/sync/service.go:438","time":"2024-05-15T15:20:47.429342485Z","message":"syncing image"}
{"level":"error","error":"copying image 1/4 from manifest list: writing blob: happened during read: context canceled","errortype":"*fmt.wrapError","remote image":"git.example.com:4567/embedded-sw/meta-lros/testing:2024.7.6","local image":"embedded-sw/meta-lros/testing:2024.7.6","goroutine":69,"caller":"zotregistry.dev/zot/pkg/extensions/sync/service.go:451","time":"2024-05-15T15:21:37.301336505Z","message":"coulnd't sync image"}
{"level":"error","error":"copying image 1/4 from manifest list: writing blob: happened during read: context canceled","repository":"embedded-sw/meta-lros/testing","reference":"2024.7.6","goroutine":58,"caller":"zotregistry.dev/zot/pkg/api/routes.go:1917","time":"2024-05-15T15:21:37.301603165Z","message":"failed to sync image"}
{"level":"info","module":"http","component":"session","clientIP":"10.42.0.96:33652","method":"GET","path":"/v2/embedded-sw/meta-lros/testing/manifests/2024.7.6","statusCode":404,"latency":"50s","bodySize":218,"headers":{"Accept":["application/vnd.oci.image.manifest.v1+json","application/vnd.docker.distribution.manifest.v2+json","application/vnd.docker.distribution.manifest.v1+prettyjws","application/vnd.docker.distribution.manifest.v1+json","application/vnd.docker.distribution.manifest.list.v2+json","application/vnd.oci.image.index.v1+json"],"Accept-Encoding":["gzip"],"Docker-Distribution-Api-Version":["registry/2.0"],"User-Agent":["skopeo/1.13.2-dev"],"X-Forwarded-For":["10.42.0.1"],"X-Forwarded-Proto":["https"]},"goroutine":58,"caller":"zotregistry.dev/zot/pkg/api/session.go:132","time":"2024-05-15T15:21:37.302120796Z","message":"HTTP API"}
Hello @AlbanBedel
Ok, I see. This should be easy to fix; thank you for raising this one. I'll come back with a fix.
Ideally it should also be possible to get the image manifest without having to wait for the full image to be downloaded.
^ this one is work in progress.
Thank you!
@AlbanBedel thanks for trying out zot and reporting this issue.
Ok, I tried to reproduce this one but it doesn't reproduce.
I tried with this image, which is ~4 GB:
skopeo inspect --tls-verify=false docker://localhost:8090/cimg/android:2024.04.1-ndk
Because I'm on a slow VPN it took 15 minutes to sync, so there is no timeout in the sync config that would close the connection.
The context-cancelled error can happen only if the user cancels the on-demand process (skopeo inspect in your case), if the client hits a timeout and bails out, or if the zot config file is updated and zot internally restarts all its components, sync included.
And because of the 504 error, I think you have an HTTP proxy in front of zot, like nginx? In that case you must increase its timeout configuration.
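For example, with nginx the relevant directives are the proxy timeouts; a sketch (the upstream address and the 300s values are placeholders, not taken from any real deployment):
location /v2/ {
    # forward registry API calls to zot
    proxy_pass http://127.0.0.1:5000;
    # allow the manifest GET to wait while zot syncs the image on demand
    proxy_read_timeout 300s;
    proxy_send_timeout 300s;
}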
You are right, my cluster uses the haproxy ingress, which has a 50s inactivity timeout by default, so this bug can be closed. Nonetheless, having an HTTP connection idle for such a long time is bound to create various problems, so I still hope this will be addressed at some point.
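For anyone hitting the same setup, the relevant knobs in a plain haproxy.cfg are the inactivity timeouts; a sketch (5m is an arbitrary example, and with the haproxy ingress controller these are usually set through its ConfigMap or annotations rather than by editing haproxy.cfg directly):
defaults
    # raise the ingress default of 50s so the proxied manifest GET can wait for the on-demand sync
    timeout client  5m
    timeout server  5m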
Thank you! Yes, we will address that.