[Bug]: Can't pull large image through the on demand sync
AlbanBedel opened this issue · 5 comments
zot version
v2.0.4
Describe the bug
When pulling a large image index (aka a multi-arch image), a timeout is hit after 50 seconds and the sync is cancelled. In older versions of zot the download would continue in the background, so one would eventually get the image. Now the background sync is also cancelled and it is impossible to get the image.
To reproduce
- Configure the sync extension with one registry with onDemand: true (see the config sketch after this list)
- Try to inspect or pull an image with skopeo that takes more than 50 seconds to download
- skopeo aborts with:
received unexpected HTTP status: 504 Gateway Time-out
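A minimal sync config sketch for this kind of setup (the remote URL matches the one in the zot log below; the remaining fields and values are assumptions, not a copy of the actual config):
{
  "extensions": {
    "sync": {
      "enable": true,
      "registries": [
        {
          "urls": ["https://git.example.com:4567"],
          "onDemand": true,
          "tlsVerify": true
        }
      ]
    }
  }
}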
Expected behavior
It should be possible to download large images in pass-through mode. Ideally it should also be possible to get the image manifest without having to wait for the full image to be downloaded.
Screenshots
No response
Additional context
I can download the image directly from the original registry with skopeo without any issue, so I don't think the timeout is in skopeo or in the GitLab registry.
Skopeo log (domain replaced with example.com):
DEBU[0000] GET https://registry.example.com/v2/
DEBU[0000] Ping https://registry.example.com/v2/ status 200
DEBU[0000] GET https://registry.example.com/v2/embedded-sw/meta-lros/testing/manifests/2024.7.6
DEBU[0050] Content-Type from manifest GET is "text/html"
DEBU[0050] Accessing "registry.example.com/embedded-sw/meta-lros/testing:2024.7.6" failed: reading manifest 2024.7.6 in registry.example.com/embedded-sw/meta-lros/testing: received unexpected HTTP status: 504 Gateway Time-out
FATA[0050] Error parsing image name "docker://registry.example.com/embedded-sw/meta-lros/testing:2024.7.6": reading manifest 2024.7.6 in registry.example.com/embedded-sw/meta-lros/testing: received unexpected HTTP status: 504 Gateway Time-out
Zot log:
{"level":"info","module":"http","component":"session","clientIP":"10.42.0.96:33652","method":"GET","path":"/v2/","statusCode":200,"latency":"0s","bodySize":0,"headers":{"Accept-Encoding":["gzip"],"Docker-Distribution-Api-Version":["registry/2.0"],"User-Agent":["skopeo/1.13.2-dev"],"X-Forwarded-For":["10.42.0.1"],"X-Forwarded-Proto":["https"]},"goroutine":58,"caller":"zotregistry.dev/zot/pkg/api/session.go:132","time":"2024-05-15T15:20:47.245754953Z","message":"HTTP API"}
{"level":"info","repository":"embedded-sw/meta-lros/testing","reference":"2024.7.6","goroutine":58,"caller":"zotregistry.dev/zot/pkg/api/routes.go:1913","time":"2024-05-15T15:20:47.246517473Z","message":"trying to get updated image by syncing on demand"}
{"level":"info","goroutine":69,"caller":"zotregistry.dev/zot/pkg/extensions/sync/service.go:481","time":"2024-05-15T15:20:47.246594127Z","message":"getting available client"}
{"level":"info","remote":"https://git.example.com:4567","repository":"embedded-sw/meta-lros/testing","reference":"2024.7.6","goroutine":69,"caller":"zotregistry.dev/zot/pkg/extensions/sync/service.go:283","time":"2024-05-15T15:20:47.25352728Z","message":"syncing image"}
{"level":"info","remote image":"git.example.com:4567/embedded-sw/meta-lros/testing:2024.7.6","local image":"embedded-sw/meta-lros/testing:2024.7.6","goroutine":69,"caller":"zotregistry.dev/zot/pkg/extensions/sync/service.go:438","time":"2024-05-15T15:20:47.429342485Z","message":"syncing image"}
{"level":"error","error":"copying image 1/4 from manifest list: writing blob: happened during read: context canceled","errortype":"*fmt.wrapError","remote image":"git.example.com:4567/embedded-sw/meta-lros/testing:2024.7.6","local image":"embedded-sw/meta-lros/testing:2024.7.6","goroutine":69,"caller":"zotregistry.dev/zot/pkg/extensions/sync/service.go:451","time":"2024-05-15T15:21:37.301336505Z","message":"coulnd't sync image"}
{"level":"error","error":"copying image 1/4 from manifest list: writing blob: happened during read: context canceled","repository":"embedded-sw/meta-lros/testing","reference":"2024.7.6","goroutine":58,"caller":"zotregistry.dev/zot/pkg/api/routes.go:1917","time":"2024-05-15T15:21:37.301603165Z","message":"failed to sync image"}
{"level":"info","module":"http","component":"session","clientIP":"10.42.0.96:33652","method":"GET","path":"/v2/embedded-sw/meta-lros/testing/manifests/2024.7.6","statusCode":404,"latency":"50s","bodySize":218,"headers":{"Accept":["application/vnd.oci.image.manifest.v1+json","application/vnd.docker.distribution.manifest.v2+json","application/vnd.docker.distribution.manifest.v1+prettyjws","application/vnd.docker.distribution.manifest.v1+json","application/vnd.docker.distribution.manifest.list.v2+json","application/vnd.oci.image.index.v1+json"],"Accept-Encoding":["gzip"],"Docker-Distribution-Api-Version":["registry/2.0"],"User-Agent":["skopeo/1.13.2-dev"],"X-Forwarded-For":["10.42.0.1"],"X-Forwarded-Proto":["https"]},"goroutine":58,"caller":"zotregistry.dev/zot/pkg/api/session.go:132","time":"2024-05-15T15:21:37.302120796Z","message":"HTTP API"}
Hello @AlbanBedel
Ok, I see. This should be easy to fix; thank you for raising this one. I'll come back with a fix.
Ideally it should also be possible to get the image manifest without having to wait for the full image to be downloaded.
^ this one is work in progress.
Thank you!
@AlbanBedel thanks for trying out zot and reporting this issue.
Ok, I tried to reproduce this one but it doesn't reproduce.
I tried with this image, which is ~4 GB:
skopeo inspect --tls-verify=false docker://localhost:8090/cimg/android:2024.04.1-ndk
Because I'm on a slow VPN it took 15 minutes to sync, so there is no timeout in the sync config that would close the connection.
The context-cancelled error can happen only if the user cancels the on-demand process (skopeo inspect in your case), if the client hits a timeout and bails out, or if the zot config file is updated and zot internally restarts all its components, sync included.
And because of the 504 error, I think you have an HTTP proxy in front of zot, like nginx? In that case you must increase its timeout configuration.
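For example, with nginx the relevant directives are the proxy timeouts; a sketch (the upstream address and the 300s values are placeholders, not taken from any real deployment):
location /v2/ {
    # forward registry API calls to zot
    proxy_pass http://127.0.0.1:5000;
    # allow the manifest GET to wait while zot syncs the image on demand
    proxy_read_timeout 300s;
    proxy_send_timeout 300s;
}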
You are right, my cluster uses the haproxy ingress, which has a 50s inactivity timeout by default, so this bug can be closed. Nonetheless, having an HTTP connection idle for such a long time is bound to create various problems, so I still hope this will be addressed at some point.
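For anyone hitting the same setup, the relevant knobs in a plain haproxy.cfg are the inactivity timeouts; a sketch (5m is an arbitrary example, and with the haproxy ingress controller these are usually set through its ConfigMap or annotations rather than by editing haproxy.cfg directly):
defaults
    # raise the ingress default of 50s so the proxied manifest GET can wait for the on-demand sync
    timeout client  5m
    timeout server  5m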
Thank you! Yes, we will address that.