proxy: Switch from `storage.delete.enabled=false` to `storage.delete.enabled=true` leaks blobs that never get garbage collected
ialidzhikov opened this issue · 2 comments
Description
We provide a service that deploys and manages pull-through caches.
We use the storage.delete.enabled option to control whether "garbage collection of images" is enabled for a pull-through cache.
On our side this setting is mutable: users can toggle the GC (under the hood, the storage.delete.enabled option) on and off.
However, there is a case where switching this setting doesn't behave as expected. When you run a proxy with storage.delete.enabled=false, it still adds entries to the scheduler-state.json file, and once the ttl has passed it tries to remove the blobs/manifests. With storage.delete.enabled=false the blob doesn't get removed, but the corresponding entry in scheduler-state.json does. Hence the blob can never be garbage collected, even if the proxy is later restarted with storage.delete.enabled=true.
Reproduce
- Run a registry in proxy mode with an upstream and with GC disabled (storage.delete.enabled=false).
- Pull the alpine:3.14.0 image from the registry proxy:
# ctr -n k8s.io images pull docker.io/library/alpine:3.14.0
- Make sure your scheduler-state.json file looks similar to:
{
"library/alpine@sha256:1dc785547989b0db1c3cd9949c57574393e69bea98bfe044b0588e24721aa402": {
"Key": "library/alpine@sha256:1dc785547989b0db1c3cd9949c57574393e69bea98bfe044b0588e24721aa402",
"ExpiryData": "2024-01-17T10:40:19.747823626Z",
"EntryType": 0
},
"library/alpine@sha256:51b67269f354137895d43f3b3d810bfacd3945438e94dc5ac55fdac340352f48": {
"Key": "library/alpine@sha256:51b67269f354137895d43f3b3d810bfacd3945438e94dc5ac55fdac340352f48",
"ExpiryData": "2024-01-17T10:40:18.48521875Z",
"EntryType": 1
},
"library/alpine@sha256:a70bcfbd89c9620d4085f6bc2a3e2eef32e8f3cdf5a90e35a1f95dcbd7f71548": {
"Key": "library/alpine@sha256:a70bcfbd89c9620d4085f6bc2a3e2eef32e8f3cdf5a90e35a1f95dcbd7f71548",
"ExpiryData": "2024-01-17T10:40:18.959229542Z",
"EntryType": 1
},
"library/alpine@sha256:c303524923177661067f7eb378c3dd5277088c2676ebd1cd78e68397bb80fdbf": {
"Key": "library/alpine@sha256:c303524923177661067f7eb378c3dd5277088c2676ebd1cd78e68397bb80fdbf",
"ExpiryData": "2024-01-17T10:40:20.352161126Z",
"EntryType": 0
}
}
- Edit the scheduler-state.json file: make one of the entries expire by setting its ExpiryData field to a time in the past. Let's do it for the library/alpine@sha256:1dc785547989b0db1c3cd9949c57574393e69bea98bfe044b0588e24721aa402 blob.
- Restart the proxy. Once the expiry time has passed, the scheduler tries to delete the blob:
time="2024-01-10T10:45:19.748260501Z" level=error msg="Scheduler error returned from OnExpire(library/alpine@sha256:1dc785547989b0db1c3cd9949c57574393e69bea98bfe044b0588e24721aa402): operation unsupported" go.version=go1.20.8 instance.id=1fceb0ff-4044-4e41-8119-0584e06cdba4 service=registry version=2.8.3
As expected, the blob is not removed because of the storage.delete.enabled=false option. However, the corresponding entry does get removed from scheduler-state.json:
{
"library/alpine@sha256:51b67269f354137895d43f3b3d810bfacd3945438e94dc5ac55fdac340352f48": {
"Key": "library/alpine@sha256:51b67269f354137895d43f3b3d810bfacd3945438e94dc5ac55fdac340352f48",
"ExpiryData": "2024-01-17T10:40:18.48521875Z",
"EntryType": 1
},
"library/alpine@sha256:a70bcfbd89c9620d4085f6bc2a3e2eef32e8f3cdf5a90e35a1f95dcbd7f71548": {
"Key": "library/alpine@sha256:a70bcfbd89c9620d4085f6bc2a3e2eef32e8f3cdf5a90e35a1f95dcbd7f71548",
"ExpiryData": "2024-01-17T10:40:18.959229542Z",
"EntryType": 1
},
"library/alpine@sha256:c303524923177661067f7eb378c3dd5277088c2676ebd1cd78e68397bb80fdbf": {
"Key": "library/alpine@sha256:c303524923177661067f7eb378c3dd5277088c2676ebd1cd78e68397bb80fdbf",
"ExpiryData": "2024-01-17T10:40:20.352161126Z",
"EntryType": 0
}
}
At this point the blob leaks on the file system: the proxy can no longer garbage collect it, even if storage.delete.enabled is set to true in the future.
The corresponding source code is distribution/registry/proxy/scheduler/scheduler.go, lines 176 to 206 (at commit 3dda067).
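To illustrate the failure mode, here is a simplified, stand-alone model of that expiry path (a sketch only, not the actual distribution code): the expiry callback's error is merely logged, and the entry is dropped from the state unconditionally.

```go
package main

import (
	"errors"
	"fmt"
)

// scheduler is a toy model of the TTL scheduler's state.
type scheduler struct {
	entries  map[string]bool        // models the contents of scheduler-state.json
	onExpire func(key string) error // models the onBlobExpire/onManifestExpire callback
}

// expire mirrors the problematic flow described in this issue: the callback
// error is only logged, and the entry is removed regardless of success.
func (s *scheduler) expire(key string) {
	if err := s.onExpire(key); err != nil {
		fmt.Printf("Scheduler error returned from OnExpire(%s): %s\n", key, err)
	}
	delete(s.entries, key) // entry is forgotten even though the blob still exists on disk
}

// remainingAfterFailedExpiry runs the scenario with deletes disabled:
// the callback fails, yet the entry disappears from the state.
func remainingAfterFailedExpiry() int {
	key := "library/alpine@sha256:1dc785547989b0db1c3cd9949c57574393e69bea98bfe044b0588e24721aa402"
	s := &scheduler{
		entries: map[string]bool{key: true},
		onExpire: func(string) error {
			// storage.delete.enabled=false surfaces as this error
			return errors.New("operation unsupported")
		},
	}
	s.expire(key)
	return len(s.entries)
}

func main() {
	fmt.Println("entries left:", remainingAfterFailedExpiry()) // the blob on disk is now untracked
}
```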
Expected behavior
I would expect the switch from storage.delete.enabled=false to storage.delete.enabled=true to garbage collect all blobs/manifests at time.Now() + ttl, where time.Now() is the moment of the switch to storage.delete.enabled=true (the enablement of garbage collection).
registry version
# registry --version
registry github.com/docker/distribution 2.8.3
Additional Info
No response
I also wanted to discuss the storage.delete.enabled field itself. I feel it is not designed for the purpose we use it for: the proxy source code gives the impression that the proxy is meant to always run with GC enabled.
If the proxy has a strict requirement that storage.delete.enabled is always true, this could be validated on startup, and the proxy could fail to start if the field is set to false.
On the other hand, I think it is useful to have a setting that controls whether GC is enabled or not.
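If that strict requirement is indeed the intended design, a startup check along these lines would surface the misconfiguration early (a sketch with made-up type and field names, not the actual configuration package):

```go
package main

import (
	"errors"
	"fmt"
)

// config is a made-up, minimal stand-in for the registry configuration.
type config struct {
	proxyRemoteURL string // non-empty means the registry runs as a pull-through cache
	deleteEnabled  bool   // models storage.delete.enabled
}

// validate fails fast when proxy mode is combined with disabled deletes,
// instead of silently leaking blobs later.
func validate(c config) error {
	if c.proxyRemoteURL != "" && !c.deleteEnabled {
		return errors.New("proxy mode requires storage.delete.enabled=true")
	}
	return nil
}

func main() {
	bad := config{proxyRemoteURL: "https://registry-1.docker.io", deleteEnabled: false}
	fmt.Println("validate:", validate(bad))
}
```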