Registry keeps crashing after starting "purgeuploads.go"
MXGong opened this issue · 30 comments
Description
We are building docker registry in ECS with Fargate, s3 as backend storage.
registry container keeps crashing after log shows "PurgeUploads starting"
Related debug log:
===========
Timestamp (UTC+10:00) | Message | Container |
---|---|---|
20 May 2024 at 11:55 (UTC+10:00) | github.com/distribution/distribution/v3/registry/storage.PurgeUploads({0x1a383a0, 0xc000718780}, {0x1a403b8?, 0xc000708950?}, {0xc1887a9faa43c1bf, 0xfffddcd46f22bfb4, 0x24f3e60}, 0x1) | registry |
20 May 2024 at 11:55 (UTC+10:00) | github.com/distribution/distribution/v3/registry/storage/purgeuploads.go:34 +0x12d | registry |
20 May 2024 at 11:55 (UTC+10:00) | github.com/distribution/distribution/v3/registry/handlers.startUploadPurger.func1() | registry |
20 May 2024 at 11:55 (UTC+10:00) | github.com/distribution/distribution/v3/registry/handlers/app.go:1032 +0x33f | registry |
20 May 2024 at 11:55 (UTC+10:00) | created by github.com/distribution/distribution/v3/registry/handlers.startUploadPurger in goroutine 1 | registry |
20 May 2024 at 11:55 (UTC+10:00) | github.com/distribution/distribution/v3/registry/handlers/app.go:1020 +0x329 | registry |
20 May 2024 at 11:55 (UTC+10:00) | panic: runtime error: index out of range [0] with length 0 | registry |
20 May 2024 at 11:55 (UTC+10:00) | goroutine 21 [running]: | registry |
20 May 2024 at 11:55 (UTC+10:00) | github.com/distribution/distribution/v3/registry/storage.getOutstandingUploads.func1({0x1a384b8, 0xc001862c80}) | registry |
20 May 2024 at 11:55 (UTC+10:00) | github.com/distribution/distribution/v3/registry/storage/purgeuploads.go:73 +0x54c | registry |
20 May 2024 at 11:55 (UTC+10:00) | github.com/distribution/distribution/v3/registry/storage/driver/s3-aws.(*driver).doWalk.func1(0xc00015cc60, 0x0?) | registry |
20 May 2024 at 11:55 (UTC+10:00) | github.com/distribution/distribution/v3/registry/storage/driver/s3-aws/s3.go:1159 +0x45c | registry |
20 May 2024 at 11:55 (UTC+10:00) | github.com/aws/aws-sdk-go/service/s3.(*S3).ListObjectsV2PagesWithContext(0xc00049a690, {0x1a38870?, 0xc00046b570}, 0xc00046b500, 0xc001563a18, {0x0, 0x0, 0x0}) | registry |
20 May 2024 at 11:55 (UTC+10:00) | github.com/aws/aws-sdk-go@v1.48.10/service/s3/api.go:7629 +0x1d0 | registry |
20 May 2024 at 11:55 (UTC+10:00) | github.com/distribution/distribution/v3/registry/storage/driver/s3-aws.(*driver).doWalk(0xc0004bb170, {0x1a38870, 0xc00046b490}, 0xc000583ac0, {0xc00009a660?, 0x20?}, {0x0, 0x0}, 0xc00074e480) | registry |
20 May 2024 at 11:55 (UTC+10:00) | github.com/distribution/distribution/v3/registry/storage/driver/s3-aws/s3.go:1123 +0x447 | registry |
20 May 2024 at 11:55 (UTC+10:00) | github.com/distribution/distribution/v3/registry/storage/driver/s3-aws.(*driver).Walk(0xc00009a660?, {0x1a38870, 0xc00046b490}, {0xc00009a660, 0x20}, 0x155a420?, {0x0, 0x0, 0x7779f8?}) | registry |
20 May 2024 at 11:55 (UTC+10:00) | github.com/distribution/distribution/v3/registry/storage/driver/s3-aws/s3.go:1077 +0xdc | registry |
20 May 2024 at 11:55 (UTC+10:00) | github.com/distribution/distribution/v3/registry/storage/driver/base.(*Base).Walk(0xc000708950, {0x1a383a0?, 0xc000718780?}, {0xc00009a660, 0x20}, 0x410885?, {0x0, 0x0, 0x0}) | registry |
20 May 2024 at 11:55 (UTC+10:00) | github.com/distribution/distribution/v3/registry/storage/driver/base/base.go:236 +0x27f | registry |
20 May 2024 at 11:55 (UTC+10:00) | github.com/distribution/distribution/v3/registry/storage.getOutstandingUploads({0x1a383a0, 0xc000718780}, {0x1a403b8?, 0xc000708950}) | registry |
20 May 2024 at 11:55 (UTC+10:00) | github.com/distribution/distribution/v3/registry/storage/purgeuploads.go:70 +0x202 | registry |
20 May 2024 at 11:55 (UTC+10:00) | time="2024-05-20T01:55:18.997854099Z" level=debug msg="s3aws.ListObjectsV2PagesWithContext({\n Bucket: "s3-backend-storage",\n MaxKeys: 1000,\n Prefix: "docker/registry/v2/repositories/",\n StartAfter: ""\n})" go.version=go1.21.5 instance.id=03bd67a0-3f45-446a-8b0b-bb33c8c1e548 service=registry trace.duration=8.288609547s trace.file=github.com/distribution/distribution/v3/registry/storage/driver/s3-aws/s3.go trace.func="github.com/distribution/distribution/v3/registry/storage/driver/s3-aws.(*driver).doWalk" trace.id=be2f0c8a-e911-4767-bf4a-cde149409695 trace.line=1111 trace.parent.id=64255554-37c4-4dee-aaf1-9aa693390ded version=3.0.0-alpha.1 | registry |
20 May 2024 at 11:55 (UTC+10:00) | time="2024-05-20T01:55:18.997899441Z" level=debug msg="s3aws.Walk("/docker/registry/v2/repositories")" go.version=go1.21.5 instance.id=03bd67a0-3f45-446a-8b0b-bb33c8c1e548 service=registry trace.duration=8.288706088s trace.file=github.com/distribution/distribution/v3/registry/storage/driver/base/base.go trace.func="github.com/distribution/distribution/v3/registry/storage/driver/base.(*Base).Walk" trace.id=64255554-37c4-4dee-aaf1-9aa693390ded trace.line=229 version=3.0.0-alpha.1 | registry |
20 May 2024 at 11:55 (UTC+10:00) | 2024/05/20 01:55:11 traces export: Post "https://localhost:4318/v1/traces": dial tcp 127.0.0.1:4318: connect: connection refused | registry |
20 May 2024 at 11:55 (UTC+10:00) | time="2024-05-20T01:55:10.826146919Z" level=debug msg="authorizing request" go.version=go1.21.5 http.request.host="localhost:5000" http.request.id=0be37451-4e52-4594-bb40-e6b15de5988c http.request.method=GET http.request.remoteaddr="127.0.0.1:35320" http.request.uri=/v2/ http.request.useragent=Wget instance.id=03bd67a0-3f45-446a-8b0b-bb33c8c1e548 service=registry version=3.0.0-alpha.1 | registry |
20 May 2024 at 11:55 (UTC+10:00) | time="2024-05-20T01:55:10.826198132Z" level=info msg="response completed" go.version=go1.21.5 http.request.host="localhost:5000" http.request.id=0be37451-4e52-4594-bb40-e6b15de5988c http.request.method=GET http.request.remoteaddr="127.0.0.1:35320" http.request.uri=/v2/ http.request.useragent=Wget http.response.contenttype=application/json http.response.duration="108.405µs" http.response.status=200 http.response.written=2 instance.id=03bd67a0-3f45-446a-8b0b-bb33c8c1e548 service=registry version=3.0.0-alpha.1 | registry |
20 May 2024 at 11:55 (UTC+10:00) | 127.0.0.1 - - [20/May/2024:01:55:10 +0000] "GET /v2/ HTTP/1.1" 200 2 "" "Wget" | registry |
20 May 2024 at 11:55 (UTC+10:00) | time="2024-05-20T01:55:10.709137946Z" level=info msg="PurgeUploads starting: olderThan=2024-05-13 01:55:10.709083583 +0000 UTC m=-601619.974406220, actuallyDelete=true" | registry |
20 May 2024 at 11:55 (UTC+10:00) | time="2024-05-20T01:55:06.864000795Z" level=info msg="response completed" go.version=go1.21.5 http.request.host="192.168.12.41:5000" http.request.id=0f79f86a-f3c1-4e7a-aa97-7c47270aac43 http.request.method=GET http.request.remoteaddr="192.168.11.226:12120" http.request.uri=/v2/ http.request.useragent=ELB-HealthChecker/2.0 http.response.contenttype=application/json http.response.duration="103.999µs" http.response.status=200 http.response.written=2 instance.id=03bd67a0-3f45-446a-8b0b-bb33c8c1e548 service=registry version=3.0.0-alpha.1 | registry |
20 May 2024 at 11:55 (UTC+10:00) | 192.168.11.226 - - [20/May/2024:01:55:06 +0000] "GET /v2/ HTTP/1.1" 200 2 "" "ELB-HealthChecker/2.0" |
Version: 3.0.0-alpha.1
Reproduce
- ECS starts the task (the Docker registry container)
- the container health check passes
- the ELB health check passes
- pulling images works fine
- the container crashes after the log shows "PurgeUploads starting" (purgeuploads.go)
- ECS starts a new task, and steps 1-5 repeat

The crash happens roughly every 50 minutes.
Expected behavior
The Docker registry should run without crashing.
registry version
Version: 3.0.0-alpha.1
Additional Info
No response
- name: docker-registry
  image: registry:3.0.0-alpha.1
  essential: true
  environment:
    - name: REGISTRY_STORAGE
      value: s3
    - name: REGISTRY_STORAGE_S3_BUCKET
      value: s3-backend-storage
    - name: REGISTRY_STORAGE_S3_REGION
      value: ap-southeast-2
    - name: REGISTRY_STORAGE_S3_ENCRYPT
      value: 'true'
    - name: SEARCH_BACKEND
      value: sqlalchemy
    - name: REGISTRY_LOG_LEVEL
      value: debug
  portMappings:
    - containerPort: 5000
      hostPort: 5000
Thanks to anyone who has a look at my issue.
This seems like a bug. It seems path.Split returns an empty file string here: https://github.com/distribution/distribution/blob/576d93fee39c22ef1e4735d932ccef32ee49f7fd/registry/storage/purgeuploads.go#L72C14-L72C24

But this can only happen if:
- filePath is an empty string, i.e. path.Split("")
- filePath ends with a slash, i.e. path.Split("foo/bar/")
I'm guessing this was run on a repo that never had anything uploaded into it, and thus never had the _uploads subdir created? I'm speculating here 🤔
Tests pass because we explicitly create the _uploads
directory by uploading test data:
distribution/registry/storage/purgeuploads_test.go
Lines 25 to 31 in 576d93f
Thanks @milosgajdos for your quick reply.
The S3 bucket (s3-backend-storage) we are using is not new; our old Docker registry 2.8.1 used it for developers to pull/push images.
The structure of the S3 bucket is:
s3-backend-storage:
- docker/
  - registry/
    - v2/
      - blobs/
        - sha256/
      - repositories/
        - all repos/
I have quickly checked the repos; there is no folder named _uploads in any repo's folder, even where an image was uploaded. Is this _uploads folder required to be in the bucket, and where should it be placed in the bucket structure?
Technically it should be there, but if you have an S3 bucket lifecycle policy set up that deletes data crossing a predefined age threshold, then it will have been purged by the S3 lifecycle policy, which frankly is probably a better solution for purging.
@milosgajdos thanks for your message
Could you please suggest where this _uploads folder should be placed in the bucket's structure? Or do you have a workaround to bypass the issue?
Thank you, much appreciated.
Could you please suggest where this _uploads folder should be placed in the bucket's structure?
distribution/registry/storage/paths.go
Lines 107 to 109 in e0a54de
do you have a workaround to bypass the issue?
There is no workaround, I'm afraid, other than creating the _uploads subdirectory in the above-given path.
This is a bug that needs fixing. Thanks for reporting it!
@milosgajdos thank you.
Is this bug only in version 3.0.0-alpha.1, or is it also in earlier 2.8.x versions?
How was this _uploads folder created in the first place? When we push an image, it doesn't seem to create the _uploads folder.
I found this from a search; please see the red-highlighted part for an explanation of the uploads folder.
How was this _uploads folder created in the first place?
This is cleaned up by Commit or Cancel. It does get created, but your uploads succeeded, so it was removed. Note that if you had cancelled them, or an upload had failed, you would have found the uploads there; equally, if you watch the key during an image upload, I'd expect you to see the uploads dir appear.
@milosgajdos thanks for explanation.
If there is no workaround before the fix, I will try going back to a 2.8.x version, but I have to disable the StorageDriver health check, otherwise the registry will crash after startup.
- name: REGISTRY_HEALTH_STORAGEDRIVER_ENABLED
value: "false"
If there is no workaround before the fix, I will try going back to a 2.8.x version
I would strongly recommend against this. v2.8.x hasn't had any bugfixes for I don't know how long; it's basically in maintenance mode at this point.
If I were you I would look into setting up something like this https://docs.aws.amazon.com/AmazonS3/latest/userguide/lifecycle-expire-general-considerations.html
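For reference, a lifecycle configuration along those lines could look something like this (a hedged sketch only: the rule ID, the prefix, and the 7-day threshold are placeholder choices, not values recommended anywhere in this thread):

```json
{
  "Rules": [
    {
      "ID": "abort-stale-multipart-uploads",
      "Filter": { "Prefix": "docker/registry/v2/repositories/" },
      "Status": "Enabled",
      "AbortIncompleteMultipartUpload": { "DaysAfterInitiation": 7 }
    }
  ]
}
```

Aborting incomplete multipart uploads at the S3 level cleans up much of the same data the registry's own purger targets, without the registry process being involved.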
I am still kind of puzzled as to how this is even possible 🤔
As I said earlier:
But this can only happen if:
- filePath is an empty string, i.e. path.Split("")
- filePath ends with a slash, i.e. path.Split("foo/bar/")
Could there be a bug in Walk wherein it returns an empty string (skeptical) or a "directory" (i.e. a key prefix ending in "/"), @Jamstah?
@MXGong would it be possible for you to send us the most important part of your issue: a tree breakdown of <root>/v2/repositories/<name>/ -- you can redact the actual data, but this is kinda important.
Ideally, though, it'd be great if we could see the contents of filePath
or what the fileInfo
resolves to here when the code panics:
@milosgajdos thanks for your reply
I would strongly recommend against this. v2.8.x hasn't had any bugfixes for I don't know how long; it's basically in maintenance mode at this point.
Another reason we might go back to 2.8.x: when I test v3.0 and pull an old image, I get the following error. It seems like 3.0 no longer supports schema version 1?
time="2024-05-20T10:31:35.921555838Z" level=error msg="response completed with error" err.code=unknown err.detail="unrecognized manifest schema version 1" err.message="unknown error" go.version=go1.21.5 http.request.host="registry.domain.com:5000" http.request.id=9cb3700e-28ed-4658-8edb-212b64cfd93e http.request.method=GET http.request.remoteaddr=192.168.11.83 http.request.uri=/v2/testapp/ruby24-git/manifests/latest http.request.useragent="docker/20.10.11 go/go1.16.9 git-commit/847da18 kernel/5.10.76-linuxkit os/linux arch/amd64 UpstreamClient(Docker-Client/20.10.11 \(darwin\))" http.response.contenttype=application/json http.response.duration=58.297951ms http.response.status=500 http.response.written=108 instance.id=61b72780-24f1-4f5b-bac0-a6de6d9d4b37 service=registry vars.name=testapp/ruby24-git vars.reference=latest version=3.0.0-alpha.1
tree breakdown of <root>/v2/repositories/<name>/
-- you can redact the actual data but this is kinda important
Here is the structure:
docker/registry/v2/repositories/registry_test/
docker/registry/v2/repositories/registry_test/_layers/sha256/59bed7b58b2174cdf07033f91b9c5b02a5a1e8c5dbeca67a9f78e7e206170ba7/link
docker/registry/v2/repositories/registry_test/_manifests/
docker/registry/v2/repositories/registry_test/_manifests/revisions/sha256/2b35986f86faea9652520ecf2aee0bca52d59ad799841a22329f97484e68ba1a/link
docker/registry/v2/repositories/registry_test/_manifests/tags/
docker/registry/v2/repositories/registry_test/_manifests/tags/latest/current/link
docker/registry/v2/repositories/registry_test/_manifests/tags/latest/index/sha256/2b35986f86faea9652520ecf2aee0bca52d59ad799841a22329f97484e68ba1a/link
It seems like 3.0 no longer supports schema version 1?
This is correct. Please refrain from using the v1 image format. It's badly designed and is actually deprecated now!
Hi @milosgajdos, good day, just a quick question:
Is this env variable still available in v3?
REGISTRY_COMPATIBILITY_SCHEMA1_ENABLED: true
Is this env variable still available in v3?
REGISTRY_COMPATIBILITY_SCHEMA1_ENABLED: true
No, it isn't. Schema1 support has been removed; as I said, the image format itself was deprecated: https://distribution.github.io/distribution/spec/deprecated-schema-v1/
I haven't been able to reproduce this on the latest tag i.e. edge
Purge config:
maintenance:
uploadpurging:
enabled: true
age: 1m
interval: 1m
dryrun: false
I've pushed the hello-world image to the registry and then waited for the purge to kick in.
NOTE: since the push was successful, there was no uploads directory in the hello-world repo.
The purge succeeds:
time="2024-05-26T17:33:20.076717576Z" level=info msg="PurgeUploads starting: olderThan=2024-05-26 17:32:20.076638118 +0000 UTC m=+60.032612417, actuallyDelete=true"
Filepath: "/docker/registry/v2/repositories/distribution"
File: "distribution"
Filepath: "/docker/registry/v2/repositories/distribution/hello-world"
File: "hello-world"
Filepath: "/docker/registry/v2/repositories/distribution/hello-world/_layers"
File: "_layers"
Filepath: "/docker/registry/v2/repositories/distribution/hello-world/_manifests"
File: "_manifests"
time="2024-05-26T17:33:20.103136812Z" level=debug msg="s3aws.ListObjectsV2PagesWithContext({\n Bucket: \"images-local\",\n MaxKeys: 1000,\n Prefix: \"registry-v2/docker/registry/v2/repositories/\",\n StartAfter: \"registry-v2\"\n})" environment=local go.version=go1.21.8 instance.id=a7b44b8f-ba0c-4bca-ad4d-4f57f38a91b8 service=registry trace.duration=26.161319ms trace.file=github.com/distribution/distribution/v3/registry/storage/driver/s3-aws/s3.go trace.func="github.com/distribution/distribution/v3/registry/storage/driver/s3-aws.(*driver).doWalk" trace.id=a7acc780-353e-4522-82ab-5f26917df3a2 trace.line=1111 version=3.0.0-alpha.1-155-ge0a54de7.m
time="2024-05-26T17:33:20.103223895Z" level=info msg="Purge uploads finished. Num deleted=0, num errors=0"
I don't think this would be a missing dir problem - s3 doesn't really have the concept of directories.
The purge tests only run on the inmemory driver. If we think this is to do with the s3 walk function returning odd values for the path split function, then perhaps adding a case to TestWalk in s3_test.go is the best way to try and reproduce.
@MXGong a couple of questions:
- Is there only one process using the s3 bucket at the same time, or are there multiple replicas?
- Are there active users of the registry?
- Can you try running it without any active users to see if it still fails?
You could lower the purge interval in the config to see if you can trigger the error more quickly, but at the end of the day, we really need a way to reproduce this if we're going to be able to fix it. If you can configure a minimum sized bucket that exhibits the error that would be amazing.
The purge tests only run on the inmemory driver. If we think this is to do with the s3 walk function returning odd values for the path split function, then perhaps adding a case to TestWalk in s3_test.go is the best way to try and reproduce.
I actually ran a sample test in S3 with the repo structure as suggested by @MXGong and a hardcoded purge timer, and couldn't repro. I don't think it's to do with the walk either, but it seemed like the likeliest place when I scanned the issue and quickly browsed the code.
Hi,
We're having the same/similar problems with our Docker registry.
Below are the stack trace and the configuration we're running; I hope this helps.
docker-compose.yml
version: '3.8'
services:
registry:
image: registry:2.8.3
networks:
- nginx_proxy_manager_proxy
- default
# ports:
# - 5000:5000
volumes:
- /mnt/docker/docker/volumes/registry/data:/var/lib/registry
- /mnt/docker/docker/volumes/registry/certs:/certs
- /mnt/docker/docker/volumes/registry/auth:/auth
environment:
# https://docs.docker.com/registry/configuration
# https://docs.docker.com/registry/storage-drivers/
- REGISTRY_HTTP_SECRET=<redacted>
- REGISTRY_STORAGE_DELETE_ENABLED=true
- REGISTRY_STORAGE=s3
- REGISTRY_STORAGE_S3_ACCESSKEY=<redacted>
- REGISTRY_STORAGE_S3_SECRETKEY=<redacted>
- REGISTRY_STORAGE_S3_REGION=eu-central-1
- REGISTRY_STORAGE_S3_BUCKET=<redacted>
logging:
driver: "json-file"
options:
max-size: "5m"
deploy:
mode: replicated
replicas: 1
placement:
constraints:
- node.role != manager
- node.labels.type == t3.medium
restart_policy:
delay: 5s
max_attempts: 3
window: 120s
Stacktrace
2024-08-30T09:03:16.147761529Z panic: runtime error: index out of range [0] with length 0
2024-08-30T09:03:16.147867905Z
2024-08-30T09:03:16.148186949Z goroutine 21 [running]:
2024-08-30T09:03:16.149645331Z github.com/docker/distribution/registry/storage.getOutstandingUploads.func1({0xf29dd8, 0xc0000a4480})
2024-08-30T09:03:16.150151546Z github.com/docker/distribution/registry/storage/purgeuploads.go:73 +0x54c
2024-08-30T09:03:16.150594005Z github.com/docker/distribution/registry/storage/driver/s3-aws.(*driver).doWalk.func1(0xc0004fa880, 0x20?)
2024-08-30T09:03:16.151043410Z github.com/docker/distribution/registry/storage/driver/s3-aws/s3.go:1023 +0x5d3
2024-08-30T09:03:16.154307680Z github.com/docker/distribution/vendor/github.com/aws/aws-sdk-go/service/s3.(*S3).ListObjectsV2PagesWithContext(0xc00009e878, {0x7fcd85acf9b8?, 0xc00041c540}, 0xc0004ccc30, 0xc000449060, {0x0, 0x0, 0x0})
2024-08-30T09:03:16.154806082Z github.com/docker/distribution/vendor/github.com/aws/aws-sdk-go/service/s3/api.go:4198 +0x1b1
2024-08-30T09:03:16.156161041Z github.com/docker/distribution/registry/storage/driver/s3-aws.(*driver).doWalk(0xc000395900, {0xf29320, 0xc00041c460}, 0xc000449ac0, {0xc000496b00, 0x34}, {0xf1d664, 0x1}, 0xc00049bf50)
2024-08-30T09:03:16.156851251Z github.com/docker/distribution/registry/storage/driver/s3-aws/s3.go:982 +0x36d
2024-08-30T09:03:16.156861243Z github.com/docker/distribution/registry/storage/driver/s3-aws.(*driver).doWalk.func1(0xc0004fa780, 0xb0?)
2024-08-30T09:03:16.156865169Z github.com/docker/distribution/registry/storage/driver/s3-aws/s3.go:1037 +0x705
2024-08-30T09:03:16.156868773Z github.com/docker/distribution/vendor/github.com/aws/aws-sdk-go/service/s3.(*S3).ListObjectsV2PagesWithContext(0xc00009e878, {0x7fcd85acf9b8?, 0xc00041c460}, 0xc0004cca50, 0xc000449538, {0x0, 0x0, 0x0})
2024-08-30T09:03:16.156935483Z github.com/docker/distribution/vendor/github.com/aws/aws-sdk-go/service/s3/api.go:4198 +0x1b1
2024-08-30T09:03:16.156939893Z github.com/docker/distribution/registry/storage/driver/s3-aws.(*driver).doWalk(0xc000395900, {0xf29320, 0xc0000e8b60}, 0xc000449ac0, {0xc00003ea80, 0x2b}, {0xf1d664, 0x1}, 0xc00049bf50)
2024-08-30T09:03:16.156943843Z github.com/docker/distribution/registry/storage/driver/s3-aws/s3.go:982 +0x36d
2024-08-30T09:03:16.156947186Z github.com/docker/distribution/registry/storage/driver/s3-aws.(*driver).doWalk.func1(0xc0004fa900, 0x3?)
2024-08-30T09:03:16.156963436Z github.com/docker/distribution/registry/storage/driver/s3-aws/s3.go:1037 +0x705
2024-08-30T09:03:16.156967141Z github.com/docker/distribution/vendor/github.com/aws/aws-sdk-go/service/s3.(*S3).ListObjectsV2PagesWithContext(0xc00009e878, {0x7fcd85acf9b8?, 0xc0000e8b60}, 0xc0004ccc80, 0xc000449a10, {0x0, 0x0, 0x0})
2024-08-30T09:03:16.156987916Z github.com/docker/distribution/vendor/github.com/aws/aws-sdk-go/service/s3/api.go:4198 +0x1b1
2024-08-30T09:03:16.156991664Z github.com/docker/distribution/registry/storage/driver/s3-aws.(*driver).doWalk(0xc000395900, {0xf29320, 0xc0000e8af0}, 0xc0005c5ac0, {0xc00003e6c1, 0x20}, {0xf1d664, 0x1}, 0xc00049bf50)
2024-08-30T09:03:16.156995840Z github.com/docker/distribution/registry/storage/driver/s3-aws/s3.go:982 +0x36d
2024-08-30T09:03:16.156999478Z github.com/docker/distribution/registry/storage/driver/s3-aws.(*driver).Walk(0xc0004c50c0?, {0xf29320, 0xc0000e8af0}, {0xc0004c50c0, 0x20}, 0x508b5c?)
2024-08-30T09:03:16.157002593Z github.com/docker/distribution/registry/storage/driver/s3-aws/s3.go:930 +0x155
2024-08-30T09:03:16.158167281Z github.com/docker/distribution/registry/storage/driver/base.(*Base).Walk(0xc0003e12b0, {0xf29438?, 0xc0004008c0?}, {0xc0004c50c0, 0x20}, 0xc0000a0201?)
2024-08-30T09:03:16.158177491Z github.com/docker/distribution/registry/storage/driver/base/base.go:239 +0x24e
2024-08-30T09:03:16.158180905Z github.com/docker/distribution/registry/storage.getOutstandingUploads({0xf29438, 0xc0004008c0}, {0xf2fdb0?, 0xc0003e12b0})
/etc/docker/registry/config.yml
version: 0.1
log:
fields:
service: registry
storage:
cache:
blobdescriptor: inmemory
filesystem:
rootdirectory: /var/lib/registry
http:
addr: :5000
headers:
X-Content-Type-Options: [nosniff]
health:
storagedriver:
enabled: true
interval: 10s
threshold: 3
Please stop using 2.8.X versions, @ict-pend-group-com. It has not been updated for years, and neither will it be going forward. Use the latest v3 release.
Hi @milosgajdos ,
Might I suggest then that an official release of v3 be created, instead of a beta release?
Since the latest (official) version, according to GitHub, is 2.8.3.
Might I suggest then that an official release of v3 be created, instead of a beta release?
You very well might, but I'd rather ping GitHub and ask them why they hide pre-releases. The latest v3 release was marked as a beta pre-release and GH doesn't display those. Here's a super secret trick: click on https://github.com/distribution/distribution/releases
We are hoping to make a stable release by the end of this year, but the beta is orders of magnitude more stable and bug-free than any v2.8.x release.
I have a customer running into this issue as well. Their _uploads dir is there, so unfortunately I don't think that's the root cause.
Is there any interest in merging a simple check to avoid the panic? Something like this?
This would allow the registry to continue running in this situation while also logging the culprit path for further inspection.
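As an illustration of what such a guard could look like (hypothetical code: checkUploadEntry is a made-up helper paraphrasing the indexing in getOutstandingUploads, not the actual proposed patch):

```go
package main

import (
	"fmt"
	"path"
)

// checkUploadEntry sketches a defensive version of the logic that currently
// panics: instead of indexing file[0] unconditionally after path.Split, it
// skips entries with an empty file component and reports the culprit path.
func checkUploadEntry(filePath string) error {
	_, file := path.Split(filePath)
	if file == "" {
		// The real fix could log this and keep walking instead of panicking.
		return fmt.Errorf("unexpected empty file component for path %q", filePath)
	}
	if file[0] == '_' {
		// The existing _uploads handling would continue here.
	}
	return nil
}

func main() {
	// A key ending in "/" is reported instead of crashing the process.
	fmt.Println(checkUploadEntry("docker/registry/v2/repositories/foo/"))
	fmt.Println(checkUploadEntry("docker/registry/v2/repositories/foo/_uploads"))
}
```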
The current behavior we're seeing is that the registry stays running for a few minutes, then the panic happens, which breaks the container, which triggers k8s to recreate it in a loop every few minutes. Not the end of the world, but quite annoying.
Additionally, a workaround could be to disable upload purging entirely, i.e. set REGISTRY_MAINTENANCE_UPLOADPURGING_ENABLED to false (or via config.yaml). Note that this might end up growing storage consumption due to the lack of purging.
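In config.yml terms, that workaround would look something like this (a sketch using the maintenance keys quoted earlier in this thread):

```yaml
# Disable the upload purger entirely; stale uploads will then accumulate until
# cleaned up by some other means (e.g. an S3 lifecycle policy).
maintenance:
  uploadpurging:
    enabled: false
```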
Is there any interest in merging a simple check to avoid the panic? Something like this?
I would really prefer to have a proper reproduction rather than masking the symptoms with a workaround. Right now we do not know what's causing this, and adding code that might remain in the codebase forever is not the best idea. I've tried to repro this, but I couldn't.
Below are debug level logs right before the panic. I don't see anything new in there, but sharing in case y'all do.
2024-09-09T17:04:38.343273205+08:00 time="2024-09-09T09:04:38.343099652Z" level=info msg="PurgeUploads starting: olderThan=2024-09-02 09:04:38.342982452 +0000 UTC m=-603539.956536565, actuallyDelete=true"
2024-09-09T17:04:41.076712204+08:00 time="2024-09-09T09:04:41.076587704Z" level=debug msg="s3aws.ListObjectsV2Pages(docker/registry/v2/repositories/)" go.version="go1.21.11 (Red Hat 1.21.11-1.el9_4) X:strictfipsruntime" trace.duration=2.733330597s trace.file=/go/src/github.com/openshift/image-registry/vendor/github.com/distribution/distribution/v3/registry/storage/driver/s3-aws/s3.go trace.func="github.com/distribution/distribution/v3/registry/storage/driver/s3-aws.(*driver).doWalk" trace.id=f68196a7-83d7-4a53-b34c-74a9d6b1b1b2 trace.line=1123 trace.parent.id=3207cb4a-8cde-455e-b28f-9551bce2bf01
2024-09-09T17:04:41.076712204+08:00 time="2024-09-09T09:04:41.076673648Z" level=debug msg="s3aws.Walk(\"/docker/registry/v2/repositories\")" go.version="go1.21.11 (Red Hat 1.21.11-1.el9_4) X:strictfipsruntime" trace.duration=2.733476311s trace.file=/go/src/github.com/openshift/image-registry/vendor/github.com/distribution/distribution/v3/registry/storage/driver/base/base.go trace.func="github.com/distribution/distribution/v3/registry/storage/driver/base.(*Base).Walk" trace.id=3207cb4a-8cde-455e-b28f-9551bce2bf01 trace.line=230
2024-09-09T17:04:41.078979933+08:00 panic: runtime error: index out of range [0] with length 0
2024-09-09T17:04:41.078979933+08:00
2024-09-09T17:04:41.078979933+08:00 goroutine 102 [running]:
2024-09-09T17:04:41.078979933+08:00 github.com/distribution/distribution/v3/registry/storage.getOutstandingUploads.func1({0x29c8620, 0xc000e6a980})
2024-09-09T17:04:41.078979933+08:00 /go/src/github.com/openshift/image-registry/vendor/github.com/distribution/distribution/v3/registry/storage/purgeuploads.go:73 +0x54c
2024-09-09T17:04:41.078979933+08:00 github.com/distribution/distribution/v3/registry/storage/driver/s3-aws.(*driver).doWalk.func1(0xc000e134d0, 0x0?)
2024-09-09T17:04:41.078979933+08:00 /go/src/github.com/openshift/image-registry/vendor/github.com/distribution/distribution/v3/registry/storage/driver/s3-aws/s3.go:1171 +0x45c
2024-09-09T17:04:41.079053850+08:00 github.com/aws/aws-sdk-go/service/s3.(*S3).ListObjectsV2PagesWithContext(0xc000c08010, {0x29c8498?, 0xc000629ea0}, 0xc000629e30, 0xc0006cba08, {0x0, 0x0, 0x0})
2024-09-09T17:04:41.079053850+08:00 /go/src/github.com/openshift/image-registry/vendor/github.com/aws/aws-sdk-go/service/s3/api.go:7629 +0x1d0
2024-09-09T17:04:41.079053850+08:00 github.com/distribution/distribution/v3/registry/storage/driver/s3-aws.(*driver).doWalk(0xc0006242d0, {0x29c8498, 0xc000629dc0}, 0xc000a4dab8, {0xc0014bec61, 0x20}, {0x2999cb8, 0x1}, 0xc000ed4f90)
[...]
What that debug log is telling me is the following:
- Seems like you're not using the latest release? There is no ListObjectsV2Pages func in the S3 driver, but there is ListObjectsV2PagesWithContext:
  distribution/registry/storage/driver/s3-aws/s3.go
  Line 1152 in 93e6b90
- Now, more importantly, it would appear that the repositories keyspace is either empty 🤔 or there is something fishy with
  distribution/registry/storage/driver/s3-aws/s3.go
  Line 1167 in 93e6b90
  storagedriver.FileInfoFields.Path being set to "", which then leads to returning "" in place of file, which then leads to the panic.

I haven't dug in more; my time is expensive, so feel free. My suspicion is either strings.Replace or directoryDiff causing some unexpected havoc.
Hey Milos, thanks for having a look!
seems like you're not using the latest release?
Yes, we're a bit behind - we're using a pre-alpha v3 from around June 2023.
it would appear that the repositories keyspace is either empty 🤔 or there is something fishy with
I got a heavily edited list of the customer's storage from support, which indicates it isn't empty:
docker/registry/v2/blobs
docker/registry/v2/repositories/<project_name>/<image_stream>/_layers
docker/registry/v2/repositories/<project_name>/<image_stream>/_manifests
docker/registry/v2/repositories/<project_name>/<image_stream>/_uploads
which leads to storagedriver.FileInfoFields.Path being set to "" which then leads to [...]
True. Either "" or ending in / would make path.Split return an empty file 🤔
Unfortunately it's unclear from the output whether they have any actual contents in their registry or not. I've asked for clarification.
True. Either "" or ending in / would make path.Split return an empty file 🤔
Yeah, as I said in one of my earlier comments #4358 (comment), I wasn't able to repro it the last time I tried, but I had a quick look again just now and I zoomed in on
distribution/registry/storage/driver/s3-aws/s3.go
Line 1167 in 93e6b90
on filePath ending with /, which would then be set in
distribution/registry/storage/driver/s3-aws/s3.go
Line 1186 in 93e6b90
and which could then trigger the panic in PurgeUploads (due to path.Split). Maybe 🤔 That code (Walk) is a bit too complex for my liking.
See a small playground here
That'd imply, though, that ListObjectsV2PagesWithContext would return a key ending with /, so this might be a red herring 🤷‍♂️ Needs someone to spend proper time on this. I just skim it here and there in between paid work.
I appreciate your time and input, @milosgajdos. I also haven't really had a chance to give this the attention it needs.
I'm thinking I'll take someone's suggestion of trying to reproduce this in the s3 driver tests, hopefully sometime this week or next (if someone doesn't beat me to it).