distribution/distribution

Registry keeps crashing after starting "purgeuploads.go"

MXGong opened this issue · 30 comments

Description

We are running a Docker registry on ECS with Fargate, using S3 as the backend storage.

The registry container keeps crashing after the log shows "PurgeUploads starting".

Related debug log:

===========

Timestamp (UTC+10:00) Message Container
20 May 2024 at 11:55 (UTC+10:00) github.com/distribution/distribution/v3/registry/storage.PurgeUploads({0x1a383a0, 0xc000718780}, {0x1a403b8?, 0xc000708950?}, {0xc1887a9faa43c1bf, 0xfffddcd46f22bfb4, 0x24f3e60}, 0x1) registry
20 May 2024 at 11:55 (UTC+10:00) github.com/distribution/distribution/v3/registry/storage/purgeuploads.go:34 +0x12d registry
20 May 2024 at 11:55 (UTC+10:00) github.com/distribution/distribution/v3/registry/handlers.startUploadPurger.func1() registry
20 May 2024 at 11:55 (UTC+10:00) github.com/distribution/distribution/v3/registry/handlers/app.go:1032 +0x33f registry
20 May 2024 at 11:55 (UTC+10:00) created by github.com/distribution/distribution/v3/registry/handlers.startUploadPurger in goroutine 1 registry
20 May 2024 at 11:55 (UTC+10:00) github.com/distribution/distribution/v3/registry/handlers/app.go:1020 +0x329 registry
20 May 2024 at 11:55 (UTC+10:00) panic: runtime error: index out of range [0] with length 0 registry
20 May 2024 at 11:55 (UTC+10:00) goroutine 21 [running]: registry
20 May 2024 at 11:55 (UTC+10:00) github.com/distribution/distribution/v3/registry/storage.getOutstandingUploads.func1({0x1a384b8, 0xc001862c80}) registry
20 May 2024 at 11:55 (UTC+10:00) github.com/distribution/distribution/v3/registry/storage/purgeuploads.go:73 +0x54c registry
20 May 2024 at 11:55 (UTC+10:00) github.com/distribution/distribution/v3/registry/storage/driver/s3-aws.(*driver).doWalk.func1(0xc00015cc60, 0x0?) registry
20 May 2024 at 11:55 (UTC+10:00) github.com/distribution/distribution/v3/registry/storage/driver/s3-aws/s3.go:1159 +0x45c registry
20 May 2024 at 11:55 (UTC+10:00) github.com/aws/aws-sdk-go/service/s3.(*S3).ListObjectsV2PagesWithContext(0xc00049a690, {0x1a38870?, 0xc00046b570}, 0xc00046b500, 0xc001563a18, {0x0, 0x0, 0x0}) registry
20 May 2024 at 11:55 (UTC+10:00) github.com/aws/aws-sdk-go@v1.48.10/service/s3/api.go:7629 +0x1d0 registry
20 May 2024 at 11:55 (UTC+10:00) github.com/distribution/distribution/v3/registry/storage/driver/s3-aws.(*driver).doWalk(0xc0004bb170, {0x1a38870, 0xc00046b490}, 0xc000583ac0, {0xc00009a660?, 0x20?}, {0x0, 0x0}, 0xc00074e480) registry
20 May 2024 at 11:55 (UTC+10:00) github.com/distribution/distribution/v3/registry/storage/driver/s3-aws/s3.go:1123 +0x447 registry
20 May 2024 at 11:55 (UTC+10:00) github.com/distribution/distribution/v3/registry/storage/driver/s3-aws.(*driver).Walk(0xc00009a660?, {0x1a38870, 0xc00046b490}, {0xc00009a660, 0x20}, 0x155a420?, {0x0, 0x0, 0x7779f8?}) registry
20 May 2024 at 11:55 (UTC+10:00) github.com/distribution/distribution/v3/registry/storage/driver/s3-aws/s3.go:1077 +0xdc registry
20 May 2024 at 11:55 (UTC+10:00) github.com/distribution/distribution/v3/registry/storage/driver/base.(*Base).Walk(0xc000708950, {0x1a383a0?, 0xc000718780?}, {0xc00009a660, 0x20}, 0x410885?, {0x0, 0x0, 0x0}) registry
20 May 2024 at 11:55 (UTC+10:00) github.com/distribution/distribution/v3/registry/storage/driver/base/base.go:236 +0x27f registry
20 May 2024 at 11:55 (UTC+10:00) github.com/distribution/distribution/v3/registry/storage.getOutstandingUploads({0x1a383a0, 0xc000718780}, {0x1a403b8?, 0xc000708950}) registry
20 May 2024 at 11:55 (UTC+10:00) github.com/distribution/distribution/v3/registry/storage/purgeuploads.go:70 +0x202 registry
20 May 2024 at 11:55 (UTC+10:00) time="2024-05-20T01:55:18.997854099Z" level=debug msg="s3aws.ListObjectsV2PagesWithContext({\n Bucket: "s3-backend-storage",\n MaxKeys: 1000,\n Prefix: "docker/registry/v2/repositories/",\n StartAfter: ""\n})" go.version=go1.21.5 instance.id=03bd67a0-3f45-446a-8b0b-bb33c8c1e548 service=registry trace.duration=8.288609547s trace.file=github.com/distribution/distribution/v3/registry/storage/driver/s3-aws/s3.go trace.func="github.com/distribution/distribution/v3/registry/storage/driver/s3-aws.(*driver).doWalk" trace.id=be2f0c8a-e911-4767-bf4a-cde149409695 trace.line=1111 trace.parent.id=64255554-37c4-4dee-aaf1-9aa693390ded version=3.0.0-alpha.1 registry
20 May 2024 at 11:55 (UTC+10:00) time="2024-05-20T01:55:18.997899441Z" level=debug msg="s3aws.Walk("/docker/registry/v2/repositories")" go.version=go1.21.5 instance.id=03bd67a0-3f45-446a-8b0b-bb33c8c1e548 service=registry trace.duration=8.288706088s trace.file=github.com/distribution/distribution/v3/registry/storage/driver/base/base.go trace.func="github.com/distribution/distribution/v3/registry/storage/driver/base.(*Base).Walk" trace.id=64255554-37c4-4dee-aaf1-9aa693390ded trace.line=229 version=3.0.0-alpha.1 registry
20 May 2024 at 11:55 (UTC+10:00) 2024/05/20 01:55:11 traces export: Post "https://localhost:4318/v1/traces": dial tcp 127.0.0.1:4318: connect: connection refused registry
20 May 2024 at 11:55 (UTC+10:00) time="2024-05-20T01:55:10.826146919Z" level=debug msg="authorizing request" go.version=go1.21.5 http.request.host="localhost:5000" http.request.id=0be37451-4e52-4594-bb40-e6b15de5988c http.request.method=GET http.request.remoteaddr="127.0.0.1:35320" http.request.uri=/v2/ http.request.useragent=Wget instance.id=03bd67a0-3f45-446a-8b0b-bb33c8c1e548 service=registry version=3.0.0-alpha.1 registry
20 May 2024 at 11:55 (UTC+10:00) time="2024-05-20T01:55:10.826198132Z" level=info msg="response completed" go.version=go1.21.5 http.request.host="localhost:5000" http.request.id=0be37451-4e52-4594-bb40-e6b15de5988c http.request.method=GET http.request.remoteaddr="127.0.0.1:35320" http.request.uri=/v2/ http.request.useragent=Wget http.response.contenttype=application/json http.response.duration="108.405µs" http.response.status=200 http.response.written=2 instance.id=03bd67a0-3f45-446a-8b0b-bb33c8c1e548 service=registry version=3.0.0-alpha.1 registry
20 May 2024 at 11:55 (UTC+10:00) 127.0.0.1 - - [20/May/2024:01:55:10 +0000] "GET /v2/ HTTP/1.1" 200 2 "" "Wget" registry
20 May 2024 at 11:55 (UTC+10:00) time="2024-05-20T01:55:10.709137946Z" level=info msg="PurgeUploads starting: olderThan=2024-05-13 01:55:10.709083583 +0000 UTC m=-601619.974406220, actuallyDelete=true" registry
20 May 2024 at 11:55 (UTC+10:00) time="2024-05-20T01:55:06.864000795Z" level=info msg="response completed" go.version=go1.21.5 http.request.host="192.168.12.41:5000" http.request.id=0f79f86a-f3c1-4e7a-aa97-7c47270aac43 http.request.method=GET http.request.remoteaddr="192.168.11.226:12120" http.request.uri=/v2/ http.request.useragent=ELB-HealthChecker/2.0 http.response.contenttype=application/json http.response.duration="103.999µs" http.response.status=200 http.response.written=2 instance.id=03bd67a0-3f45-446a-8b0b-bb33c8c1e548 service=registry version=3.0.0-alpha.1 registry
20 May 2024 at 11:55 (UTC+10:00) 192.168.11.226 - - [20/May/2024:01:55:06 +0000] "GET /v2/ HTTP/1.1" 200 2 "" "ELB-HealthChecker/2.0"

Version: 3.0.0-alpha.1

Reproduce

  1. ECS starts the task (the Docker registry container)
  2. The container health check passes
  3. The ELB health check passes
  4. Pulling images works fine
  5. The registry crashes after the log shows purgeuploads.go starting
  6. ECS starts a new task, and steps 1-5 repeat

The crash happens roughly every 50 minutes.

Expected behavior

The Docker registry should be able to run without crashing.

registry version

Version: 3.0.0-alpha.1

Additional Info

No response

- name: docker-registry
  image: registry:3.0.0-alpha.1
  essential: true
  environment:
    - name: REGISTRY_STORAGE
      value: s3
    - name: REGISTRY_STORAGE_S3_BUCKET
      value: s3-backend-storage
    - name: REGISTRY_STORAGE_S3_REGION
      value: ap-southeast-2
    - name: REGISTRY_STORAGE_S3_ENCRYPT
      value: 'true'
    - name: SEARCH_BACKEND
      value: sqlalchemy
    - name: REGISTRY_LOG_LEVEL
      value: debug
  portMappings:
    - containerPort: 5000
      hostPort: 5000

Thanks to anyone who will have a look at my issue.

This seems like a bug. It seems path.Split returns an empty file string here: https://github.com/distribution/distribution/blob/576d93fee39c22ef1e4735d932ccef32ee49f7fd/registry/storage/purgeuploads.go#L72C14-L72C24

But this can only happen if (see the sketch below):

  • filePath is an empty string, i.e. path.Split("")
  • filePath ends with a slash, i.e. path.Split("foo/bar/")
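
For reference, here is a minimal, standalone sketch (not project code) of how path.Split behaves in exactly those two cases; indexing the empty file component is what produces the "index out of range [0] with length 0" panic:

package main

import (
    "fmt"
    "path"
)

func main() {
    // The two suspect inputs plus a normal path for contrast.
    for _, p := range []string{"", "foo/bar/", "foo/bar/baz"} {
        dir, file := path.Split(p)
        // When file == "", indexing file[0] panics with
        // "index out of range [0] with length 0".
        fmt.Printf("path.Split(%q) -> dir=%q, file=%q (len %d)\n", p, dir, file, len(file))
    }
}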

I'm guessing this was run on a repo that never had anything uploaded to it and thus never had the _uploads subdir created? I'm speculating here 🤔

Tests pass because we explicitly create the _uploads directory by uploading test data:

dataPath, err := pathFor(uploadDataPathSpec{name: repo, id: uploadID})
if err != nil {
    t.Fatalf("Unable to resolve path")
}
if err := d.PutContent(ctx, dataPath, []byte("")); err != nil {
    t.Fatalf("Unable to write data file")
}

Thanks @milosgajdos for your quick reply.

The S3 bucket (s3-backend-storage) we are using is not new; we had an old Docker registry 2.8.1 using it for developers to pull/push images.

the structure of s3 bucket is:
s3-backend-storage:

  • docker/
    • registry/
      • v2/
        • blobs/
          • sha256/
        • repositories/
          • all repos/

I have quickly checked the repos: there is no folder named _uploads in any repo's folder, even where an image was uploaded. Is this _uploads folder required to be in the bucket, and where should it be placed in the bucket structure?

Technically it should be there, but if you have an S3 bucket lifecycle policy set up that deletes expired data crossing a predefined age threshold, then it will have been purged by that lifecycle policy, which frankly is probably a better solution for purging.
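
For illustration only, here is a rough aws-sdk-go sketch of such a lifecycle rule. The 7-day age is an assumption, and S3 prefix filters cannot wildcard across repositories, so a real setup would need one rule per repository's _uploads prefix (or be configured via the console/CLI as described in the AWS docs):

package main

import (
    "log"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/s3"
)

func main() {
    sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("ap-southeast-2")}))
    svc := s3.New(sess)

    // Expire anything left under one repository's _uploads prefix after 7 days.
    // Bucket and repository names are taken from this thread; adjust as needed.
    // Note: this call replaces the bucket's existing lifecycle configuration.
    _, err := svc.PutBucketLifecycleConfiguration(&s3.PutBucketLifecycleConfigurationInput{
        Bucket: aws.String("s3-backend-storage"),
        LifecycleConfiguration: &s3.BucketLifecycleConfiguration{
            Rules: []*s3.LifecycleRule{{
                ID:     aws.String("expire-stale-uploads"),
                Status: aws.String("Enabled"),
                Filter: &s3.LifecycleRuleFilter{
                    Prefix: aws.String("docker/registry/v2/repositories/registry_test/_uploads/"),
                },
                Expiration: &s3.LifecycleExpiration{Days: aws.Int64(7)},
            }},
        },
    })
    if err != nil {
        log.Fatal(err)
    }
}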

@milosgajdos thanks for your message

Would you please suggest where this _uploads folder should be placed in the bucket's structure? Or do you have a workaround to bypass the issue?

Thank you, much appreciated.

Would you please suggest where this _uploads folder should be placed in the bucket's structure?

// uploadDataPathSpec: <root>/v2/repositories/<name>/_uploads/<id>/data
// uploadStartedAtPathSpec: <root>/v2/repositories/<name>/_uploads/<id>/startedat
// uploadHashStatePathSpec: <root>/v2/repositories/<name>/_uploads/<id>/hashstates/<algorithm>/<offset>

do you have a workaround to bypass the issue?

There is no workaround I'm afraid, other than creating the _uploads subdirectory in the above-given path.
This is a bug that needs fixing. Thanks for reporting it!

@milosgajdos thank you.

Is this bug only in Version: 3.0.0-alpha.1, or is it also present in earlier 2.8.x versions?

How was this _uploads folder created in the first place? When we push an image, it doesn't seem to create the _uploads folder.

I found this from a search; please see the part highlighted in red for an explanation of the uploads folder.

[screenshot: documentation excerpt explaining the _uploads folder, with the relevant part highlighted in red]

How was this _uploads folder created in the first place?

This is cleaned up by Commit or Cancel. It does get created, but your uploads succeeded so it was removed. Note that if you had cancelled them or the upload had failed, you would've found the uploads there; equally, if you watch the keys during an image upload, I'd expect you to see the _uploads dir appear.
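
If you want to watch that lifecycle yourself, here is a small, hypothetical sketch (assuming an unauthenticated registry on localhost:5000 and a repository named "demo") that starts a blob upload, which creates the _uploads/<id>/ keys, and then cancels it, which removes them:

package main

import (
    "fmt"
    "net/http"
    "strings"
)

func main() {
    // Starting an upload session creates keys such as
    // .../repositories/demo/_uploads/<id>/startedat in the backend.
    resp, err := http.Post("http://localhost:5000/v2/demo/blobs/uploads/", "", nil)
    if err != nil {
        panic(err)
    }
    resp.Body.Close()

    location := resp.Header.Get("Location")
    if strings.HasPrefix(location, "/") {
        location = "http://localhost:5000" + location
    }
    fmt.Println("upload session:", resp.Status, location)

    // Cancelling the session (like an aborted push) removes the _uploads keys;
    // committing the upload cleans them up the same way.
    req, _ := http.NewRequest(http.MethodDelete, location, nil)
    resp2, err := http.DefaultClient.Do(req)
    if err != nil {
        panic(err)
    }
    resp2.Body.Close()
    fmt.Println("cancel:", resp2.Status)
}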

@milosgajdos thanks for explanation.

If there is no workaround before a fix, I will try going back to the 2.8.x version, but I have to disable the StorageDriver health check, otherwise the registry will crash after startup.

    - name: REGISTRY_HEALTH_STORAGEDRIVER_ENABLED
      value: "false"

If there is no workaround before a fix, I will try going back to the 2.8.x version

I would strongly recommend against this. v2.8.x hasn't had any bugfixes in I don't know how long; it's basically in maintenance mode at this point.

If I were you I would look into setting up something like this https://docs.aws.amazon.com/AmazonS3/latest/userguide/lifecycle-expire-general-considerations.html

I am still kind of puzzled as to how this can even happen 🤔

As I said earlier:

But this can only happen if

filePath is an empty string, i.e. path.Split("")
filePath ends with a slash, i.e. path.Split("foo/bar/")

Could there be a bug in Walk wherein it returns an empty string (skeptical) or a "directory" (i.e. a key prefix ending in "/"), @Jamstah?

@MXGong would it be possible for you to send us the most important part of your issue:
tree breakdown of <root>/v2/repositories/<name>/ -- you can redact the actual data but this is kinda important.

Ideally, though, it'd be great if we could see the contents of filePath or what the fileInfo resolves to here when the code panics:

filePath := fileInfo.Path()

@milosgajdos thanks for your reply

I would strongly recommend against this. v2.8.x hasn't got any bugfixes for I don't know how long; it's basically in a maintenance mode at this point.

Another reason we might go back to 2.8.x is that when I tested v3.0 and pulled an old image, it produced the following error; it seems like 3.0 no longer supports schema version 1?

time="2024-05-20T10:31:35.921555838Z" level=error msg="response completed with error" err.code=unknown err.detail="unrecognized manifest schema version 1" err.message="unknown error" go.version=go1.21.5 http.request.host="registry.domain.com:5000" http.request.id=9cb3700e-28ed-4658-8edb-212b64cfd93e http.request.method=GET http.request.remoteaddr=192.168.11.83 http.request.uri=/v2/testapp/ruby24-git/manifests/latest http.request.useragent="docker/20.10.11 go/go1.16.9 git-commit/847da18 kernel/5.10.76-linuxkit os/linux arch/amd64 UpstreamClient(Docker-Client/20.10.11 \(darwin\))" http.response.contenttype=application/json http.response.duration=58.297951ms http.response.status=500 http.response.written=108 instance.id=61b72780-24f1-4f5b-bac0-a6de6d9d4b37 service=registry vars.name=testapp/ruby24-git vars.reference=latest version=3.0.0-alpha.1

tree breakdown of <root>/v2/repositories/<name>/ -- you can redact the actual data but this is kinda important

Here is the structure:

docker/registry/v2/repositories/registry_test/
docker/registry/v2/repositories/registry_test/_layers/sha256/59bed7b58b2174cdf07033f91b9c5b02a5a1e8c5dbeca67a9f78e7e206170ba7/link
docker/registry/v2/repositories/registry_test/_manifests/
docker/registry/v2/repositories/registry_test/_manifests/revisions/sha256/2b35986f86faea9652520ecf2aee0bca52d59ad799841a22329f97484e68ba1a/link
docker/registry/v2/repositories/registry_test/_manifests/tags/
docker/registry/v2/repositories/registry_test/_manifests/tags/latest/current/link
docker/registry/v2/repositories/registry_test/_manifests/tags/latest/index/sha256/2b35986f86faea9652520ecf2aee0bca52d59ad799841a22329f97484e68ba1a/link

it seems like 3.0 no longer supports schema version 1?

This is correct. Please refrain from using the v1 image format. It's badly designed and is actually deprecated now!

Hi @milosgajdos, good day, just a quick question:
is this env variable still available in v3?

REGISTRY_COMPATIBILITY_SCHEMA1_ENABLED: true

is this env variable still available in v3?

REGISTRY_COMPATIBILITY_SCHEMA1_ENABLED: true

No it isn't. Schema1 support has been deprecated. As I said, the image format was deprecated https://distribution.github.io/distribution/spec/deprecated-schema-v1/

I haven't been able to reproduce this on the latest tag, i.e. edge.

Purge config:

  maintenance:
    uploadpurging:
      enabled: true
      age: 1m
      interval: 1m
      dryrun: false

I've pushed the hello-world image to the registry and then waited for the purge to kick in.

NOTE: since the push was successful there was no uploads directory in the hello-world repo

The purge completes successfully:

time="2024-05-26T17:33:20.076717576Z" level=info msg="PurgeUploads starting: olderThan=2024-05-26 17:32:20.076638118 +0000 UTC m=+60.032612417, actuallyDelete=true"
Filepath: "/docker/registry/v2/repositories/distribution"
File: "distribution"
Filepath: "/docker/registry/v2/repositories/distribution/hello-world"
File: "hello-world"
Filepath: "/docker/registry/v2/repositories/distribution/hello-world/_layers"
File: "_layers"
Filepath: "/docker/registry/v2/repositories/distribution/hello-world/_manifests"
File: "_manifests"
time="2024-05-26T17:33:20.103136812Z" level=debug msg="s3aws.ListObjectsV2PagesWithContext({\n  Bucket: \"images-local\",\n  MaxKeys: 1000,\n  Prefix: \"registry-v2/docker/registry/v2/repositories/\",\n  StartAfter: \"registry-v2\"\n})" environment=local go.version=go1.21.8 instance.id=a7b44b8f-ba0c-4bca-ad4d-4f57f38a91b8 service=registry trace.duration=26.161319ms trace.file=github.com/distribution/distribution/v3/registry/storage/driver/s3-aws/s3.go trace.func="github.com/distribution/distribution/v3/registry/storage/driver/s3-aws.(*driver).doWalk" trace.id=a7acc780-353e-4522-82ab-5f26917df3a2 trace.line=1111 version=3.0.0-alpha.1-155-ge0a54de7.m
time="2024-05-26T17:33:20.103223895Z" level=info msg="Purge uploads finished.  Num deleted=0, num errors=0"

I don't think this would be a missing dir problem - s3 doesn't really have the concept of directories.

The purge tests only run on the inmemory driver. If we think this is to do with the s3 walk function returning odd values for the path split function, then perhaps adding a case to TestWalk in s3_test.go is the best way to try and reproduce.
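
Not an actual addition to s3_test.go, just a rough sketch of the shape such a check could take, shown here against the inmemory driver (a real repro would need the S3 driver pointed at an affected bucket):

package main

import (
    "context"
    "fmt"
    "path"

    storagedriver "github.com/distribution/distribution/v3/registry/storage/driver"
    "github.com/distribution/distribution/v3/registry/storage/driver/inmemory"
)

func main() {
    ctx := context.Background()
    d := inmemory.New()

    // Seed a repo-like layout; with the S3 driver this would instead walk the
    // real bucket contents.
    _ = d.PutContent(ctx, "/docker/registry/v2/repositories/foo/_layers/link", []byte("x"))

    // Assert that Walk never yields a path whose base component is empty;
    // that is the condition that makes path.Split feed an empty string into
    // purgeuploads.
    err := d.Walk(ctx, "/docker/registry/v2/repositories", func(fi storagedriver.FileInfo) error {
        if _, file := path.Split(fi.Path()); file == "" {
            return fmt.Errorf("walk returned path with empty base: %q", fi.Path())
        }
        return nil
    })
    fmt.Println("walk error:", err)
}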

@MXGong a couple of questions:

  • Is there only one process using the S3 bucket at the same time, or are there multiple replicas?
  • Are there active users of the registry?
  • Can you try running it without any active users to see if it still fails?

You could lower the purge interval in the config to see if you can trigger the error more quickly, but at the end of the day, we really need a way to reproduce this if we're going to be able to fix it. If you can configure a minimum sized bucket that exhibits the error that would be amazing.

The purge tests only run on the inmemory driver. If we think this is to do with the s3 walk function returning odd values for the path split function, then perhaps adding a case to TestWalk in s3_test.go is the best way to try and reproduce.

I actually ran a sample test in S3 with the repo structure as suggested by @MXGong and a hardcoded purge timer, and couldn't repro. I don't think it's to do with the walk either, but it seemed like the likeliest place when I scanned the issue and quickly browsed the code.

Hi,

We're having the same/similar problems with our Docker registry.

Below are the stack trace and the configuration we're running; I hope this will help.

docker-compose.yml

version: '3.8'

services:
  registry:
    image: registry:2.8.3
    networks:
      - nginx_proxy_manager_proxy
      - default
  #  ports:
  #    - 5000:5000
    volumes:
      - /mnt/docker/docker/volumes/registry/data:/var/lib/registry
      - /mnt/docker/docker/volumes/registry/certs:/certs
      - /mnt/docker/docker/volumes/registry/auth:/auth
    environment:
      # https://docs.docker.com/registry/configuration
      # https://docs.docker.com/registry/storage-drivers/
      - REGISTRY_HTTP_SECRET=<redacted>
      - REGISTRY_STORAGE_DELETE_ENABLED=true
      - REGISTRY_STORAGE=s3
      - REGISTRY_STORAGE_S3_ACCESSKEY=<redacted>
      - REGISTRY_STORAGE_S3_SECRETKEY=<redacted>
      - REGISTRY_STORAGE_S3_REGION=eu-central-1
      - REGISTRY_STORAGE_S3_BUCKET=<redacted>
    logging:
        driver: "json-file"
        options:
            max-size: "5m"
    deploy:
      mode: replicated
      replicas: 1
      placement: 
        constraints: 
          - node.role != manager
          - node.labels.type == t3.medium
      restart_policy:
        delay: 5s
        max_attempts: 3
        window: 120s

Stacktrace

2024-08-30T09:03:16.147761529Z panic: runtime error: index out of range [0] with length 0
2024-08-30T09:03:16.147867905Z 
2024-08-30T09:03:16.148186949Z goroutine 21 [running]:
2024-08-30T09:03:16.149645331Z github.com/docker/distribution/registry/storage.getOutstandingUploads.func1({0xf29dd8, 0xc0000a4480})
2024-08-30T09:03:16.150151546Z 	github.com/docker/distribution/registry/storage/purgeuploads.go:73 +0x54c
2024-08-30T09:03:16.150594005Z github.com/docker/distribution/registry/storage/driver/s3-aws.(*driver).doWalk.func1(0xc0004fa880, 0x20?)
2024-08-30T09:03:16.151043410Z 	github.com/docker/distribution/registry/storage/driver/s3-aws/s3.go:1023 +0x5d3
2024-08-30T09:03:16.154307680Z github.com/docker/distribution/vendor/github.com/aws/aws-sdk-go/service/s3.(*S3).ListObjectsV2PagesWithContext(0xc00009e878, {0x7fcd85acf9b8?, 0xc00041c540}, 0xc0004ccc30, 0xc000449060, {0x0, 0x0, 0x0})
2024-08-30T09:03:16.154806082Z 	github.com/docker/distribution/vendor/github.com/aws/aws-sdk-go/service/s3/api.go:4198 +0x1b1
2024-08-30T09:03:16.156161041Z github.com/docker/distribution/registry/storage/driver/s3-aws.(*driver).doWalk(0xc000395900, {0xf29320, 0xc00041c460}, 0xc000449ac0, {0xc000496b00, 0x34}, {0xf1d664, 0x1}, 0xc00049bf50)
2024-08-30T09:03:16.156851251Z 	github.com/docker/distribution/registry/storage/driver/s3-aws/s3.go:982 +0x36d
2024-08-30T09:03:16.156861243Z github.com/docker/distribution/registry/storage/driver/s3-aws.(*driver).doWalk.func1(0xc0004fa780, 0xb0?)
2024-08-30T09:03:16.156865169Z 	github.com/docker/distribution/registry/storage/driver/s3-aws/s3.go:1037 +0x705
2024-08-30T09:03:16.156868773Z github.com/docker/distribution/vendor/github.com/aws/aws-sdk-go/service/s3.(*S3).ListObjectsV2PagesWithContext(0xc00009e878, {0x7fcd85acf9b8?, 0xc00041c460}, 0xc0004cca50, 0xc000449538, {0x0, 0x0, 0x0})
2024-08-30T09:03:16.156935483Z 	github.com/docker/distribution/vendor/github.com/aws/aws-sdk-go/service/s3/api.go:4198 +0x1b1
2024-08-30T09:03:16.156939893Z github.com/docker/distribution/registry/storage/driver/s3-aws.(*driver).doWalk(0xc000395900, {0xf29320, 0xc0000e8b60}, 0xc000449ac0, {0xc00003ea80, 0x2b}, {0xf1d664, 0x1}, 0xc00049bf50)
2024-08-30T09:03:16.156943843Z 	github.com/docker/distribution/registry/storage/driver/s3-aws/s3.go:982 +0x36d
2024-08-30T09:03:16.156947186Z github.com/docker/distribution/registry/storage/driver/s3-aws.(*driver).doWalk.func1(0xc0004fa900, 0x3?)
2024-08-30T09:03:16.156963436Z 	github.com/docker/distribution/registry/storage/driver/s3-aws/s3.go:1037 +0x705
2024-08-30T09:03:16.156967141Z github.com/docker/distribution/vendor/github.com/aws/aws-sdk-go/service/s3.(*S3).ListObjectsV2PagesWithContext(0xc00009e878, {0x7fcd85acf9b8?, 0xc0000e8b60}, 0xc0004ccc80, 0xc000449a10, {0x0, 0x0, 0x0})
2024-08-30T09:03:16.156987916Z 	github.com/docker/distribution/vendor/github.com/aws/aws-sdk-go/service/s3/api.go:4198 +0x1b1
2024-08-30T09:03:16.156991664Z github.com/docker/distribution/registry/storage/driver/s3-aws.(*driver).doWalk(0xc000395900, {0xf29320, 0xc0000e8af0}, 0xc0005c5ac0, {0xc00003e6c1, 0x20}, {0xf1d664, 0x1}, 0xc00049bf50)
2024-08-30T09:03:16.156995840Z 	github.com/docker/distribution/registry/storage/driver/s3-aws/s3.go:982 +0x36d
2024-08-30T09:03:16.156999478Z github.com/docker/distribution/registry/storage/driver/s3-aws.(*driver).Walk(0xc0004c50c0?, {0xf29320, 0xc0000e8af0}, {0xc0004c50c0, 0x20}, 0x508b5c?)
2024-08-30T09:03:16.157002593Z 	github.com/docker/distribution/registry/storage/driver/s3-aws/s3.go:930 +0x155
2024-08-30T09:03:16.158167281Z github.com/docker/distribution/registry/storage/driver/base.(*Base).Walk(0xc0003e12b0, {0xf29438?, 0xc0004008c0?}, {0xc0004c50c0, 0x20}, 0xc0000a0201?)
2024-08-30T09:03:16.158177491Z 	github.com/docker/distribution/registry/storage/driver/base/base.go:239 +0x24e
2024-08-30T09:03:16.158180905Z github.com/docker/distribution/registry/storage.getOutstandingUploads({0xf29438, 0xc0004008c0}, {0xf2fdb0?, 0xc0003e12b0})

/etc/docker/registry/config.yml

version: 0.1
log:
  fields:
    service: registry
storage:
  cache:
    blobdescriptor: inmemory
  filesystem:
    rootdirectory: /var/lib/registry
http:
  addr: :5000
  headers:
    X-Content-Type-Options: [nosniff]
health:
  storagedriver:
    enabled: true
    interval: 10s
    threshold: 3

Please stop using the 2.8.x version, @ict-pend-group-com

It has not been updated in a long time and neither will it be going forward. Use the latest v3 release.

Hi @milosgajdos ,

Might I suggest then that an official release of v3 be created, instead of a beta release ❓

Since the latest (official) version, according to GitHub, is 2.8.3

[screenshot: GitHub releases page showing 2.8.3 as the latest release]

Might I suggest then that an official release of v3 be created, instead of a beta release ❓

You very well might, but I'd rather ping GitHub and ask them why they hide pre-releases. The latest v3 release was marked as a beta pre-release and GH doesn't display those. Here's a super secret trick: click on https://github.com/distribution/distribution/releases

We are hoping to make a stable release by the end of this year, but the beta is orders of magnitude more stable and bug-free than any v2.8.x release.

I have a customer running into this issue as well. Their _uploads dir is there, so unfortunately I don't think that's the root cause.

Is there any interest in merging a simple check to avoid the panic? Something like this?
This would allow the registry to continue running in this situation while also logging the culprit path for further inspection.
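
The linked change isn't reproduced in this thread; purely as an illustration of the kind of guard being described (the surrounding names are assumed from the stack trace, not the actual patch):

package main

import (
    "fmt"
    "log"
    "path"
)

// hasReservedPrefix sketches a defensive version of the check that currently
// panics: instead of indexing file[0] unconditionally, it logs and skips any
// walked path whose base component comes back empty from path.Split.
func hasReservedPrefix(filePath string) bool {
    _, file := path.Split(filePath)
    if file == "" {
        log.Printf("purgeuploads: skipping unexpected path %q", filePath)
        return false
    }
    return file[0] == '_'
}

func main() {
    for _, p := range []string{
        "/docker/registry/v2/repositories/foo/_uploads",
        "/docker/registry/v2/repositories/foo/", // would have panicked before
        "", // would have panicked before
    } {
        fmt.Printf("%q -> reserved=%v\n", p, hasReservedPrefix(p))
    }
}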

The current behavior we're seeing is that the registry stays running for a few minutes, then the panic happens, which breaks the container, which triggers k8s to recreate it in a loop every few minutes. Not the end of the world, but quite annoying.

Additionally, a workaround could be to disable upload purging entirely, i.e. set REGISTRY_MAINTENANCE_UPLOADPURGING_ENABLED to false (or via config.yaml). Note that this might end up growing storage consumption due to the lack of purging.

Is there any interest in merging a simple check to avoid the panic? something like this?

I would really prefer to have a proper reproduction rather than masking the symptoms with some workaround. Right now we do not know what's causing this and adding some code that might remain in the codebase forever is not the best idea. I've tried to repro this but I couldn't.

Below are debug level logs right before the panic. I don't see anything new in there, but sharing in case y'all do.

2024-09-09T17:04:38.343273205+08:00 time="2024-09-09T09:04:38.343099652Z" level=info msg="PurgeUploads starting: olderThan=2024-09-02 09:04:38.342982452 +0000 UTC m=-603539.956536565, actuallyDelete=true"
2024-09-09T17:04:41.076712204+08:00 time="2024-09-09T09:04:41.076587704Z" level=debug msg="s3aws.ListObjectsV2Pages(docker/registry/v2/repositories/)" go.version="go1.21.11 (Red Hat 1.21.11-1.el9_4) X:strictfipsruntime" trace.duration=2.733330597s trace.file=/go/src/github.com/openshift/image-registry/vendor/github.com/distribution/distribution/v3/registry/storage/driver/s3-aws/s3.go trace.func="github.com/distribution/distribution/v3/registry/storage/driver/s3-aws.(*driver).doWalk" trace.id=f68196a7-83d7-4a53-b34c-74a9d6b1b1b2 trace.line=1123 trace.parent.id=3207cb4a-8cde-455e-b28f-9551bce2bf01
2024-09-09T17:04:41.076712204+08:00 time="2024-09-09T09:04:41.076673648Z" level=debug msg="s3aws.Walk(\"/docker/registry/v2/repositories\")" go.version="go1.21.11 (Red Hat 1.21.11-1.el9_4) X:strictfipsruntime" trace.duration=2.733476311s trace.file=/go/src/github.com/openshift/image-registry/vendor/github.com/distribution/distribution/v3/registry/storage/driver/base/base.go trace.func="github.com/distribution/distribution/v3/registry/storage/driver/base.(*Base).Walk" trace.id=3207cb4a-8cde-455e-b28f-9551bce2bf01 trace.line=230
2024-09-09T17:04:41.078979933+08:00 panic: runtime error: index out of range [0] with length 0
2024-09-09T17:04:41.078979933+08:00 
2024-09-09T17:04:41.078979933+08:00 goroutine 102 [running]:
2024-09-09T17:04:41.078979933+08:00 github.com/distribution/distribution/v3/registry/storage.getOutstandingUploads.func1({0x29c8620, 0xc000e6a980})
2024-09-09T17:04:41.078979933+08:00 	/go/src/github.com/openshift/image-registry/vendor/github.com/distribution/distribution/v3/registry/storage/purgeuploads.go:73 +0x54c
2024-09-09T17:04:41.078979933+08:00 github.com/distribution/distribution/v3/registry/storage/driver/s3-aws.(*driver).doWalk.func1(0xc000e134d0, 0x0?)
2024-09-09T17:04:41.078979933+08:00 	/go/src/github.com/openshift/image-registry/vendor/github.com/distribution/distribution/v3/registry/storage/driver/s3-aws/s3.go:1171 +0x45c
2024-09-09T17:04:41.079053850+08:00 github.com/aws/aws-sdk-go/service/s3.(*S3).ListObjectsV2PagesWithContext(0xc000c08010, {0x29c8498?, 0xc000629ea0}, 0xc000629e30, 0xc0006cba08, {0x0, 0x0, 0x0})
2024-09-09T17:04:41.079053850+08:00 	/go/src/github.com/openshift/image-registry/vendor/github.com/aws/aws-sdk-go/service/s3/api.go:7629 +0x1d0
2024-09-09T17:04:41.079053850+08:00 github.com/distribution/distribution/v3/registry/storage/driver/s3-aws.(*driver).doWalk(0xc0006242d0, {0x29c8498, 0xc000629dc0}, 0xc000a4dab8, {0xc0014bec61, 0x20}, {0x2999cb8, 0x1}, 0xc000ed4f90)
[...]

What that debug log is telling me is the following:

I haven't dug in more, my time is expensive, so feel free. My suspicion is either strings.Replace or directoryDiff causing some unexpected havoc.

Hey Milos, thanks for having a look!

seems like you're not using the latest release?

Yes, we're a bit behind - we're using a pre-alpha v3 from around June 2023.

it would appear that the repositories keyspace is either empty 🤔 or there is something fishy with

I got a heavily edited list of the customer's storage from support which indicates it isn't empty:

docker/registry/v2/blobs
docker/registry/v2/repositories/<project_name>/<image_stream>/_layers
docker/registry/v2/repositories/<project_name>/<image_stream>/_manifests
docker/registry/v2/repositories/<project_name>/<image_stream>/_uploads

which leads to storagedriver.FileInfoFields.Path being set to "" which then leads to [...]

True. Either "" or ending in / would make path.Split return an empty file 🤔
Unfortunately it's unclear from the output whether they have any actual contents in their registry or not. I've asked for clarification.

True. Either "" or ending in / would make path.Split return an empty file 🤔

Yeah as I said in one of my earlier comments #4358 (comment)

I wasn't able to repro it the last time I tried, but I had a quick look again just now and I zoomed in on

filePath := strings.Replace(*file.Key, d.s3Path(""), prefix, 1)

which can return filePath ending with /, which would then be set in

which could then trigger the panic in PurgeUploads (due to path.Split). Maybe 🤔 That code (Walk) is a bit too complex for my liking

See a small playground here

That'd imply, though, that ListObjectsV2PagesWithContext would return a key ending with /, so this might be a red herring 🤷‍♂️ Needs someone to spend proper time on this. I just skim it here and there in between paid work.
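
The playground linked above isn't captured here; a comparable standalone sketch of that suspicion, with every value assumed purely for illustration:

package main

import (
    "fmt"
    "path"
    "strings"
)

func main() {
    // Assumed values: what d.s3Path("") and the walk prefix might resolve to,
    // plus a hypothetical listing key that ends in "/".
    s3Root := "registry-v2/"
    prefix := "/"
    key := "registry-v2/docker/registry/v2/repositories/foo/_uploads/"

    // Mirrors the strings.Replace quoted above: a trailing slash in the key
    // survives the replacement...
    filePath := strings.Replace(key, s3Root, prefix, 1)

    // ...and path.Split then yields an empty file component, which is the
    // value purgeuploads ends up indexing.
    _, file := path.Split(filePath)
    fmt.Printf("filePath=%q file=%q len=%d\n", filePath, file, len(file))
}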

I appreciate your time and input @milosgajdos. I also haven't really had a chance to give this the attention it needs.
I'm thinking I'll take someone's suggestion of trying to reproduce this in the s3 driver tests, hopefully sometime this or next week (if someone doesn't beat me to it).