oras-project/artifacts-spec

Proposal: Return all recursive references in referrer API

Closed this issue · 9 comments

Motivation

Right now it appears that in many use cases we consider all the artifacts referencing an image to be indivisible from the image. This means that whenever the image needs to be moved from system A to system B, all of the artifacts associated with it need to be moved along with it.

During that move operation, discovering all the artifacts associated with the image can be a costly operation that involves many recursive calls to the referrer API. Consider the following example:

Problem

If image net-monitor:v1 has daily scans, and each scan is signed, there would be the following image structure:

net-monitor:v1
   -> scan-result-1
       --> scan-result-1-signature
   -> scan-result-2
       --> scan-result-2-signature
...
   -> scan-result-365
       --> scan-result-365-signature

To move net-monitor:v1 and all its associated artifacts, the following network calls would be needed:

GET registry/v2/.../net-monitor:v1

GET oras/.../net-monitor:v1/referrers
GET registry/v2/.../scan-result-1
GET oras/.../scan-result-1/referrers
GET registry/v2/.../scan-result-1-signature
GET oras/.../scan-result-1-signature/referrers      // To make sure it has no children

GET registry/v2/.../scan-result-2
GET oras/.../scan-result-2/referrers
GET registry/v2/.../scan-result-2-signature
GET oras/.../scan-result-2-signature/referrers      

...

GET registry/v2/.../scan-result-365
GET oras/.../scan-result-365/referrers
GET registry/v2/.../scan-result-365-signature
GET oras/.../scan-result-365-signature/referrers 

A total of 731 referrers calls (1 for the image, plus 1 for each of the 365 scan results and 1 for each of the 365 signatures) are needed just to move this one image. This is extremely computationally expensive and may cause livelocks or be exploited for DDoS attacks.
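The 731 figure can be checked with a small simulation. This is only a sketch: the in-memory `referrers` map stands in for the registry, and `build_example` and `count_referrers_calls` are hypothetical helper names, not part of any ORAS client.

```python
from collections import deque


def build_example(days):
    """Map each artifact name to the artifacts that reference it directly."""
    referrers = {"net-monitor:v1": []}
    for i in range(1, days + 1):
        scan = f"scan-result-{i}"
        signature = f"{scan}-signature"
        referrers["net-monitor:v1"].append(scan)
        referrers[scan] = [signature]
        referrers[signature] = []  # a leaf, but the client still has to ask
    return referrers


def count_referrers_calls(referrers, root):
    """Count one referrers call per artifact discovered while walking the tree."""
    calls = 0
    queue = deque([root])
    while queue:
        node = queue.popleft()
        calls += 1  # stands in for one GET .../referrers request
        queue.extend(referrers[node])
    return calls


total_calls = count_referrers_calls(build_example(365), "net-monitor:v1")
print(total_calls)  # 1 + 365 + 365 = 731
```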

Proposed solution

The referrers API would accept a recursive=true query parameter.
When this is true, it would return all the artifacts transitively referencing an image in a flat list.

GET oras/.../net-monitor:v1/referrers?recursive=true

Result:
{
  "references": [
    {
      "digest": {scan-result-1.digest},
      "mediaType": "application/vnd.cncf.oras.artifact.manifest.v1+json",
      "artifactType": "oras.scan.result",
      "size": 312
    },
    {
      "digest": {scan-result-1-signature.digest},
      "mediaType": "application/vnd.cncf.oras.artifact.manifest.v1+json",
      "artifactType": "oras.signature",
      "size": 312
    },
...
    {
      "digest": {scan-result-365.digest},
      "mediaType": "application/vnd.cncf.oras.artifact.manifest.v1+json",
      "artifactType": "oras.scan.result",
      "size": 312
    },
    {
      "digest": {scan-result-365-signature.digest},
      "mediaType": "application/vnd.cncf.oras.artifact.manifest.v1+json",
      "artifactType": "oras.signature",
      "size": 312
    }
  ]
}

This would allow all 731 calls to this API to be shortened to just 1 call.
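To make the intended semantics concrete, here is a minimal sketch of how a registry might answer the request, assuming it keeps an in-memory map from a subject digest to the descriptors that reference it directly. The `store` layout, the `list_referrers` name, and the shortened digests are illustrative assumptions; only the descriptor fields and the `recursive` flag come from the proposal.

```python
MEDIA_TYPE = "application/vnd.cncf.oras.artifact.manifest.v1+json"


def descriptor(digest, artifact_type):
    """Build a descriptor shaped like the entries in the proposed response."""
    return {
        "digest": digest,
        "mediaType": MEDIA_TYPE,
        "artifactType": artifact_type,
        "size": 312,
    }


# Hypothetical store: subject digest -> descriptors referencing it directly.
# The digests are placeholders, not real content digests.
store = {
    "sha256:image": [
        descriptor("sha256:scan-1", "oras.scan.result"),
        descriptor("sha256:scan-2", "oras.scan.result"),
    ],
    "sha256:scan-1": [descriptor("sha256:scan-1-sig", "oras.signature")],
    "sha256:scan-2": [descriptor("sha256:scan-2-sig", "oras.signature")],
}


def list_referrers(store, digest, recursive=False):
    """Return direct referrers, or all transitive referrers as a flat list."""
    flat = []

    def visit(subject):
        for desc in store.get(subject, []):
            flat.append(desc)
            if recursive:
                visit(desc["digest"])  # descend into referrers of referrers

    visit(digest)
    return {"references": flat}


direct = list_referrers(store, "sha256:image")
everything = list_referrers(store, "sha256:image", recursive=True)
print([d["digest"] for d in everything["references"]])
```

The depth-first order keeps each signature next to the scan result it signs, matching the ordering in the sample response, although the proposal does not state that any ordering is guaranteed.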

Given that flat response list, how should I reconstruct the original tree structure?

As an example, how could I determine that scan-result-365-signature refers to scan-result-365, and not scan-result-11?

What's the user scenario behind it? I'm wondering if this could imply a new API, perhaps /index? My motivation was that I wanted to keep the current referrers response schema intact, and for my use case I didn't need that info.

Some possibilities:

Add subject field:

{
      "subject": {scan-result-365.digest},
      "digest": {scan-result-365-signature.digest},
      "mediaType": "application/vnd.cncf.oras.artifact.manifest.v1+json",
      "artifactType": "oras.signature",
      "size": 312
},

Nested responses (new API?):

{
      "digest": {scan-result-365.digest},
      "mediaType": "application/vnd.cncf.oras.artifact.manifest.v1+json",
      "artifactType": "oras.scan.result",
      "size": 312,
      "children": 
      [{
           "digest": {scan-result-365-signature.digest},
           "mediaType": "application/vnd.cncf.oras.artifact.manifest.v1+json",
           "artifactType": "oras.signature",
           "size": 312
      }]
},
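The two possibilities carry the same information. As a rough sketch, a client receiving the flat list with a `subject` field could rebuild the nested `children` form itself. The `nest` helper and the shortened digests are hypothetical, and the sketch assumes every artifact has exactly one subject.

```python
def nest(references, root_digest):
    """Rebuild the nested `children` form from a flat list with `subject`."""
    by_subject = {}
    for desc in references:
        by_subject.setdefault(desc["subject"], []).append(desc)

    def attach(subject):
        children = []
        for desc in by_subject.get(subject, []):
            # Copy the descriptor without `subject`; nesting already encodes it.
            node = {k: v for k, v in desc.items() if k != "subject"}
            grandchildren = attach(desc["digest"])
            if grandchildren:
                node["children"] = grandchildren
            children.append(node)
        return children

    return attach(root_digest)


# Placeholder digests for illustration only.
flat = [
    {"subject": "sha256:image", "digest": "sha256:scan-11",
     "artifactType": "oras.scan.result"},
    {"subject": "sha256:image", "digest": "sha256:scan-365",
     "artifactType": "oras.scan.result"},
    {"subject": "sha256:scan-365", "digest": "sha256:scan-365-sig",
     "artifactType": "oras.signature"},
]

tree = nest(flat, "sha256:image")
print(tree)
```

With the `subject` field present, the signature attaches to scan-result-365 and not to scan-result-11, which is exactly the ambiguity raised above.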

Don't you need that information to construct the same tree structure in the target registry/repo, that you're copying into?

If you really don't care about the tree structure, why not just store them all as referring to the root element directly?

As for that, those are implementation decisions that haven't been made yet 😛

My thoughts are that since the copy is incrementally executed manifest by manifest, the target system could incrementally build up an index containing the tree structure as it parses each newly copied manifest.
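A minimal sketch of that idea, assuming each copied manifest names the digest of the artifact it refers to (called `subject` here). The `ReferrerIndex` class and the digests are hypothetical and only illustrate that the index does not depend on the order in which manifests arrive.

```python
from collections import defaultdict


class ReferrerIndex:
    """Target-side index that grows as each copied manifest is parsed."""

    def __init__(self):
        # subject digest -> digests of the artifacts that reference it
        self._referrers = defaultdict(list)

    def add_manifest(self, digest, subject=None):
        """Record one copied manifest; `subject` is None for the root image."""
        if subject is not None and digest not in self._referrers[subject]:
            self._referrers[subject].append(digest)

    def referrers(self, digest):
        """Return the direct referrers known so far for `digest`."""
        return list(self._referrers.get(digest, []))


index = ReferrerIndex()
# Manifests may arrive in any order, e.g. a signature before its scan result.
index.add_manifest("sha256:scan-1-sig", subject="sha256:scan-1")
index.add_manifest("sha256:image")
index.add_manifest("sha256:scan-1", subject="sha256:image")
index.add_manifest("sha256:scan-2", subject="sha256:image")

print(index.referrers("sha256:image"))   # ['sha256:scan-1', 'sha256:scan-2']
print(index.referrers("sha256:scan-1"))  # ['sha256:scan-1-sig']
```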

Let's visualize the DAG as mentioned by @nelson-wu.

                +---------------+        +-----------------+
                | scan-result-1 |  ....  | scan-result-365 |
                +-+-----------+-+        +-+-------------+-+
                  |           |            |             |
                  |           |            |             |
                  v           v            v             v
+-----------------+-------+ +-+------------+-+ +---------+-----------------+
| scan-result-1-signature | | net-monitor:v1 | | scan-result-365-signature |
+-------------------------+ +----------------+ +---------------------------+

To copy net-monitor:v1 along with its n referrer artifacts, we need to fetch 2n + 1 manifests / blobs anyway, using the following requests:

GET /v2/net-monitor/manifests/<v1_digest>

GET /v2/net-monitor/manifests/<scan-result-1_digest>
GET /v2/net-monitor/blobs/<scan-result-1-signature_digest>
...
GET /v2/net-monitor/manifests/<scan-result-365_digest>
GET /v2/net-monitor/blobs/<scan-result-365-signature_digest>

Besides, clients need to recursively find the up edges / ancestors of net-monitor:v1 by n + 1 referrer API calls:

GET /oras/artifacts/v1/net-monitor/manifests/<v1_digest>/referrers

GET /oras/artifacts/v1/net-monitor/manifests/<scan-result-1_digest>/referrers
...
GET /oras/artifacts/v1/net-monitor/manifests/<scan-result-365_digest>/referrers

I think @nelson-wu is proposing to cover the following n requests

GET /oras/artifacts/v1/net-monitor/manifests/<scan-result-1_digest>/referrers
...
GET /oras/artifacts/v1/net-monitor/manifests/<scan-result-365_digest>/referrers

in a single referrer API call

GET /oras/artifacts/v1/net-monitor/manifests/<v1_digest>/referrers

so that we can reduce the overall 3n + 2 requests to 2n + 2 requests. @nelson-wu, correct me if I have misinterpreted your idea.
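The arithmetic in this comment can be spelled out as follows. It is a sketch of the counting argument only, with hypothetical function names, and it follows this comment's model in which each scan result has its signature attached as a blob.

```python
def requests_without_recursion(n):
    """(2n + 1) manifest/blob fetches plus (n + 1) referrers calls."""
    fetches = 2 * n + 1       # the image, n scan results, n signature blobs
    referrers_calls = n + 1   # the image plus each of the n scan results
    return fetches + referrers_calls


def requests_with_recursion(n):
    """Same fetches, but a single referrers call on the image."""
    return (2 * n + 1) + 1


n = 365
print(requests_without_recursion(n))  # 3n + 2 = 1097
print(requests_with_recursion(n))     # 2n + 2 = 732
```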

Is the question related to ordering of the references?
While we’ll want to support pulling all references, it’s possible a client may want to limit how many referred artifacts are copied. You may only want the last n scans, or the last n signatures. Or, the signatures from a given entity.
This implies some additional filtering and ordering.

If the limiting/filtering is motivated by performance concerns, yes. The idea behind this is users may not want to traverse the whole tree to find what they want out of performance concerns. We could have some sort of filter, returning top N of property X. This would allow clients to get a much smaller list of artifacts.

However, I think the potential issue is that we'd have to implement a limited set of filters that may or may not match client-side needs. If customers want something more specific, e.g. "I want the top N artifacts that have this annotation but not this other annotation, sorted by date ascending", we'd end up having to implement a whole SQL query engine.

I think we could provide a top-down view of the whole artifact graph, similar to a sitemap, and customers can decide for themselves what to pull. It would allow them to address their own performance concerns, and give them more freedom to do their own filtering.
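As a rough illustration of client-side filtering over such a complete listing, the sketch below picks the most recent n artifacts of one type. It assumes each descriptor carries an `annotations` map with a `created` timestamp. That key, the `latest` helper, the digests, and the dates are all hypothetical and not part of the current response schema.

```python
from datetime import datetime


def latest(references, artifact_type, n):
    """Return the n most recently created descriptors of the given type."""
    matching = [d for d in references if d["artifactType"] == artifact_type]
    matching.sort(
        key=lambda d: datetime.fromisoformat(d["annotations"]["created"]),
        reverse=True,
    )
    return matching[:n]


# Illustrative data only.
references = [
    {"digest": "sha256:scan-1", "artifactType": "oras.scan.result",
     "annotations": {"created": "2021-06-01T00:00:00+00:00"}},
    {"digest": "sha256:scan-1-sig", "artifactType": "oras.signature",
     "annotations": {"created": "2021-06-01T00:05:00+00:00"}},
    {"digest": "sha256:scan-2", "artifactType": "oras.scan.result",
     "annotations": {"created": "2021-06-02T00:00:00+00:00"}},
    {"digest": "sha256:scan-3", "artifactType": "oras.scan.result",
     "annotations": {"created": "2021-06-03T00:00:00+00:00"}},
]

last_two_scans = latest(references, "oras.scan.result", 2)
print([d["digest"] for d in last_two_scans])
```

Because the selection runs on the client, more specific rules (for example, excluding artifacts that carry a given annotation) become ordinary code changes and need no new server-side filters.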

Performance is always a good thing to think about, but is that the motivator here?

The case of 365+ scans on an image is interesting, although I'm wondering if we'd really even hit 365 as a total. How many images last that long before they are rebuilt and replaced with a newer version (tag)? Do we need to re-scan archived images that are maintained for compliance reasons and are no longer in deployment?

Are the images actually scanned every day? Or would you do an initial scan and catalog what's in the image? Then, when a new Java vulnerability is discovered, the scanner consults its inventory and scans the Java-based images. Inventorying on the SBOM is even more interesting. Should we optimize the scans, and have fewer, more accurate ones?

If you have a history of scans, do you need all of them? Or, just the last n from each signing authority?

I'd just suggest we start small and iterate, based on specific use cases.
If we can sort on a created annotation, filter on a type, and page the results, what's possible and what isn't?

Closing for now. As we have more usage, we can reconsider, and reactivate.