Enhance OCI Artifact support

There is limited support for OCI artifacts that hadn't been built with Flux CLI.

One can use spec.layerSelector.mediaType, however it doesn't handle searching images with an index.
There is also spec.layerSelector.operation that can be set to copy or extract.

There are also some non-configurable assumptions:

it's only possible to select either just the first layer or the first one that matches the given media type
spec.layerSelector.operation: copy writes files with .tgz extension, layer content be any kind of format, e.g. JSON

Proposal:

allow serach from index, with support for nested indecies also
change ops constant names to Copy and Extract to match Kubernetes style
provide some way of setting name of the destination files for Copy, by default it could be derrived from the media type name, e.g. tar+gz vs json+gz
allow specifying layer offset with default being 0
allow exacting multiple layers, perhaps this could be accomplished with operation: MergeDir and operation: LayerPerDir

Some of these changes may be more complex than others, just opening this for a discussion.

allow serach from index, with support for nested indecies also

This may need additional parameters, e.g. it's possible that an image has two components that have the same media types, we will need a way to decide which one to pick. We might have to just error out.
Using an offset like with layers won't suffice in this case, as order is not an inherent property of an index, unlike with layers. Annotation could be used, but that requires modifying the image, which may not be possible, so perhaps an error message would be the best start. That error message could suggest the user set a reference to an exact digest instead.

provide some way of setting name of the destination files for Copy

The key question to me here is how names and formats of files are surfaced in the interface with other controllers. Is kustomize-controller expecting <revision>.tgz files?

This may need additional parameters, e.g. it's possible that an image has two components that have the same media types, we will need a way to decide which one to pick. We might have to just error out.

Care to explain a bit more? I don't understand this.

allow serach from index, with support for nested indecies also

This could help where several artifacts form a product to be deployed. Operators could provide an index with all artifacts. With the other proposed, we could then have a ready directory to be passed to downstream consumers.

I think first we should optimize the pulling, instead of downloading the whole artifact like we do here

source-controller/internal/controller/ocirepository_controller.go

Line 458 in 33dd859

img, err := remote.Image(ref, opts...)

We should get the manifest, and download only the selected layer.

I think first we should optimize the pulling, instead of downloading the whole artifact...

@stefanprodan it's possibe that just using remote.Image instead of crane.Pull (per #1244) might have solved that issue already, let's verify that. What I inteded with this issue is to start a discussion about the functionality the API could offer for supporting wide range of images.

@errordeveloper you're right, the comment was left in place from when we used crane, but it doesn't pull anything else but the manifest, following the code from remote.Image it ends up doing:

	u := f.url("manifests", ref.Identifier())
	req, err := http.NewRequest(http.MethodGet, u.String(), nil)
	if err != nil {
		return nil, nil, err
	}

This may need additional parameters, e.g. it's possible that an image has two components that have the same media types, we will need a way to decide which one to pick. We might have to just error out.

Care to explain a bit more? I don't understand this.

It's in the case of what do you do if index points at several artefacts that all match selection criteria. Using a directory per your suggestions

allow serach from index, with support for nested indecies also

This could help where several artifacts form a product to be deployed. Operators could provide an index with all artifacts. With the other proposed, we could then have a ready directory to be passed to downstream consumers.

My primary use-case for this one was image with attestations stored along with configs. But it can be what you descirbed also, for sure, but I am not sure what sort of directory structure could be derived. I guess this is also similar to the idea of merge operation for multiple layers...
Maybe we could use annotations for directory names, but that will mean images will need to built in a very specific ways with those annotations.

On the other hand, it might be wise to put composition out of scope, maybe there is room for another abstractions like RepositoryComposition?

@stefanprodan had there been requests about composing Git repos? Are submodules the go-to recommendation?

I guess that there is fundamental difference with OCI, in the way that you can staff anything into it without meanigful path keys where uniqueness is guaranteed. With Git, every object has a meaningful and unique paths. With OCI it's just hashes and paths are only defined within a tarball, which is below OCI spec, so there can be paths conflicts, if you have an image with multiple tarballs.

had there been requests about composing Git repos? Are submodules the go-to recommendation?

Yes, see my answer here: #854 (comment)

operation: MergeDir

I vote for operation: ExtractAll

With OCI it's just hashes and paths are only defined within a tarball, which is below OCI spec, so there can be paths conflicts, if you have an image with multiple tarballs.

For Timoni modules, I opted to use the same media type for all content layers, and at pull time, the filesystem gets unified.

{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.timoni.config.v1+json",
    "size": 342,
    "digest": "sha256:4808b5d01cb574e2d5c823e3af98f41af2e4965e87ddb90a3df311a47e302b0a"
  },
  "layers": [
    {
      "mediaType": "application/vnd.timoni.content.v1.tar+gzip",
      "size": 189009,
      "digest": "sha256:d865c7899bccf1bce740ccfbfa4513f9f97acbdb7c03fbc60dd2153ee9157a68",
      "annotations": {
        "sh.timoni.content.type": "module/vendor"
      }
    },
    {
      "mediaType": "application/vnd.timoni.content.v1.tar+gzip",
      "size": 4490,
      "digest": "sha256:1ccb6accebd348dee5047de0316e229156525487b86611945c148fda9e5dfc22",
      "annotations": {
        "sh.timoni.content.type": "module"
      }
    }
  ],
  "annotations": {
    "org.opencontainers.image.created": "2023-09-28T23:58:44+03:00",
    "org.opencontainers.image.revision": "1.1.0",
    "org.opencontainers.image.source": "https://github.com/stefanprodan/timoni.git"
  }
}

If source-controller would unpack all the application/vnd.timoni.content.v1.tar+gzip layers in the same dir, then it would do the same thing as Timoni and the resulting Flux artifact would be Ok. The layer annotations here are just hints to humans, but the actual files inside will never overlap, the sha256:d865c7899bccf1bce740ccfbfa4513f9f97acbdb7c03fbc60dd2153ee9157a68 layer contains only the cue.mod dir, and the sha256:1ccb6accebd348dee5047de0316e229156525487b86611945c148fda9e5dfc22 layer the templates, values, etc.

I would really like for Flux to be able to download and extract all the layers based on the media type selector, in this case it would be application/vnd.timoni.content.v1.tar+gzip.

Push code is here https://github.com/stefanprodan/timoni/blob/a27dada5ab6e3bb356f343edd1c675ff75f34ca4/internal/oci/push_module.go#L42