spdx/spdx-3-model

`hasDataFile` would benefit from a better description

Closed this issue · 10 comments

The following line does not explain what the relationship means or how it should be used. It is missing context, and "data file" is a very generic term.

- hasDataFile: The `from` Element treats each `to` Element as a data file

Looks like it was just copied from the SPDX 2.X definitions.

@maxhbr If you want to create a PR, we can get this into 3.0, otherwise I'll target it for 3.1

If someone can explain to me what a "data file" is, sure. But then it might be easier for them to create the PR than to explain it to me.

It has been a very long time since this was discussed.

I'll move it to 3.1 unless someone wants to volunteer to write a PR.

@rgopikrishnan91 - do you want to take a pass at this description?

bact commented

In this example SBOM, I use it to document that a classifier script has a classifier model (data file).

[ predict.py ] --hasDataFile--> [ model.bin ]

{
    "type": "Relationship",
    "relationshipType": "hasDataFile",
    "from": "https://spdx.org/spdxdocs/File11",
    "to": [
        "https://spdx.org/spdxdocs/File10"
    ]
}
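
A minimal sketch of building the same relationship programmatically and letting `json.dumps` handle the punctuation. The helper name `make_relationship` is hypothetical, the field names are taken from the snippet above, and the mapping of File11 to predict.py and File10 to model.bin is inferred from the arrow diagram. A complete SPDX document would likely need additional fields that are omitted here.

```python
import json


def make_relationship(relationship_type, from_id, to_ids):
    """Build a dict shaped like the Relationship snippet above (hypothetical helper)."""
    return {
        "type": "Relationship",
        "relationshipType": relationship_type,
        "from": from_id,
        "to": list(to_ids),  # "to" is always a list, even for a single target
    }


rel = make_relationship(
    "hasDataFile",
    "https://spdx.org/spdxdocs/File11",    # assumed to be predict.py
    ["https://spdx.org/spdxdocs/File10"],  # assumed to be model.bin
)

# json.dumps emits well-formed JSON (no missing or trailing commas).
serialized = json.dumps(rel, indent=4)
print(serialized)
```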
bact commented

Opened PR #815 to get more suggestions.

I left a review on the PR, but here are my thoughts:

Generic database file --> File
I am unsure whether hasDataFile is the right name. Should it be hasAsset? Using hasDataFile for a model executable and a log file seems odd and non-intuitive.
Also, maybe we should clarify how it differs from the dependsOn relationship.

@bact @kestewart

Should it be hasAsset?

In other parts of the SPDX spec, we've used the term Artifact to mean something more general than a file.

bact commented

I agree with @rgopikrishnan91 that we should clarify how hasDataFile (or hasAsset / hasArtifact) is different from dependsOn.

I can see that hasDataFile doesn't imply dependency, while dependsOn explicitly does.

hasDataFile also suggests that the to Element should be a File (although I don't think it is enforced by the model), while the to of dependsOn can be anything.

In this sense, dependsOn works at a more abstract level, while hasDataFile describes implementation-level detail.
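
Since the model does not appear to enforce that the to Element is a File, a tool could flag that case as a warning. This is only a sketch under assumptions: `lint_has_data_file` and the `element_types` lookup (element ID to type name) are hypothetical and not part of any SPDX library.

```python
def lint_has_data_file(relationship, element_types):
    """Return warnings for hasDataFile targets that are not typed as File.

    relationship:  dict with "relationshipType" and "to" keys
    element_types: hypothetical lookup from element ID to its type name
    """
    # Other relationship types (e.g. dependsOn) may point at anything.
    if relationship.get("relationshipType") != "hasDataFile":
        return []
    warnings = []
    for target in relationship.get("to", []):
        kind = element_types.get(target, "unknown")
        if kind != "File":
            warnings.append(
                f"{target}: hasDataFile target has type {kind}, expected File"
            )
    return warnings


example_rel = {
    "relationshipType": "hasDataFile",
    "from": "https://spdx.org/spdxdocs/File11",
    "to": ["https://spdx.org/spdxdocs/File10"],
}
element_types = {"https://spdx.org/spdxdocs/File10": "File"}

print(lint_has_data_file(example_rel, element_types))
```

The check only warns and never rejects, which matches the observation that the File expectation is suggested by the name rather than enforced.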

bact commented

At yesterday's AI team meeting (2024-07-31), we settled on this:

  • hasDataFile: The from Element treats each to Element as a data file. A data file is an artifact that stores data required or optional for the from Element's functionality. A data file can be a database file, an index file, a log file, an AI model file, a calibration data file, a temporary file, a backup file, and more. For AI training dataset, test dataset, test artifact, configuration data, build input data, and build output data, please consider using the more specific relationship types: trainedOn, testedOn, hasTest, configures, hasInputs, and hasOutputs, respectively. This relationship does not imply dependency.

(See #815.) @maxhbr, do you think this is sufficient? Thank you.
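
The guidance in the proposed description can be sketched as a lookup that prefers a more specific relationship type and falls back to hasDataFile. The function name and the artifact-kind labels are hypothetical. The relationship type names are copied as written in the proposed text and should be checked against the published specification.

```python
# Artifact kinds with a more specific relationship type, per the proposed text.
SPECIFIC_TYPES = {
    "ai training dataset": "trainedOn",
    "test dataset": "testedOn",
    "test artifact": "hasTest",
    "configuration data": "configures",
    "build input data": "hasInputs",
    "build output data": "hasOutputs",
}


def suggest_relationship_type(artifact_kind):
    """Suggest a relationship type name for an artifact kind (hypothetical helper).

    Falls back to hasDataFile for kinds without a more specific type,
    such as a database file, log file, or AI model file.
    """
    return SPECIFIC_TYPES.get(artifact_kind.strip().lower(), "hasDataFile")


for kind in ["AI training dataset", "log file", "configuration data"]:
    print(kind, "->", suggest_relationship_type(kind))
```

The sketch only picks a type name. Which element belongs in from and which in to differs between relationship types, so the direction still needs to be checked for each one.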