metaschema predicates pollute salad document namespace
mr-c opened this issue · 8 comments
Alternative title: CWL enum symbols get fully expanded instead of being left as pure strings
Source issue: common-workflow-language/cwljava#70
For example, CommandInputEnumSchema https://github.com/common-workflow-language/cwl-v1.2/blob/a0f2d38e37ff51721fdeaf993bb2ab474b17246b/CommandLineTool.yml#L355 extends InputEnumSchema https://github.com/common-workflow-language/cwl-v1.2/blob/a0f2d38e37ff51721fdeaf993bb2ab474b17246b/Process.yml#L689 which extends sld:EnumSchemahttps://github.com/common-workflow-language/cwl-v1.2/blob/a0f2d38e37ff51721fdeaf993bb2ab474b17246b/salad/schema_salad/metaschema/metaschema_base.yml#L135 which defines the symbols field as follows:
type: string[]
jsonldPredicate:
_id: "sld:symbols"
_type: "@id"
identity: true
doc: "Defines the set of valid syThe _type: "@id" and identity: true might be fine for the metaschema of Schema Salad, but is inappropriate for a CWL Enum definition, where we expect plain strings.
cwltool ignores this, but codegen readers of CWL (cwljava, cwl-utils) cannot.
(all of this is also true for the Output and non-Command versions of EnumSchema in CWL)
I can't find a way to override this with a specializeFrom/To nor with overlapping field definitions, nor with re-rooting the CWL schema to not derive from sld:EnumSchema as the metaschema predicates collide with the CWL schema predicates
schema_salad.exceptions.SchemaException: Predicate collision on symbols, '{'@id': 'https://w3id.org/cwl/salad#symbols', '@type': '@id', 'identity': True}' != 'https://w3id.org/cwl/cwl#InputEnumSchema/symbols'
(even if I use jsonldPredicate: { _id: "sld:symbols" } ):
schema_salad.exceptions.SchemaException: Predicate collision on symbols, '{'@id': 'https://w3id.org/cwl/salad#symbols', '@type': '@id', 'identity': True}' != '{'@id': 'https://w3id.org/cwl/salad#symbols'}'
Fix options:
- Hack the schema (fixes the codegen, but introduces big problems down the line): Re-root the CWL schema to not extend
sld:EnumSchemaand renamesymbolsfrom the metaschema to be something else (and update all other schema salad documents to use this new name in their definitions) - Hack the codegen (manually change the type of the symbols field post generation): would unblock common-workflow-language/cwljava#70 but it not sustainable
- Scope the predicates situation so that schema-salad documents themselves can use any predicate name from the metaschema, and then apply fix 1 without renaming the existing use of
symbolsin the metaschema. I like this, but I don't know how to implement it yet - Do nothing. Consumers of the CWL codegen libraries will be forced to subtract out the ID of the parent object to get the true values of CWL
type: enums.
Another option that might work: remove https://github.com/common-workflow-language/cwl-v1.2/blob/a0f2d38e37ff51721fdeaf993bb2ab474b17246b/Process.yml#L15 and copy in the salad metaschema parts we need, making the one change to the symbols definition
The TypeScript version has the same issue: https://runkit.com/zimmera/61e180e8b299c2000851e088 (output expandable at bottom)
@ZimmerA Thanks. Yeah, it is a schema issue; see common-workflow-language/cwl-v1.2#145 for my proposed fix
Enums are not intended to be pure strings, they're intended to be qualified URIs, but part of the vocabulary so you're able to use the shortened version. It sounds like it is behaving as designed, but the design is surprising.
Enums are not intended to be pure strings, they're intended to be qualified URIs, but part of the vocabulary so you're able to use the shortened version. It sounds like it is behaving as designed, but the design is surprising.
I agree for Schema Salad, but not for CWL.
CWL uses enums to create command line interfaces
https://www.commonwl.org/v1.2/CommandLineTool.html#CommandLineBinding doesn't say that we only use the short versions for the command line (but in fact cwltool does so)
So I think in practice the concepts are different.
@tetron common-workflow-language/cwltool#1591 shows that there is no cwltool unit test nor CWL conformance test that relies on enum symbols being qualified URIs; therefore I argue that they are factually not so and the schema proposal in common-workflow-language/cwl-v1.2#145 should be accepted for CWL v1.2.1 to confirm that.
Ok, I refreshed my recollection of what cwltool does.
cwltool has been operating under the premise that symbols are expanded to URIs for a long time, e.g.
$ cwltool --print-pre blah.cwl
INFO /home/peter/work/cwltool/venv3/bin/cwltool 3.1.20220103201022
INFO Resolved 'blah.cwl' to 'file:///home/peter/work/tmp/enum/blah.cwl'
{
"arguments": [
{
"prefix": "species",
"valueFrom": "$(inputs.first.species)"
},
{
"prefix": "ncbi_build",
"valueFrom": "$(inputs.first.ncbi_build)"
}
],
"baseCommand": "echo",
"class": "CommandLineTool",
"cwlVersion": "v1.2",
"id": "file:///home/peter/work/tmp/enum/blah.cwl",
"inputs": [
{
"id": "file:///home/peter/work/tmp/enum/blah.cwl#first",
"type": "file:///home/peter/work/tmp/enum/blah.cwl#vcf2maf_params"
}
],
"outputs": [
{
"id": "file:///home/peter/work/tmp/enum/blah.cwl#result",
"outputBinding": {
"glob": "8e100d950d274f5732d81f3ac0bf3712b420fa07"
},
"type": "File"
}
],
"requirements": [
{
"class": "SchemaDefRequirement",
"types": [
{
"fields": [
{
"name": "file:///home/peter/work/tmp/enum/blah.cwl#vcf2maf_params/ncbi_build",
"type": [
"null",
{
"symbols": [
"file:///home/peter/work/tmp/enum/blah.cwl#vcf2maf_params/ncbi_build/GRCh37",
"file:///home/peter/work/tmp/enum/blah.cwl#vcf2maf_params/ncbi_build/GRCh38",
"file:///home/peter/work/tmp/enum/blah.cwl#vcf2maf_params/ncbi_build/GRCm38"
],
"type": "enum"
}
]
},
{
"name": "file:///home/peter/work/tmp/enum/blah.cwl#vcf2maf_params/species",
"type": [
"null",
{
"symbols": [
"file:///home/peter/work/tmp/enum/blah.cwl#vcf2maf_params/species/homo_sapiens",
"file:///home/peter/work/tmp/enum/blah.cwl#vcf2maf_params/species/mus_musculus"
],
"type": "enum"
}
]
}
],
"name": "file:///home/peter/work/tmp/enum/blah.cwl#vcf2maf_params",
"type": "record"
}
]
}
],
"stdout": "8e100d950d274f5732d81f3ac0bf3712b420fa07"
}
What happens is that for the purposes of validation, it gets the short name of the URI. How to get the short name is described in the schema salad spec:
https://www.commonwl.org/v1.2/SchemaSalad.html#Short_names
cwl-utils / cwljava should provide a conforming "shortname" function if it doesn't already.
The python_codegen itself also assumes enum symbols are URIs, and calls safe_name (which computes the short name).
return self.declare_type(
TypeDef(
self.safe_name(type_declaration["name"]) + "Loader",
'_EnumLoader(("{}",))'.format(
'", "'.join(
self.safe_name(sym)
for sym in type_declaration["symbols"]
)
),
)
)
From my perspective, the correct thing to do is 4
Do nothing. Consumers of the CWL codegen libraries will be forced to subtract out the ID of the parent object to get the true values of CWL type: enums.
I feel that the proposed solution (making enum symbols unparsed strings) is effectively redefining the semantics of the enum data model which is inappropriate for a 1.2.1 version. It would be better to ensure existing validation and Command Line behavior is properly specified (i.e. derive the short name from the URI and check the string against that) for the purposes of checking the source document.
I agree that https://www.commonwl.org/v1.2/CommandLineTool.html#CommandLineBinding doesn't explicitly describe behavior for enums. However, in practice the current behavior is that it accepts short name as a string, and then the command line behavior is identical as for a string.