Support generating multiple schemas (for serialize vs deserialize) from the same struct
untitaker opened this issue · 9 comments
I have a lot of types whose deserialization behavior allows for more types than the serialization does. For example:
// Deserialize is implemented such that any non-string JSON object is stringified
#[derive(Serialize)]
struct LenientString(String);
#[derive(JsonSchema, Serialize, Deserialize)]
struct Foo {
bar: LenientString
}
Therefore I would like to generate two schemas per type: A schema describing the input that is allowed, and a schema that describes the output that is guaranteed.
Is this kind of feature something you'd be interested in merging? I am on the fence as to whether this is sensible. In some languages I would've created separate input and output types, but as serde allowed me to do everything with one type this is what I ended up with. But I think that adding this feature to schemars is easier than refactoring my existing codebase.
Interesting idea, this could indeed be useful. I would certainly consider altering the behaviour of json_schema()
to return the "deserialize" schema, and having a separate function to return the "serialize" schema (or vice versa). My main concern would be how much this increases the output binary size and/or compile times, but it may not be a big problem. If it is, it may be worth making it optional via a feature flag
This would also fix #25, since aliases are only significant when deserializing
Has there been any discussion on how this could be implemented?
As far as I can tell, generating examples is done via Serde serialize. This would mean, it's necessary to create at least one copy (as in generating a new struct at compiletime in a macro, not copying in the sense of struct.copy()
) of the struct that's deriving JsonSchema
to generate proper examples when #[serde(skip_serializing)]
is used. Otherwise those fields would be missing in a deserialize schema example.
Some thoughts I've had on how to deal with this:
Related issues:
The current state of play
Example 1
#[derive(JsonSchema, Serialize, Deserialize)]
pub struct MyStruct {
#[serde(alias = "my_i32")]
pub my_int: i32,
pub my_bool: bool,
}
The current schema produced by schemars is:
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"title": "MyStruct",
"type": "object",
"properties": {
"my_bool": {
"type": "boolean"
},
"my_int": {
"type": "integer",
"format": "int32"
}
},
"required": ["my_int", "my_bool"]
}
This schema does not accurately describe the deserialization contract of the type when using serde_json, because it would not successfully validate a value using the my_i32
alias even though it would be accepted by serde, e.g.
{
"my_i32": 123,
"my_bool": false
}
While the schema does "correctly" describe the serialization contract of the type (in that all possible serialized values would successfully validate against the schema), the schema is not "specific" because it also allows unknown properties. The schema could be made more specific by including "additionalProperties": false
(or "unevaluatedProperties": false
), but this would arguably make the schema diverge further from the correct deserialization contract, which accepts (but ignores) unknown properties.
Example 2
#[derive(JsonSchema, Serialize, Deserialize)]
pub struct MyStruct2 {
#[serde(skip_serializing)]
pub write_only: i32,
#[serde(skip_deserializing)]
pub read_only: bool,
}
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"title": "MyStruct2",
"type": "object",
"properties": {
"read_only": {
"type": "boolean",
"default": false,
"readOnly": true
},
"write_only": {
"type": "integer",
"format": "int32",
"writeOnly": true
}
},
"required": ["write_only"]
}
This uses the readOnly
and writeOnly
JSON schema keywords, which are part of the meta-data vocabulary, so (as far as I know) have no effect when validating a value against a schema.
This schema arguably does not quite accurately describe the deserialization contract because it restricts the read_only
property to a boolean, even though a value like {"read_only":"string", "write_only":0}
would be allowed by serde (but only because the read_only
property string value is treated the same as an unknown keyword).
And this schema does not accurately describe the serialization contract because it includes write_only
as a required property, even though that property is never included in the serialized value, causing all possible serialized values to fail validation against the schema.
In summary: Schemars currently isn't sure whether it should be describing the serialization or deserialization contract for a type, so it ends up doing both of them quite badly.
Potential Solution - Extending SchemaSettings
When creating a SchemaSettings
, the consumer could select which contract they would like the generated schema to describe via new field/enum e.g.
pub struct SchemaSettings {
/* snip: existing SchemaSettings fields... */
pub contract: SchemaContract,
}
pub enum SchemaContract {
Serialize,
Deserialize,
}
let settings = SchemaSettings::draft2020_12().with(|s| {
s.contract = SchemaContract::Deserialize;
});
// Or maybe add a helper method like:
// let settings = SchemaSettings::draft2020_12().for_deserialize();
let gen = settings.into_generator();
let schema = gen.into_root_schema_for::<MyStruct>();
contract
is the first name that came to mind but I'm open to other ideas - maybe SerializeMode
or something? Naming things is hard!
We would then need to update implementations of JsonSchema::json_schema()
to respect the new setting where necessary - most of this work would be in schemars_derive
.
SchemaContract::Serialize
should be fairly straightforward - the generated schema should (this is more of a target than a guarantee):
- successfully validate any instance of the type having been serialized to JSON
- fail validation for any JSON value for which no instance of the rust type would serialize to
In other words, the Serialize
schema generated for a type T
should successfully validate a JSON value if and only if there exists some possible value of T
that serializes to that JSON value.
But what would SchemaContract::Deserialize
mean? The most objective way to define it would be that the generated schema should successfully validate a JSON value if and only if serde_json would successfully deserialize that value to the relevant rust type. This would then continue to allow unknown fields for structs without the deny_unknown_fields
attribute.
But should SchemaContract::Deserialize
be stricter? Should it try to just match values that "usefully" deserialize, rather than ones that "successfully" deserialize? Or maybe it should be configurable - we could do this by any of:
- Splitting
SchemaContract::Deserialize
into separateSchemaContract::DeserializeLoose
andSchemaContract::DeserializeStrict
- Making
SchemaContract::Deserialize
a struct-style variant with adeny_unknown_fields
flag - Adding a
deny_unknown_fields
flag toSchemaSettings
, where the flag is only relevant when usingSchemaContract::Deserialize
Alternative designs
Separate flags on SchemaSettings
pub struct SchemaSettings {
/* snip: existing SchemaSettings fields... */
/// Whether `skip_serializing` fields are included in schema
pub include_readonly: bool,
/// Whether `skip_deserializing` fields are included in schema
pub include_writeonly: bool,
/// Whether the schema accepts aliases as property names
pub include_aliases: bool,
/// If `true`, schema will not allow properties that would be ignored during deserialization,
/// even if they would not cause deserialization to fail
pub deny_unknown_fields: bool,
}
This gives more control to consumers about exactly what they want in the generated schema, but with more control comes more complexity. The flags in the example above wouldn't even be sufficient, e.g. they don't surface any way of specifying whether the schema should use the "serialize" or "deserialize" name for fields that set them differently.
Separate trait methods on JsonSchema
Instead of adding new properties to SchemaSettings
, we could have separate methods on the JsonSchema
trait, e.g.
pub trait JsonSchema {
fn json_schema_for_deserialize(gen: &mut SchemaGenerator) -> Schema;
fn json_schema_for_serialize(gen: &mut SchemaGenerator) -> Schema;
}
...and the existing json_schema
method could either be removed, or kept as generating a "hybrid" schema as it does today.
I don't like this option mostly because it practically doubles the amount of work it takes to implement JsonSchema
(whether it's manually implemented or macro-derived), increasing code complexity and compile times. Even when deriving JsonSchema
on a type with no special serialize/deserialize behaviour (i.e. no alias
/skip_serializing
attributes etc.), it would need two separate implementations in order to call the correct json_schema_for_xxx
method for each of its fields.
So, let's say for sake of argument that we go with extending the schemars API like so:
pub struct SchemaSettings {
/* snip: existing SchemaSettings fields... */
pub contract: SchemaContract,
pub deny_unknown_fields: bool,
}
pub enum SchemaContract {
Serialize,
Deserialize,
}
Then what would the schema for the examples in the previous comment look like?
Example 1
#[derive(JsonSchema, Serialize, Deserialize)]
pub struct MyStruct {
#[serde(alias = "my_i32")]
pub my_int: i32,
pub my_bool: bool,
}
When using SchemaContract::Serialize
:
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"title": "MyStruct",
"type": "object",
"properties": {
"my_int": {
"type": "integer",
"format": "int32"
},
"my_bool": {
"type": "boolean"
}
},
"required": [
"my_int",
"my_bool"
],
"additionalProperties": false /* could use "unevaluatedProperties" instead of "additionalProperties" */
}
When using SchemaContract::Deserialize
with deny_unknown_fields: false
:
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"title": "MyStruct",
"type": "object",
"properties": {
"my_bool": {
"type": "boolean"
}
},
"oneOf": [
{
"properties": {
"my_int": {
"type": "integer",
"format": "int32"
}
},
"required": [
"my_int"
]
},
{
"properties": {
"my_i32": {
"type": "integer",
"format": "int32"
}
},
"required": [
"my_i32"
]
}
],
"required": [
"my_bool"
]
}
Or, it could include the alias schemas in the top-level properties
:
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"title": "MyStruct",
"type": "object",
"properties": {
"my_int": {
"type": "integer",
"format": "int32"
},
"my_i32": {
"type": "integer",
"format": "int32"
},
"my_bool": {
"type": "boolean"
}
},
"oneOf": [
{
"required": [
"my_int"
]
},
{
"required": [
"my_i32"
]
}
],
"required": [
"my_bool"
]
}
When using SchemaContract::Deserialize
with deny_unknown_fields: true
, the only difference is that we would need to set additionalProperties
or unevaluatedProperties
. Putting the alias schemas inside the oneOf
means it would have to be unevaluatedProperties
:
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"title": "MyStruct",
"type": "object",
"properties": {
"my_bool": {
"type": "boolean"
}
},
"oneOf": [
{
"properties": {
"my_int": {
"type": "integer",
"format": "int32"
}
},
"required": [
"my_int"
]
},
{
"properties": {
"my_i32": {
"type": "integer",
"format": "int32"
}
},
"required": [
"my_i32"
]
}
],
"required": [
"my_bool"
],
"unevaluatedProperties": false
}
By putting them in the top-level properties
, we can use additionalProperties
OR unevaluatedProperties
:
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"title": "MyStruct",
"type": "object",
"properties": {
"my_int": {
"type": "integer",
"format": "int32"
},
"my_i32": {
"type": "integer",
"format": "int32"
},
"my_bool": {
"type": "boolean"
}
},
"oneOf": [
{
"required": [
"my_int"
]
},
{
"required": [
"my_i32"
]
}
],
"required": [
"my_bool"
],
"additionalProperties": false /* could use "unevaluatedProperties" instead of "additionalProperties" */
}
Example 2
#[derive(JsonSchema, Serialize, Deserialize)]
pub struct MyStruct2 {
#[serde(skip_serializing)]
pub write_only: i32,
#[serde(skip_deserializing)]
pub read_only: bool,
}
When using SchemaContract::Serialize
, the write_only
would be omitted entirely:
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"title": "MyStruct",
"type": "object",
"properties": {
"read_only": {
"type": "boolean",
"default": false,
"readOnly": true
}
},
"required": [
"read_only"
],
"additionalProperties": false /* could use "unevaluatedProperties" instead of "additionalProperties" */
}
Conversely, SchemaContract::Deserialize
would omit read_only
. So with deny_unknown_fields: false
:
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"title": "MyStruct",
"type": "object",
"properties": {
"write_only": {
"type": "integer",
"format": "int32",
"writeOnly": true
}
},
"required": [
"write_only"
]
}
And with deny_unknown_fields: true
:
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"title": "MyStruct",
"type": "object",
"properties": {
"write_only": {
"type": "integer",
"format": "int32",
"writeOnly": true
}
},
"required": [
"write_only"
],
"additionalProperties": false /* could use "unevaluatedProperties" instead of "additionalProperties" */
}
Although something else to keep in mind is how to support use-cases that want to generate schemas for both the "serialize" and "deserialize" contract, for example when generating an openapi document.
Currently, there's no way to modify the SchemaSettings
once they become part of a SchemaGenerator
. But I don't see any major problems with introducing an method like SchemaGenerator::settings_mut(&mut self) -> &mut SchemaSettings
, which would then allow generating some "serialize" schemas and some "deserialize" schemas with the same generator (by mutating its settings).
But we would still have a problem if you try to generate both a "serialize" and "deserialize" schema for the same type - the first one you generate would add the schema to the definitions ($defs
), and then any future schemas for that type would just re-use the earlier schema via a $ref
, even if the contract has changed. So, we would also need to make the generator "remember" the contract that was used for each schema that has been added to its definitions. I don't think that would be too hard - it should mostly just be altering the schema_id_to_name
field to be keyed off of a tuple containing both the schema ID and the contract. A minor downside of this strategy would be that generating both a "serialize" and "deserialize" schema for the same type would add it to the definitions twice, even if those schemas happen to be identical.
Changes requires in schemars_derive:
- For "serialize" schemas, act as if structs always have
deny_unknown_fields
enabled - Use field/variant
serialize_name()
/deserialize_name()
as appropriate (schemars currently always usesdeserialize_name
)- optimisation: only include the conditional in the derive output when the names are actually different (they usually won't be)
- exclude field/variants with
skip_serializing
/skip_deserializing
attributes as appropriate - For "serialize" schemas, do not add fields with
skip_serializing_if
torequired
(but DO add schemas withdefault
) - For "deserialize" schemas, do not add fields with
default
torequired
(but DO add schemas withskip_serializing_if
) - When a struct has a
from
/into
(ortry_from
) serde attribute, forward the "deserialize"/"serialize" (respectively) schema generation to that type - For "deserialize" schemas, add
alias
es to the schema's allowed properties- aliases are mutually exclusive - i.e. the schema should disallow objects with both the inner property name AND an alias. What's the best way to represent this in JSON schema, particularly for optional properties?
I like how you started with the conceptual definitons. Probably makes sense to have clarity on those “guiding principles” before moving to specific API decisions like &mut
settings.
Re. “the Serialize schema generated for a type T should successfully validate a JSON value if and only if there exists some possible value of T that serializes to that JSON value”, there’s also the component of change over time :) additionalProperties: false
is definitely not a guarantee I would want to make to my API consumers.
I wonder if a instead of Serialize
/DeserializeStrict
/DeserializeLoose
there can just be a dozen different settings and then perhaps a few presets for them as a starting point/common use cases/best practices, but ultimately not super deeply permeated through the whole system and easy to evolve/customize. Is that a cop out?
Re. “the Serialize schema generated for a type T should successfully validate a JSON value if and only if there exists some possible value of T that serializes to that JSON value”, there’s also the component of change over time :) additionalProperties: false is definitely not a guarantee I would want to make to my API consumers.
That's interesting, I hadn't considered that scenario. Although obviously, we have to limit what we say would be "acceptable" as a change over time. Adding new properties is probably reasonable, hence we can allow unknown properties generally (as we do today for structs without deny_unknown_fields
). Changing the type of an existing property is probably less reasonably, because if we accepted that, then we would have to make every property in the schema valid for any type, which would be so unspecific as to make it pretty much useless.
So that specific scenario (allowing future/unknown properties) would be possible if we either had the SchemaSettings:deny_unknown_fields
flag from the example above but made it also apply to "serialize" schemas. Or alternatively, rather than adding the flag at all, just keep the current behaviour of allowing unknown properties for types without the deny_unknown_fields
attribute
I wonder if a instead of Serialize/DeserializeStrict/DeserializeLoose there can just be a dozen different settings and then perhaps a few presets for them as a starting point/common use cases/best practices
That's essentially the "Separate flags on SchemaSettings" alternative design I mentioned above. I'm concerned about how easy it would be to scale and reason about how they would all interact with each other given the high number of serde attributes that we need to consider. We would also need to consider how to deal with types that implement JsonSchema
manually, without deriving it. e.g. BigDecimal
is always serialized as a string, but can be deserialized from a string or a number - so which setting would control whether the schema allows it to be a number?
This is now implemented in 1.0.0-alpha.15, and documented at https://graham.cool/schemars/generating/#serialize-vs-deserialize-contract. TLDR: by default it generates "deserialize" schemas, if you instead want a "serialize" schema then do this:
let schema = SchemaSettings::default()
.for_serialize()
.into_generator()
.into_root_schema_for::<MyStruct>();
However, #[derive(JsonSchema)]
does not yet handle the alias
attribute on variants or fields! Adding those enhancements will be tracked in #25 and #338, respectively. Supporting it for enum variants should be relatively straightforward, since a variant alias can behave as just another variant with the same "payload" as the original variant. But supporting it on fields is a bit more complex, and so I would be interested in hearing what behaviour consumers would want in this scenario.