GREsau/schemars

Support generating multiple schemas (for serialize vs deserialize) from the same struct

untitaker opened this issue · 9 comments

I have a lot of types whose deserialization behavior allows for more types than the serialization does. For example:

// Deserialize is implemented such that any non-string JSON object is stringified
#[derive(Serialize)]
struct LenientString(String);

#[derive(JsonSchema, Serialize, Deserialize)]
struct Foo {
    bar: LenientString
}

Therefore I would like to generate two schemas per type: A schema describing the input that is allowed, and a schema that describes the output that is guaranteed.

Is this kind of feature something you'd be interested in merging? I am on the fence as to whether this is sensible. In some languages I would've created separate input and output types, but as serde allowed me to do everything with one type this is what I ended up with. But I think that adding this feature to schemars is easier than refactoring my existing codebase.

Interesting idea, this could indeed be useful. I would certainly consider altering the behaviour of json_schema() to return the "deserialize" schema, and having a separate function to return the "serialize" schema (or vice versa). My main concern would be how much this increases the output binary size and/or compile times, but it may not be a big problem. If it is, it may be worth making it optional via a feature flag

This would also fix #25, since aliases are only significant when deserializing

Has there been any discussion on how this could be implemented?

As far as I can tell, generating examples is done via Serde serialize. This would mean, it's necessary to create at least one copy (as in generating a new struct at compiletime in a macro, not copying in the sense of struct.copy()) of the struct that's deriving JsonSchema to generate proper examples when #[serde(skip_serializing)] is used. Otherwise those fields would be missing in a deserialize schema example.

Some thoughts I've had on how to deal with this:

Related issues:

The current state of play

Example 1

#[derive(JsonSchema, Serialize, Deserialize)]
pub struct MyStruct {
    #[serde(alias = "my_i32")]
    pub my_int: i32,
    pub my_bool: bool,
}

The current schema produced by schemars is:

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "MyStruct",
  "type": "object",
  "properties": {
    "my_bool": {
      "type": "boolean"
    },
    "my_int": {
      "type": "integer",
      "format": "int32"
    }
  },
  "required": ["my_int", "my_bool"]
}

This schema does not accurately describe the deserialization contract of the type when using serde_json, because it would not successfully validate a value using the my_i32 alias even though it would be accepted by serde, e.g.

{
  "my_i32": 123,
  "my_bool": false
}

While the schema does "correctly" describe the serialization contract of the type (in that all possible serialized values would successfully validate against the schema), the schema is not "specific" because it also allows unknown properties. The schema could be made more specific by including "additionalProperties": false (or "unevaluatedProperties": false), but this would arguably make the schema diverge further from the correct deserialization contract, which accepts (but ignores) unknown properties.

Example 2

#[derive(JsonSchema, Serialize, Deserialize)]
pub struct MyStruct2 {
    #[serde(skip_serializing)]
    pub write_only: i32,
    #[serde(skip_deserializing)]
    pub read_only: bool,
}
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "MyStruct2",
  "type": "object",
  "properties": {
    "read_only": {
      "type": "boolean",
      "default": false,
      "readOnly": true
    },
    "write_only": {
      "type": "integer",
      "format": "int32",
      "writeOnly": true
    }
  },
  "required": ["write_only"]
}

This uses the readOnly and writeOnly JSON schema keywords, which are part of the meta-data vocabulary, so (as far as I know) have no effect when validating a value against a schema.

This schema arguably does not quite accurately describe the deserialization contract because it restricts the read_only property to a boolean, even though a value like {"read_only":"string", "write_only":0} would be allowed by serde (but only because the read_only property string value is treated the same as an unknown keyword).

And this schema does not accurately describe the serialization contract because it includes write_only as a required property, even though that property is never included in the serialized value, causing all possible serialized values to fail validation against the schema.

In summary: Schemars currently isn't sure whether it should be describing the serialization or deserialization contract for a type, so it ends up doing both of them quite badly.

Potential Solution - Extending SchemaSettings

When creating a SchemaSettings, the consumer could select which contract they would like the generated schema to describe via new field/enum e.g.

pub struct SchemaSettings {
  /* snip: existing SchemaSettings fields... */

  pub contract: SchemaContract,
}

pub enum SchemaContract {
  Serialize,
  Deserialize,
}

let settings = SchemaSettings::draft2020_12().with(|s| {
    s.contract = SchemaContract::Deserialize;
});
// Or maybe add a helper method like:
// let settings = SchemaSettings::draft2020_12().for_deserialize();
let gen = settings.into_generator();
let schema = gen.into_root_schema_for::<MyStruct>();

contract is the first name that came to mind but I'm open to other ideas - maybe SerializeMode or something? Naming things is hard!

We would then need to update implementations of JsonSchema::json_schema() to respect the new setting where necessary - most of this work would be in schemars_derive.

SchemaContract::Serialize should be fairly straightforward - the generated schema should (this is more of a target than a guarantee):

  • successfully validate any instance of the type having been serialized to JSON
  • fail validation for any JSON value for which no instance of the rust type would serialize to

In other words, the Serialize schema generated for a type T should successfully validate a JSON value if and only if there exists some possible value of T that serializes to that JSON value.

But what would SchemaContract::Deserialize mean? The most objective way to define it would be that the generated schema should successfully validate a JSON value if and only if serde_json would successfully deserialize that value to the relevant rust type. This would then continue to allow unknown fields for structs without the deny_unknown_fields attribute.

But should SchemaContract::Deserialize be stricter? Should it try to just match values that "usefully" deserialize, rather than ones that "successfully" deserialize? Or maybe it should be configurable - we could do this by any of:

  • Splitting SchemaContract::Deserialize into separate SchemaContract::DeserializeLoose and SchemaContract::DeserializeStrict
  • Making SchemaContract::Deserialize a struct-style variant with a deny_unknown_fields flag
  • Adding a deny_unknown_fields flag to SchemaSettings, where the flag is only relevant when using SchemaContract::Deserialize

Alternative designs

Separate flags on SchemaSettings

pub struct SchemaSettings {
  /* snip: existing SchemaSettings fields... */

  /// Whether `skip_serializing` fields are included in schema
  pub include_readonly: bool,

  /// Whether `skip_deserializing` fields are included in schema
  pub include_writeonly: bool,

  /// Whether the schema accepts aliases as property names
  pub include_aliases: bool,

  /// If `true`, schema will not allow properties that would be ignored during deserialization,
  /// even if they would not cause deserialization to fail
  pub deny_unknown_fields: bool,
}

This gives more control to consumers about exactly what they want in the generated schema, but with more control comes more complexity. The flags in the example above wouldn't even be sufficient, e.g. they don't surface any way of specifying whether the schema should use the "serialize" or "deserialize" name for fields that set them differently.

Separate trait methods on JsonSchema

Instead of adding new properties to SchemaSettings, we could have separate methods on the JsonSchema trait, e.g.

pub trait JsonSchema {
  fn json_schema_for_deserialize(gen: &mut SchemaGenerator) -> Schema;
  fn json_schema_for_serialize(gen: &mut SchemaGenerator) -> Schema;
}

...and the existing json_schema method could either be removed, or kept as generating a "hybrid" schema as it does today.

I don't like this option mostly because it practically doubles the amount of work it takes to implement JsonSchema (whether it's manually implemented or macro-derived), increasing code complexity and compile times. Even when deriving JsonSchema on a type with no special serialize/deserialize behaviour (i.e. no alias/skip_serializing attributes etc.), it would need two separate implementations in order to call the correct json_schema_for_xxx method for each of its fields.

So, let's say for sake of argument that we go with extending the schemars API like so:

pub struct SchemaSettings {
  /* snip: existing SchemaSettings fields... */

  pub contract: SchemaContract,
  pub deny_unknown_fields: bool,
}

pub enum SchemaContract {
  Serialize,
  Deserialize,
}

Then what would the schema for the examples in the previous comment look like?

Example 1

#[derive(JsonSchema, Serialize, Deserialize)]
pub struct MyStruct {
    #[serde(alias = "my_i32")]
    pub my_int: i32,
    pub my_bool: bool,
}

When using SchemaContract::Serialize:

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "MyStruct",
  "type": "object",
  "properties": {
    "my_int": {
      "type": "integer",
      "format": "int32"
    },
    "my_bool": {
      "type": "boolean"
    }
  },
  "required": [
    "my_int",
    "my_bool"
  ],
  "additionalProperties": false /* could use "unevaluatedProperties" instead of "additionalProperties" */
}

When using SchemaContract::Deserialize with deny_unknown_fields: false:

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "MyStruct",
  "type": "object",
  "properties": {
    "my_bool": {
      "type": "boolean"
    }
  },
  "oneOf": [
    {
      "properties": {
        "my_int": {
          "type": "integer",
          "format": "int32"
        }
      },
      "required": [
        "my_int"
      ]
    },
    {
      "properties": {
        "my_i32": {
          "type": "integer",
          "format": "int32"
        }
      },
      "required": [
        "my_i32"
      ]
    }
  ],
  "required": [
    "my_bool"
  ]
}

Or, it could include the alias schemas in the top-level properties:

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "MyStruct",
  "type": "object",
  "properties": {
    "my_int": {
      "type": "integer",
      "format": "int32"
    },
    "my_i32": {
      "type": "integer",
      "format": "int32"
    },
    "my_bool": {
      "type": "boolean"
    }
  },
  "oneOf": [
    {
      "required": [
        "my_int"
      ]
    },
    {
      "required": [
        "my_i32"
      ]
    }
  ],
  "required": [
    "my_bool"
  ]
}

When using SchemaContract::Deserialize with deny_unknown_fields: true, the only difference is that we would need to set additionalProperties or unevaluatedProperties. Putting the alias schemas inside the oneOf means it would have to be unevaluatedProperties:

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "MyStruct",
  "type": "object",
  "properties": {
    "my_bool": {
      "type": "boolean"
    }
  },
  "oneOf": [
    {
      "properties": {
        "my_int": {
          "type": "integer",
          "format": "int32"
        }
      },
      "required": [
        "my_int"
      ]
    },
    {
      "properties": {
        "my_i32": {
          "type": "integer",
          "format": "int32"
        }
      },
      "required": [
        "my_i32"
      ]
    }
  ],
  "required": [
    "my_bool"
  ],
  "unevaluatedProperties": false
}

By putting them in the top-level properties, we can use additionalProperties OR unevaluatedProperties:

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "MyStruct",
  "type": "object",
  "properties": {
    "my_int": {
      "type": "integer",
      "format": "int32"
    },
    "my_i32": {
      "type": "integer",
      "format": "int32"
    },
    "my_bool": {
      "type": "boolean"
    }
  },
  "oneOf": [
    {
      "required": [
        "my_int"
      ]
    },
    {
      "required": [
        "my_i32"
      ]
    }
  ],
  "required": [
    "my_bool"
  ],
  "additionalProperties": false  /* could use "unevaluatedProperties" instead of "additionalProperties" */
}

Example 2

#[derive(JsonSchema, Serialize, Deserialize)]
pub struct MyStruct2 {
    #[serde(skip_serializing)]
    pub write_only: i32,
    #[serde(skip_deserializing)]
    pub read_only: bool,
}

When using SchemaContract::Serialize, the write_only would be omitted entirely:

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "MyStruct",
  "type": "object",
  "properties": {
    "read_only": {
      "type": "boolean",
      "default": false,
      "readOnly": true
    }
  },
  "required": [
    "read_only"
  ],
  "additionalProperties": false  /* could use "unevaluatedProperties" instead of "additionalProperties" */
}

Conversely, SchemaContract::Deserialize would omit read_only. So with deny_unknown_fields: false:

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "MyStruct",
  "type": "object",
  "properties": {
    "write_only": {
      "type": "integer",
      "format": "int32",
      "writeOnly": true
    }
  },
  "required": [
    "write_only"
  ]
}

And with deny_unknown_fields: true:

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "MyStruct",
  "type": "object",
  "properties": {
    "write_only": {
      "type": "integer",
      "format": "int32",
      "writeOnly": true
    }
  },
  "required": [
    "write_only"
  ],
  "additionalProperties": false  /* could use "unevaluatedProperties" instead of "additionalProperties" */
}

Although something else to keep in mind is how to support use-cases that want to generate schemas for both the "serialize" and "deserialize" contract, for example when generating an openapi document.

Currently, there's no way to modify the SchemaSettings once they become part of a SchemaGenerator. But I don't see any major problems with introducing an method like SchemaGenerator::settings_mut(&mut self) -> &mut SchemaSettings, which would then allow generating some "serialize" schemas and some "deserialize" schemas with the same generator (by mutating its settings).

But we would still have a problem if you try to generate both a "serialize" and "deserialize" schema for the same type - the first one you generate would add the schema to the definitions ($defs), and then any future schemas for that type would just re-use the earlier schema via a $ref, even if the contract has changed. So, we would also need to make the generator "remember" the contract that was used for each schema that has been added to its definitions. I don't think that would be too hard - it should mostly just be altering the schema_id_to_name field to be keyed off of a tuple containing both the schema ID and the contract. A minor downside of this strategy would be that generating both a "serialize" and "deserialize" schema for the same type would add it to the definitions twice, even if those schemas happen to be identical.

Changes requires in schemars_derive:

  • For "serialize" schemas, act as if structs always have deny_unknown_fields enabled
  • Use field/variant serialize_name()/deserialize_name() as appropriate (schemars currently always uses deserialize_name)
    • optimisation: only include the conditional in the derive output when the names are actually different (they usually won't be)
  • exclude field/variants with skip_serializing/skip_deserializing attributes as appropriate
  • For "serialize" schemas, do not add fields with skip_serializing_if to required (but DO add schemas with default)
  • For "deserialize" schemas, do not add fields with default to required (but DO add schemas with skip_serializing_if)
  • When a struct has a from/into (or try_from) serde attribute, forward the "deserialize"/"serialize" (respectively) schema generation to that type
  • For "deserialize" schemas, add aliases to the schema's allowed properties
    • aliases are mutually exclusive - i.e. the schema should disallow objects with both the inner property name AND an alias. What's the best way to represent this in JSON schema, particularly for optional properties?

I like how you started with the conceptual definitons. Probably makes sense to have clarity on those “guiding principles” before moving to specific API decisions like &mut settings.

Re. “the Serialize schema generated for a type T should successfully validate a JSON value if and only if there exists some possible value of T that serializes to that JSON value”, there’s also the component of change over time :) additionalProperties: false is definitely not a guarantee I would want to make to my API consumers.

I wonder if a instead of Serialize/DeserializeStrict/DeserializeLoose there can just be a dozen different settings and then perhaps a few presets for them as a starting point/common use cases/best practices, but ultimately not super deeply permeated through the whole system and easy to evolve/customize. Is that a cop out?

Re. “the Serialize schema generated for a type T should successfully validate a JSON value if and only if there exists some possible value of T that serializes to that JSON value”, there’s also the component of change over time :) additionalProperties: false is definitely not a guarantee I would want to make to my API consumers.

That's interesting, I hadn't considered that scenario. Although obviously, we have to limit what we say would be "acceptable" as a change over time. Adding new properties is probably reasonable, hence we can allow unknown properties generally (as we do today for structs without deny_unknown_fields). Changing the type of an existing property is probably less reasonably, because if we accepted that, then we would have to make every property in the schema valid for any type, which would be so unspecific as to make it pretty much useless.

So that specific scenario (allowing future/unknown properties) would be possible if we either had the SchemaSettings:deny_unknown_fields flag from the example above but made it also apply to "serialize" schemas. Or alternatively, rather than adding the flag at all, just keep the current behaviour of allowing unknown properties for types without the deny_unknown_fields attribute

I wonder if a instead of Serialize/DeserializeStrict/DeserializeLoose there can just be a dozen different settings and then perhaps a few presets for them as a starting point/common use cases/best practices

That's essentially the "Separate flags on SchemaSettings" alternative design I mentioned above. I'm concerned about how easy it would be to scale and reason about how they would all interact with each other given the high number of serde attributes that we need to consider. We would also need to consider how to deal with types that implement JsonSchema manually, without deriving it. e.g. BigDecimal is always serialized as a string, but can be deserialized from a string or a number - so which setting would control whether the schema allows it to be a number?

This is now implemented in 1.0.0-alpha.15, and documented at https://graham.cool/schemars/generating/#serialize-vs-deserialize-contract. TLDR: by default it generates "deserialize" schemas, if you instead want a "serialize" schema then do this:

let schema = SchemaSettings::default()
        .for_serialize()
        .into_generator()
        .into_root_schema_for::<MyStruct>();

However, #[derive(JsonSchema)] does not yet handle the alias attribute on variants or fields! Adding those enhancements will be tracked in #25 and #338, respectively. Supporting it for enum variants should be relatively straightforward, since a variant alias can behave as just another variant with the same "payload" as the original variant. But supporting it on fields is a bit more complex, and so I would be interested in hearing what behaviour consumers would want in this scenario.