Stranger6667/jsonschema-rs

Roadmap for 1.0

Closed this issue · 8 comments

This is a live document where I’d like to put my thoughts about the future 1.0 release.

Spec support

Supporting all recent drafts would certainly be nice, but they can be added later on, as doing so won't break any compatibility.

Public API

Validation

// Lazily-evaluated validation results
let result = jsonschema::validate(&schema, &instance).expect("Invalid schema");
let is_valid = result.is_valid();
// Iterate over errors
for error in result.errors() {
    println!("{}", error);
}
// Different output formats
let verbose: serde_json::Value = result.format(jsonschema::formats::Verbose);
let basic: serde_yaml::Value = result.format(jsonschema::formats::Basic);

Auto-detect spec:

// Non-blocking
let validator = jsonschema::Validator::from_schema(&schema)
    .await
    .expect("Invalid schema");
// Blocking
let validator = jsonschema::blocking::Validator::from_schema(&schema)
    .expect("Invalid schema");

Specific variant:

let validator = jsonschema::Draft4Validator::from_schema(&schema)
    .await
    .expect("Invalid schema");

Configure validator:

let validator = jsonschema::blocking::Validator::options()
    // I.e. a resolver that forbids references
    .with_resolver(MyResolver::new())
    // Custom validator for the "format" keyword
    .with_format("card_number", CardNumberFormat::new())
    // Completely custom behavior for the `my-keyword` keyword
    .with_keyword("my-keyword", CustomKeywordValidator::new(42))
    .build(&schema)
    .expect("Invalid schema");

Every validation result is lazy and is cached under the hood to avoid re-validation.
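One way the "lazy and cached" behavior could look, sketched with std's OnceCell so the expensive error computation runs at most once. The types here are stand-ins, not the proposed API:

```rust
use std::cell::OnceCell;

// Hypothetical sketch: errors are computed on first access and
// cached, so repeated is_valid()/errors() calls don't re-validate.
struct ValidationResult {
    instance_value: i64, // stand-in for the instance being validated
    errors: OnceCell<Vec<String>>,
}

impl ValidationResult {
    fn new(instance_value: i64) -> Self {
        Self { instance_value, errors: OnceCell::new() }
    }

    fn errors(&self) -> &Vec<String> {
        self.errors.get_or_init(|| {
            // The expensive validation happens here, at most once.
            if self.instance_value < 0 {
                vec![format!("{} is less than the minimum of 0", self.instance_value)]
            } else {
                Vec::new()
            }
        })
    }

    fn is_valid(&self) -> bool {
        self.errors().is_empty()
    }
}

fn main() {
    let ok = ValidationResult::new(5);
    assert!(ok.is_valid());
    let bad = ValidationResult::new(-3);
    assert!(!bad.is_valid());
    // Second call hits the cache instead of re-validating.
    assert_eq!(bad.errors().len(), 1);
    println!("lazy validation sketch works");
}
```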

Generally, the wrapper needs access to the schema, and holding a reference there is ideal, but for a shortcut API like jsonschema::validate it should be owned.

Output formatting seems to be serde-specific, but it may be possible to provide a way to use something else via a custom trait impl.

Another thing is that validation may fail because of a schema error - this should be semantically separate from errors in the input instance.

Errors should be lifetime-independent, or at least there should be a way to make them owned (there is a pub(crate) method for this in the current impl); otherwise, moving errors around is painful.
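A minimal sketch of both ideas above - schema errors kept semantically separate from instance errors, and an into_owned escape hatch for lifetime-bound errors. All names here are hypothetical, not the crate's actual API:

```rust
use std::borrow::Cow;

// Hypothetical sketch: schema problems and instance problems are
// distinct variants, so callers can match on them separately.
#[derive(Debug)]
enum Error<'a> {
    /// The schema itself is malformed (e.g. `"minimum": "not a number"`).
    Schema { message: String },
    /// The instance failed validation against a valid schema.
    Validation { message: String, path: Cow<'a, str> },
}

impl<'a> Error<'a> {
    // Converting to 'static detaches the error from the instance's
    // lifetime, so it can be stored or sent across threads.
    fn into_owned(self) -> Error<'static> {
        match self {
            Error::Schema { message } => Error::Schema { message },
            Error::Validation { message, path } => Error::Validation {
                message,
                path: Cow::Owned(path.into_owned()),
            },
        }
    }
}

fn main() {
    let borrowed_path: &str = "/a/b";
    let err = Error::Validation {
        message: "5 is less than the minimum of 10".into(),
        path: Cow::Borrowed(borrowed_path),
    };
    // After into_owned, the error no longer borrows from anything.
    let owned: Error<'static> = err.into_owned();
    match owned {
        Error::Validation { path, .. } => assert_eq!(path, "/a/b"),
        Error::Schema { .. } => unreachable!(),
    }
    println!("error sketch works");
}
```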

Compilation & options

I am slightly concerned about using "compilation" as the name for this step. It is a bit confusing in the context of code compilation; maybe "build" / "builder" / "building" would be clearer.

Also, I think removing the with / should prefixes from options would make the process more readable. An example from another crate: Url::options().base_url(...).parse(...)

JSONSchema -> Validator. The current name does not follow Rust's naming conventions, and JsonSchema would collide with schemars::JsonSchema, which is often used in the same codebase. The new name is also less repetitive.

Another idea: user-provided memory. E.g., if the user passes arrays of sufficient length (for nodes, etc.), then the whole Schema can live on the stack, which could potentially give quite a big performance benefit.
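The user-provided-memory idea could be sketched roughly like this: the caller hands over a buffer, the builder fills it, and the schema only borrows it - so a small schema never touches the heap. Everything here (Node, Schema::build) is illustrative, not a proposed API:

```rust
// Hypothetical sketch: the caller supplies node storage, so a small
// schema can be compiled entirely into a stack-allocated buffer.
#[derive(Clone, Copy, Default, Debug, PartialEq)]
enum Node {
    #[default]
    Unused,
    Minimum(i64),
    Maximum(i64),
}

struct Schema<'mem> {
    nodes: &'mem [Node],
}

impl<'mem> Schema<'mem> {
    // `storage` is user-provided memory; the builder fills the slots
    // it needs and keeps only a shared borrow of them.
    fn build(storage: &'mem mut [Node]) -> Self {
        storage[0] = Node::Minimum(0);
        storage[1] = Node::Maximum(100);
        let nodes: &'mem [Node] = &storage[..2];
        Schema { nodes }
    }

    fn is_valid(&self, value: i64) -> bool {
        self.nodes.iter().all(|node| match node {
            Node::Unused => true,
            Node::Minimum(min) => value >= *min,
            Node::Maximum(max) => value <= *max,
        })
    }
}

fn main() {
    // The buffer lives on the caller's stack; no heap allocation.
    let mut buffer = [Node::Unused; 8];
    let schema = Schema::build(&mut buffer);
    assert!(schema.is_valid(50));
    assert!(!schema.is_valid(-1));
    assert!(!schema.is_valid(101));
    println!("stack-backed schema sketch works");
}
```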

Extending

I think it would be nice to expose the Validate trait (and maybe rename it to Keyword, not sure).
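A rough sketch of what an exposed Keyword trait could look like for users defining custom keywords. The trait shape, the stand-in Value type, and the Even keyword are all assumptions for illustration:

```rust
// Stand-in for a JSON value, to keep the sketch dependency-free.
#[derive(Debug)]
enum Value {
    Number(f64),
    String(String),
}

// Hypothetical public `Keyword` trait (a possible rename of the
// internal `Validate` trait); custom keywords implement it.
trait Keyword {
    fn is_valid(&self, instance: &Value) -> bool;
    fn error_message(&self, instance: &Value) -> String;
}

// A user-defined keyword: checks that a number is even.
struct Even;

impl Keyword for Even {
    fn is_valid(&self, instance: &Value) -> bool {
        match instance {
            Value::Number(n) => n.rem_euclid(2.0) == 0.0,
            // Non-numbers are ignored, like most numeric keywords.
            _ => true,
        }
    }

    fn error_message(&self, instance: &Value) -> String {
        format!("{:?} is not even", instance)
    }
}

fn main() {
    let keyword: Box<dyn Keyword> = Box::new(Even);
    assert!(keyword.is_valid(&Value::Number(4.0)));
    assert!(!keyword.is_valid(&Value::Number(3.0)));
    assert!(keyword.is_valid(&Value::String("anything".into())));
    println!("keyword trait sketch works");
}
```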

Extras

Error paths could be better - they should expose their inner chunks; otherwise, the user always gets a Vec of String via into_vec. It would also be nice to avoid allocations.
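A sketch of a path type that exposes its segments through a borrowing iterator, so callers can inspect chunks without the allocating into_vec round-trip. The names are hypothetical:

```rust
// Hypothetical sketch: a path stores typed segments and exposes
// them for iteration, instead of only a `Vec<String>` via `into_vec`.
#[derive(Debug, PartialEq)]
enum Segment<'a> {
    Property(&'a str),
    Index(usize),
}

struct ErrorPath<'a> {
    segments: Vec<Segment<'a>>,
}

impl<'a> ErrorPath<'a> {
    // Borrowing iterator: inspecting segments allocates nothing.
    fn iter(&self) -> impl Iterator<Item = &Segment<'a>> {
        self.segments.iter()
    }

    // JSON Pointer rendering is still available when a String is wanted.
    fn to_pointer(&self) -> String {
        let mut out = String::new();
        for segment in self.iter() {
            out.push('/');
            match segment {
                Segment::Property(name) => out.push_str(name),
                Segment::Index(idx) => out.push_str(&idx.to_string()),
            }
        }
        out
    }
}

fn main() {
    let path = ErrorPath {
        segments: vec![
            Segment::Property("items"),
            Segment::Index(3),
            Segment::Property("name"),
        ],
    };
    assert_eq!(path.to_pointer(), "/items/3/name");
    assert_eq!(path.iter().count(), 3);
    println!("error path sketch works");
}
```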

Internals

The main thing here is that I'd like to move away from boxing each individual keyword and use a single arena - it allows us to simplify ref resolving and move it to the schema compilation step. All other things, like strings, could be moved to some memory-efficient interner.
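The arena idea in miniature: keywords live contiguously in one Vec and refer to each other by index, which makes a resolved $ref just another index instead of a boxed pointer chase. This is a toy illustration, not the planned data model:

```rust
// Hypothetical sketch: instead of `Box<dyn Validate>` per keyword,
// all keywords live in one arena and refer to each other by index.
#[derive(Debug)]
enum KeywordNode {
    Minimum(f64),
    Maximum(f64),
    // A `$ref` resolved at compile time: just an index into the arena.
    Ref(usize),
}

struct Arena {
    nodes: Vec<KeywordNode>,
}

impl Arena {
    fn is_valid(&self, node: usize, value: f64) -> bool {
        match &self.nodes[node] {
            KeywordNode::Minimum(min) => value >= *min,
            KeywordNode::Maximum(max) => value <= *max,
            // Following a ref is an array lookup, not a hash-map resolve.
            KeywordNode::Ref(target) => self.is_valid(*target, value),
        }
    }
}

fn main() {
    let arena = Arena {
        nodes: vec![
            KeywordNode::Minimum(0.0),  // node 0
            KeywordNode::Maximum(10.0), // node 1
            KeywordNode::Ref(0),        // node 2 points back at node 0
        ],
    };
    assert!(arena.is_valid(0, 5.0));
    assert!(!arena.is_valid(1, 11.0));
    assert!(arena.is_valid(2, 3.0)); // resolved through the Ref
    println!("arena sketch works");
}
```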

Currently, we have all the output formatting internals together with boxed keywords which implies some performance hit. It would be better to keep them separately and use them only if output formats are used, i.e. it should be a zero-cost abstraction.

Another one is generic input, so we don't depend on the serde_json layout. This is a huge benefit for any bindings and will unblock direct validation of e.g. serde_yaml values. This change will also affect the possibility of exposing a C API, but I'm not yet sure about that.
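The generic-input idea, sketched as a minimal trait: validation logic is written against the trait, so any representation (serde_json, serde_yaml, an FFI wrapper) can plug in. The trait shape and the toy type are assumptions, not the crate's actual Json trait:

```rust
// Hypothetical sketch of a generic JSON accessor trait; the real
// trait would cover objects, arrays, etc.
trait Json {
    fn as_f64(&self) -> Option<f64>;
    fn as_str(&self) -> Option<&str>;
}

// A toy in-house representation implementing the trait.
enum Toy {
    Number(f64),
    Text(String),
}

impl Json for Toy {
    fn as_f64(&self) -> Option<f64> {
        match self {
            Toy::Number(n) => Some(*n),
            _ => None,
        }
    }
    fn as_str(&self) -> Option<&str> {
        match self {
            Toy::Text(s) => Some(s),
            _ => None,
        }
    }
}

// Validation logic depends only on the trait, not the representation.
fn meets_minimum<J: Json>(value: &J, minimum: f64) -> bool {
    // Non-numbers are ignored by numeric keywords, per the spec.
    value.as_f64().map_or(true, |n| n >= minimum)
}

fn main() {
    assert!(meets_minimum(&Toy::Number(5.0), 1.0));
    assert!(!meets_minimum(&Toy::Number(0.5), 1.0));
    assert!(meets_minimum(&Toy::Text("hi".into()), 1.0));
    println!("generic json sketch works");
}
```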

Project structure

Not necessarily needed for 1.0, but would help.

Ideas:

  • Workspace
  • Moving crates into /crates/
  • Split CLI into a separate crate. The library would be simpler: no cli feature -> no conditional compilation for wasm32-wasi.

Awesome stuff @Stranger6667 !

+1 on everything, and particularly spec support, the Validate trait, and having a separate CLI crate.

You may also want to specify an explicit MSRV policy and consider bumping it to a more current Rust as 1.56.1 is pretty old. Using a newer Rust also unlocks newer language features that JSONschema can leverage in your roadmap.

Finally, can't help but notice that crates.io is still on 0.16.1 and the repo is on 0.16.3. Any particular reason the newer releases are not published on crates.io?

@jqnatividad thank you for your feedback and kind words!

I think there will be a series of pre-releases to get a feel for how things work, and if everything is good, then I'd like to add the newer specs. Recently I made some progress on the rewrite, and I think things should work out with that approach, but I still need to re-check things like custom keywords, anchors, dynamic refs, etc.

> You may also want to specify an explicit MSRV policy and consider bumping it to a more current Rust as 1.56.1 is pretty old. Using a newer Rust also unlocks newer language features that JSONschema can leverage in your roadmap.

Great idea! Something like 1.60, I guess? What are the features you have in mind? Personally, I would be really happy with newer string formatting, but I'm probably missing some others :)

> Finally, can't help but notice that crates.io is still on 0.16.1 and the repo is on 0.16.3. Any particular reason the newer releases are not published on crates.io?

Sorry for the confusion. 0.16.3 is only in the Python bindings, and there were no significant changes in the jsonschema crate itself. I guess I'd need to make GH releases and add a bit more information to the README.

@Stranger6667 , jsonschema-rs is really amazing as is.

With it, qsv validate can validate a CSV against a non-trivial JSONschema at 300,000 records/sec with rayon! (jqnatividad/qsv#164)

Bumping the MSRV should make it even faster for "free", as the underlying Rust components have been updated. Inlined string formatting is great, and you should consider setting the MSRV to 1.65.0 - with its let-else and Generic Associated Types (6.5 years after the RFC was created!).

Just for the heck of it, I bumped the MSRV to 1.65.0 and ran cargo +nightly clippy -- -W clippy::pedantic, and it found a lot of candidates for inlined string formatting and manual let-elses.
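For readers unfamiliar with the two features being discussed, here they are side by side in a tiny example (let-else stabilized in Rust 1.65, inline format arguments in 1.58):

```rust
fn describe(input: Option<i32>) -> String {
    // `let else`: bind on a successful pattern match, or take the
    // divergent branch - no nested `if let` needed.
    let Some(n) = input else {
        return "no value".to_string();
    };
    // Inlined string formatting: `{n}` captures the variable
    // directly, instead of `format!("got {}", n)`.
    format!("got {n}")
}

fn main() {
    assert_eq!(describe(Some(7)), "got 7");
    assert_eq!(describe(None), "no value");
    println!("let-else sketch works");
}
```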

As for 0.16.3, you may want to reconsider publishing to crates.io, if anything because of the lazy_static => once_cell replacement and the numerous bumped dependencies.

@jqnatividad thank you! It is always pleasant to hear about performance, especially when it comes to such numbers! Hopefully, there could be some more improvements coming from jsonschema that will positively affect your use case :)

Aha, at this point it is hard to say whether there is a direct use case for GATs, but let-else is definitely nice! I will check it for future versions :)

> As for 0.16.3, you may want to reconsider publishing to crates.io. If anything, coz of the lazy_static=>once_cell replacement and the numerous bumped dependencies.

Oh, indeed you're right - I forgot that I did it! Will make a new release in the next few days :)

UPDATE: I've been working on the complete rewrite during the last month and have made some progress.

I am going to update the original issue comment accordingly.

So far, I have implemented a generic JSON trait that allows almost zero-overhead access to the underlying JSON implementation and additionally provides a way to choose the output type for numbers (along with the corresponding arbitrary-precision feature). Currently, I have implementations for serde_json::Value & pyo3::PyAny. There are still some parts to polish, but roughly, working with it looks like this:

fn works<J: Json>(value: &J) {
    let object = value.as_object().expect("Should be an object");
    assert!(object.get("a").is_some());
    assert!(object.get("b").is_none());
    let val = value
        .as_object()
        .and_then(|obj| obj.get("c"))
        .and_then(|k| k.as_string())
        .map(|s| s.borrow().to_owned())
        .expect("Key exists");
    assert_eq!(val, "d");
}

This approach will also help with using this library from C/C++ code by wrapping FFI pointers and implementing the Json trait for them.

Currently, I am working on referencing, taking great inspiration from Python's referencing library, which abstracts away all the ref-resolving differences among the specs. My goal is to make it work with the Json trait too, preserving the original representation, which makes the im approach taken in referencing infeasible.

Once referencing is done, I am going to work on the schema compiler, which will be built on top of Json + referencing. The idea is that we can resolve all references upfront (blocking/non-blocking for remote ones) and build a packed representation that does not depend on the original one. I think it will be enum dispatch plus a BFS / Eytzinger layout for fast traversal - likely something similar to what I used in my css-inline crate, as it showed nice performance results (even though the nodes there are not sorted properly). Iteration will then be quite easy to implement (I already implemented it for a similar data structure in css-inline), which will give truly lazy error iteration without tons of flat_map calls and collecting intermediate vectors.
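The packed-layout idea can be illustrated in miniature: nodes sit in one Vec in BFS order, each storing the index range of its children, so traversal is an iterative walk over contiguous memory with no boxing and no recursion. The node contents here are placeholders:

```rust
use std::collections::VecDeque;
use std::ops::Range;

// Hypothetical sketch of a BFS-ordered packed layout: a node's
// children occupy a contiguous index range in the same Vec.
struct Node {
    label: &'static str,
    children: Range<usize>,
}

struct Compiled {
    nodes: Vec<Node>,
}

impl Compiled {
    // Iterative BFS over the stored ranges; no per-node allocation
    // beyond the traversal queue itself.
    fn walk(&self) -> Vec<&'static str> {
        let mut order = Vec::new();
        let mut queue = VecDeque::from([0usize]);
        while let Some(idx) = queue.pop_front() {
            let node = &self.nodes[idx];
            order.push(node.label);
            queue.extend(node.children.clone());
        }
        order
    }
}

fn main() {
    let compiled = Compiled {
        nodes: vec![
            Node { label: "root", children: 1..3 },
            Node { label: "properties", children: 3..4 },
            Node { label: "required", children: 0..0 },
            Node { label: "minimum", children: 0..0 },
        ],
    };
    assert_eq!(compiled.walk(), ["root", "properties", "required", "minimum"]);
    println!("bfs layout sketch works");
}
```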

Then output formatting as per the JSON Schema spec.

Then custom keywords & format validators. With enum dispatch, I think it could be just a Custom variant that wraps a boxed dyn. I think it will also allow for a simpler PyO3 implementation, and since it will use the Json trait, it will be quite fast.
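The "Custom variant wrapping a boxed dyn" idea, sketched: built-in keywords get static enum dispatch, and only user-defined keywords pay for dynamic dispatch. Again, these names are illustrative:

```rust
// Hypothetical sketch: built-in keywords are enum variants (static
// dispatch); user-defined keywords go through one boxed trait
// object behind a `Custom` variant.
trait CustomKeyword {
    fn is_valid(&self, value: f64) -> bool;
}

enum Keyword {
    Minimum(f64),
    Maximum(f64),
    Custom(Box<dyn CustomKeyword>),
}

impl Keyword {
    fn is_valid(&self, value: f64) -> bool {
        match self {
            Keyword::Minimum(min) => value >= *min,
            Keyword::Maximum(max) => value <= *max,
            // Only this arm pays for dynamic dispatch.
            Keyword::Custom(inner) => inner.is_valid(value),
        }
    }
}

// A user-supplied keyword, e.g. registered via an options builder.
struct MultipleOfThree;

impl CustomKeyword for MultipleOfThree {
    fn is_valid(&self, value: f64) -> bool {
        value % 3.0 == 0.0
    }
}

fn main() {
    let keywords = vec![
        Keyword::Minimum(0.0),
        Keyword::Custom(Box::new(MultipleOfThree)),
    ];
    assert!(keywords.iter().all(|k| k.is_valid(9.0)));
    assert!(!keywords.iter().all(|k| k.is_valid(10.0)));
    println!("enum dispatch sketch works");
}
```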

In any event, I am not yet ready to share what I have been working on, but it will be a complete from-scratch rewrite after multiple attempts (hopefully this won't be another unfinished one).

P.S. And yes, it will support all current specs.

Is there a plan to support $ref to a local file, in addition to remote locations? For example, I have two schema files in the same dir, and I want to reference the second file from the first one. Example:

"$ref": "file://resources/schema/port/schema.json#/$defs/outPortType"

@djpesic yes, I want the resolvers to support the "file" scheme by default + have a way to build a custom resolver (which is possible with the current version too). Currently, it is supported via the resolve-file feature.
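A rough sketch of what a custom file-scheme resolver could look like. The Resolver trait and FileResolver type here are illustrative, not the crate's actual resolver API; a real implementation would also parse the loaded document and apply the fragment:

```rust
use std::fs;
use std::path::PathBuf;

// Hypothetical resolver trait; the real one returns parsed schemas,
// not raw strings.
trait Resolver {
    fn resolve(&self, uri: &str) -> Result<String, String>;
}

struct FileResolver {
    // Relative `file://` paths are resolved against this directory.
    base_dir: PathBuf,
}

impl Resolver for FileResolver {
    fn resolve(&self, uri: &str) -> Result<String, String> {
        let path = uri
            .strip_prefix("file://")
            .ok_or_else(|| format!("unsupported scheme: {uri}"))?;
        // The fragment (e.g. `#/$defs/outPortType`) would be applied
        // to the parsed document; here we only load the file part.
        let file_part = path.split('#').next().unwrap_or(path);
        fs::read_to_string(self.base_dir.join(file_part)).map_err(|e| e.to_string())
    }
}

fn main() {
    let dir = std::env::temp_dir();
    fs::write(dir.join("schema.json"), r#"{"type": "integer"}"#).unwrap();
    let resolver = FileResolver { base_dir: dir };
    let contents = resolver
        .resolve("file://schema.json#/$defs/outPortType")
        .unwrap();
    assert!(contents.contains("integer"));
    // Non-file schemes are rejected by this resolver.
    assert!(resolver.resolve("https://example.com/s.json").is_err());
    println!("file resolver sketch works");
}
```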

Closing in favour of #475