Stranger6667/jsonschema-rs

simd support?

Opened this issue · 6 comments

Already, jsonschema-rs is quite performant.

However, have you looked into using crates like simd-json or simdutf8 to make it even faster?

Yes,

I am actively looking into these things and wanted to publish a design document to get feedback on the implementation. It is also, in a sense, a roadmap to 1.0 and will contain at least the following areas:

  • Keywords layout. As described in #212. I started a complete rewrite in a separate repo and just this change yields ~50% validation time reduction in some benchmarks + simplifies the code dramatically (it also uses enum_dispatch). It is incomplete, but it unlocks a lot - for example, there will be no need for RwLock in $ref as it will be possible to evaluate them at the compilation phase.
  • Custom input types. It seems like the way to support the crates you mentioned + other external types (like Python ones). Not sure what would be the best way to do so :( my attempts to wrap serde_json::Value without sacrificing too much were not successful.
  • Real error iterator. Now there are tons of unnecessary allocations on each validate call + all the flat_map calls are responsible for long compile times (according to llvm-lines). I'd like to have some tree iterator that doesn't allocate intermediate vectors - not sure about the right way to suspend/resume such a process. Maybe a separate state machine transitions table would work for this.
  • Avoid extra costs of SchemaNode - it is not needed for is_valid and validate calls, but adds extra overhead.
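To make the first and third points more concrete, here is a minimal sketch of what enum-based keyword dispatch and a lazy error iterator could look like. It is not the code from the rewrite: `Value`, `Keyword`, `Errors` and `iter_errors` are hypothetical names, `Value` is a tiny stand-in for `serde_json::Value`, and the `match` is written by hand rather than generated by enum_dispatch. The iterator keeps one explicit stack and yields errors one at a time instead of collecting a vector per keyword.

```rust
// Hypothetical stand-in for serde_json::Value, kept tiny so the sketch has no dependencies.
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Number(f64),
    String(String),
    Array(Vec<Value>),
}

// Each keyword is an enum variant, so dispatch is a `match` rather than a
// virtual call through a boxed trait object. All names here are assumptions.
#[derive(Debug)]
pub enum Keyword {
    Minimum(f64),
    MaxLength(usize),
    Items(Vec<Keyword>), // every array element must satisfy all of these
}

impl Keyword {
    pub fn is_valid(&self, instance: &Value) -> bool {
        match (self, instance) {
            (Keyword::Minimum(min), Value::Number(n)) => n >= min,
            (Keyword::MaxLength(max), Value::String(s)) => s.chars().count() <= *max,
            (Keyword::Items(keywords), Value::Array(items)) => items
                .iter()
                .all(|item| keywords.iter().all(|k| k.is_valid(item))),
            // A keyword does not constrain instances of other types.
            _ => true,
        }
    }
}

// Lazy error iterator: pending (keyword, instance) pairs live on one stack,
// so the traversal can stop after any error and resume on the next call.
pub struct Errors<'a> {
    stack: Vec<(&'a Keyword, &'a Value)>,
}

pub fn iter_errors<'a>(schema: &'a [Keyword], instance: &'a Value) -> Errors<'a> {
    Errors {
        // Reversed so that keywords are popped in schema order.
        stack: schema.iter().rev().map(|k| (k, instance)).collect(),
    }
}

impl<'a> Iterator for Errors<'a> {
    type Item = String;

    fn next(&mut self) -> Option<String> {
        while let Some((keyword, instance)) = self.stack.pop() {
            match (keyword, instance) {
                (Keyword::Items(keywords), Value::Array(items)) => {
                    // Push children in reverse so they are visited in array order.
                    for item in items.iter().rev() {
                        for k in keywords.iter().rev() {
                            self.stack.push((k, item));
                        }
                    }
                }
                (Keyword::Minimum(min), Value::Number(n)) if n < min => {
                    return Some(format!("{n} is less than the minimum of {min}"));
                }
                (Keyword::MaxLength(max), Value::String(s)) if s.chars().count() > *max => {
                    return Some(format!("{s:?} is longer than {max} characters"));
                }
                _ => {}
            }
        }
        None
    }
}
```

The stack still allocates once per `iter_errors` call, so this only removes the per-node intermediate vectors; a real design would also need to track instance and schema paths for each error.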

I expect to have it in a few days and it is roughly my roadmap for this lib :) I'd appreciate it if you could share your thoughts on this or describe your use case for integrating the crates you mentioned.

Sorry I didn't get back earlier, but thanks for your thorough response!

I don't know enough about your implementation to comment cogently on your points, but the details I can tease out indicate that there's a lot of headroom the library can exploit to squeeze out more performance.

I'm looking forward to the design document!

What I can contribute are my use-cases.

Currently, I'm using jsonschema-rs to validate CSV files (that's why I originally asked about #339), and after adding rayon, the performance is already quite impressive.

jqnatividad/qsv#164

But as the flamegraph shows, any incremental performance from jsonschema will further accelerate qsv's validate cmd.
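For readers unfamiliar with the pattern, row-level parallel validation can be sketched roughly as follows. This is not qsv's code: it uses `std::thread::scope` as a dependency-free stand-in for rayon, and `row_is_valid` and `invalid_rows` are hypothetical names, with `row_is_valid` taking the place of a compiled schema's validity check.

```rust
use std::thread;

// Hypothetical row check standing in for a compiled JSON Schema's validity test.
// Assumption for illustration: a valid row has exactly three comma-separated fields.
fn row_is_valid(row: &str) -> bool {
    row.split(',').count() == 3
}

// Validate rows on up to `workers` threads and return the indices of invalid rows,
// in their original order.
pub fn invalid_rows(rows: &[&str], workers: usize) -> Vec<usize> {
    let workers = workers.max(1);
    // Ceiling division; at least 1 because `chunks(0)` would panic.
    let chunk_size = ((rows.len() + workers - 1) / workers).max(1);

    thread::scope(|scope| {
        let handles: Vec<_> = rows
            .chunks(chunk_size)
            .enumerate()
            .map(|(chunk_idx, chunk)| {
                scope.spawn(move || {
                    chunk
                        .iter()
                        .enumerate()
                        .filter(|(_, row)| !row_is_valid(row))
                        // Convert the chunk-local index back to a global row index.
                        .map(|(i, _)| chunk_idx * chunk_size + i)
                        .collect::<Vec<usize>>()
                })
            })
            .collect();

        // Joining in spawn order keeps the result sorted by row index.
        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("worker panicked"))
            .collect()
    })
}
```

Because each row is validated independently, any per-call speedup in the validator multiplies across all rows, which is why validator performance shows up so prominently in this kind of workload.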

I plan to leverage the qsv validate command in another project - https://github.com/dathere/datapusher-plus to validate CSV files before they are uploaded to CKAN.

@Stranger6667

#212. I started a complete rewrite in a separate repo and just this change yields ~50% validation time reduction in some benchmarks + simplifies the code dramatically (it also uses enum_dispatch). It is incomplete, but it unlocks a lot - for example, there will be no need for RwLock in $ref as it will be possible to evaluate them at the compilation phase.

That sounds really interesting! Is that repo publicly available?

@manuschillerdev I added it as a separate crate here - #373 :) It is a prototype, but ref resolving is more or less ready

Btw, @jqnatividad thanks for sharing your use case! I hope that soon we all can benefit from faster validation! :)

The changes are quite large, though, and I'd appreciate any help there :)

@Stranger6667 I'll start testing the jsonschema-csr prototype and will let you know my findings!

I need to update qsv's benchmarks soonish and I'll be sure to include the prototype in it when I do.

And once I grok the internals, you can be sure I'll try to help as best as I can.

@jqnatividad Thank you! The currently submitted version is not working yet, but I am slowly working on it :)