Typed messaging and validation
goodboy opened this issue · 14 comments
I was originally going to make a big post on `pydantic` and how we could offer typed messages using that very nice project, despite there being a couple of holdups for integration with `msgpack`.
However, it turns out just today an even faster, msgpack-specific project was released: `msgspec`
It claims to not only be faster than `msgpack-python` but also to support schema evolution and other niceties. It also has perf bumps when making multiple repeated encode/decode calls, which is exactly how we're currently using `msgpack` inside our `Channel`.
Overall there looks to be no downside and we'll get typed message semantics fast and free.
For reference, I'll leave a bunch of links I'd previously gathered regarding making `pydantic` work with `msgpack`:
- pydantic/pydantic#951
- https://pydantic-docs.helpmanual.io/usage/dataclasses/
- pydantic/pydantic#595
- fastapi/fastapi#1285
- https://github.com/MolSSI/QCElemental/blob/master/qcelemental/models/basemodels.py#L121
  - this is just adding a `BaseModel.serialize()` effectively which looks up a serialize method by name (eg. json, msgpack) but isn't really adding any "native feeling" support nor speed gains afaict.
TODO

- support for a `msgpack-python` custom type serializer for `pydantic.BaseModel` such that we just implicitly render with `.dict()` at pack time and load via `Model(**message)` at decode time?
- write ourselves a small bytes-length prefixed framing protocol for `msgspec` as per the comments in #212
  - example from a blog post on protobuf
- consider how we might wrap `trio.SocketStream` using something like `tricycle.BufferedReceiveStream`; @oremanj was nice enough to provide usage:

  ```python
  while header := await stream.receive_all_or_none(4):
      len, = struct.unpack("<I", header)
      # probably want to sanity-check len for not being unreasonably huge
      chunk = await stream.receive_exactly(len)
      # do something with chunk
  ```
- consider offering `msgspec` as an optional dependency if we end up liking it?
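The bytes-length prefixed framing idea above can be sketched with just the stdlib `struct` module. This is a minimal illustration rather than tractor code; the `frame`/`unframe` names are hypothetical, and the 4-byte little-endian header matches the `"<I"` unpacking in @oremanj's snippet:

```python
import struct

def frame(payload: bytes) -> bytes:
    # Prefix each message with a 4-byte little-endian length header.
    return struct.pack("<I", len(payload)) + payload

def unframe(buf: bytes) -> list[bytes]:
    # Split a byte buffer into zero or more complete frames; a partial
    # trailing frame is left for the next read.
    msgs, offset = [], 0
    while offset + 4 <= len(buf):
        (size,) = struct.unpack_from("<I", buf, offset)
        if offset + 4 + size > len(buf):
            break  # incomplete frame; wait for more bytes
        msgs.append(buf[offset + 4 : offset + 4 + size])
        offset += 4 + size
    return msgs

stream = frame(b"hello") + frame(b"world")
```

A real implementation would carry `unframe`'s leftover bytes into the next receive, which is exactly the bookkeeping `tricycle.BufferedReceiveStream` handles for us.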
That's really neat! I was looking at implementing Pydantic in a project a little while ago, and chose not to. It seemed like the API wasn't quite what I was looking for: I wanted data classes, and confidence that serialization and deserialization were both strict. I'm not quite sure why I concluded that, unfortunately. I knew about the data classes integration with Pydantic, but there was something missing with it that I felt I needed.
msgspec looks pretty cool for when you control the data format, but that definitely wasn't part of what I was doing. (I was writing an API wrapper over a JSON API).
I know many people have gotten a lot of mileage out of Pydantic. It's a great project.
Yeah, alternatively we've been thinking about using `capnproto`, and in particular seeing if we can auto-gen schema from type-annotated Python functions.

I think this would be a huge boon since we'd get CBS (capability-based security) for free.
The only holdup will be figuring out how `pycapnp` can work with async stuff and if it can help us with the schema gen/loading. There appears to now be `asyncio` support, but not sure how/if that will get in our way or if we can work off that impl to support `trio`.
Oh, also another notable project (for a `tractor` dependent that will likely soon be broken out into its own repo): there is `nptyping`, which may prove useful for automatic serialization of arrays.
Linking to jcrist/msgspec#25 since we'll likely need nested `Struct`s to make this easiest to implement (messages containing strictly typed payloads also defined as structs); otherwise there may need to be some finagling to either hack a standard message schema where payloads are decoded specifically as structs, or we'll need to just always decode to a `dict`. It would be better to have the former considering the supposed speed improvement:

> Depending on the schema, deserializing a message into a Struct can be roughly twice as fast as deserializing it into a dict.
> in particular seeing if we can auto-gen schema from type annotated Python functions.

Is there an issue for this?
Essentially, to do this we need to:

- Parse dataclasses and save `Field` attributes
- Feed this into networkx to build a graph with `child`, `isa` and `hasa` relationships
- Use the builder pattern over the networkx graph with a dialect (capnproto or protobuf etc.)
@gc-ss not yet specifically; feel free to make one of course if you have some ideas and/or want to try it out.

Also, I think this could easily be wrapped in an external repo for use as well; it doesn't have to be `tractor`-specific.
> Feed this into networkx to build a graph with `child`, `isa` and `hasa` relationships
@gc-ss wait why would you need this?
Afaiu graph relations aren't relevant here; are you talking about building nested structs as trees or?
> Afaiu graph relations aren't relevant here; are you talking about building nested structs as trees or?
Consider this:
```python
class A:
    a: int

class B(A):
    b: int

class C(A):
    c: int

class D:
    composes_c: C
```
Now if we wanted to auto-gen a schema for type D, we don't want to spit out B. Also, it's possible some schema libraries might want schemas to be ordered in a certain way depending on the dependency tree.

So you need graphs.
What do you think?
If this makes sense, I can move these into a different repo and send you a link.
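To make the "D shouldn't spit out B" point concrete, here's a rough sketch of that walk over the A/B/C/D example above (rewritten as dataclasses so field types are introspectable); a plain dict stands in for the networkx graph, and `type_graph` is a hypothetical name:

```python
from dataclasses import dataclass, fields, is_dataclass

@dataclass
class A:
    a: int

@dataclass
class B(A):
    b: int

@dataclass
class C(A):
    c: int

@dataclass
class D:
    composes_c: C

def type_graph(root):
    # Walk `isa` (base class) and `hasa` (composed field) edges from a
    # root type; networkx's DiGraph with labeled edges would replace this.
    edges = {}
    stack = [root]
    while stack:
        cls = stack.pop()
        if cls in edges:
            continue
        isa = [b for b in cls.__bases__ if b is not object]
        hasa = [f.type for f in fields(cls) if is_dataclass(f.type)]
        edges[cls] = {"isa": isa, "hasa": hasa}
        stack.extend(isa + hasa)
    return edges

g = type_graph(D)  # reaches A and C via isa/hasa edges, but never B
```

A real schema generator would then topologically sort this graph before emitting definitions in the target dialect.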
@gc-ss yah, as I was thinking you mean for composed structs/types.
> If this makes sense, I can move these into a different repo and send you a link.
Cool, yeah if you're interested in working on this then for sure.
We can also experiment here around the `tractor` IPC APIs and see how it pans out with tinkering, then move it to a new project.

Up to you; I don't have immediate bandwidth for this.
First holdup with `msgspec` is mentioned in jcrist/msgspec#27: they have no streaming decoder API.

No longer a problem, we just have to write a prefix-framing stream packer; see above.
Hmm, alternatively, to get typing going sooner rather than later we could just make some `pydantic` message type handlers. Pretty sure all we'd need is detection of a `BaseModel`, then serialization with `.dict()` on encode and decoding into a `BaseModel(**dict)` on decode.

Pretty sure we could offer this as an extras dependency as well?
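A sketch of what those handlers might look like, using the stdlib `json` module as a stand-in since `msgpack-python` exposes the same `default=` / `object_hook=` hook pattern on `packb`/`unpackb`; the `Ping` model, `"__model__"` tag key, and registry are all hypothetical, with a plain dataclass standing in for `pydantic.BaseModel`:

```python
import json
from dataclasses import dataclass, asdict, is_dataclass

@dataclass
class Ping:
    # stand-in for a pydantic.BaseModel message type
    cid: str
    seq: int

REGISTRY = {"Ping": Ping}  # known message models, keyed by name

def enc_hook(obj):
    # on encode: implicitly render the model as a tagged dict (a la .dict())
    if is_dataclass(obj):
        return {"__model__": type(obj).__name__, **asdict(obj)}
    raise TypeError(f"can't serialize {obj!r}")

def dec_hook(mapping):
    # on decode: detect the tag and rebuild via Model(**message)
    name = mapping.pop("__model__", None)
    model = REGISTRY.get(name)
    return model(**mapping) if model else mapping

wire = json.dumps(Ping(cid="abc", seq=1), default=enc_hook)
msg = json.loads(wire, object_hook=dec_hook)
```

Swapping `json.dumps`/`json.loads` for `msgpack.packb(obj, default=enc_hook)` / `msgpack.unpackb(data, object_hook=dec_hook)` would give the msgpack version of the same idea.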
Linking explanation from jcrist/msgspec#25
Probably worth noting are dataclass union libs like https://github.com/yukinarit/pyserde
Hilarious to see a writeup of what we've been doing in this repo for years:
https://kobzol.github.io/rust/python/2023/05/20/writing-python-like-its-rust.html#fnref:2

The part on ADTs is particularly notable as part of this feature work.