JSON type errors cause events to be discarded, creating incompatibility with Element and other clients
Opened this issue · 4 comments
This issue was going to originally be about specifically accepting JSON null values for optionally-present fields, but I think in general, it may be the expected behavior that clients be much more tolerant of out-of-spec JSON types than is currently implemented.
It seems that due to the lack of weak-typing-like conversions in the JSON library, a lot of events can potentially be discarded with errors like this:
[json.exception.type_error.302] type must be string, but is null
An example event I saw generated by an unnamed bot, where the thumbnail_info and thumbnail_url fields are null:
{
"type": "m.room.message",
"sender": "<redacted>",
"content": {
"body": "<redacted>",
"info": {
"size": 218987,
"mimetype": "image/png",
"thumbnail_info": null,
"w": 462,
"h": 462,
"thumbnail_url": null
},
"msgtype": "m.image",
"url": "mxc://<redacted>"
},
"origin_server_ts": 1654407062081,
"unsigned": {
"age": 386186
},
"event_id": "<redacted>",
"room_id": "<redacted>"
}
I believe this is a malformed event, but Element, NeoChat and Quaternion will process and render it, while Nheko will discard it. (I did not test any other clients)
I think its tough to decide if/how a malformed event should be dropped or tolerated, but I think these Javascript-like rules make sense, at least for room event JSON content:
- Accepting a partial or empty object (e.g. the thumbnail_info value is empty, {}; or incomplete, {"w":100} with no other fields) as equivalent to the value being non-present -- Although perhaps some structures would need to be handled specially depending on which specific fields are present/valid
- Accepting a primitive type (null, int, string, bool), or array, as equivalent to an empty object or empty array, including the role of empty objects in the above rule -- Object keys that could be numeric do exist in Matrix, but I don't believe any exist in room events, so object<->array inter-compatibility is not needed
- Accepting any floating point number as equivalent to an int, including its role in the above rule as a primitive type -- Matrix does not use floating point numbers, so I believe this is OK -- if a future change or proposal needs it, an explicit request for the original floating point value could be made
- Accepting a non-present, but required key, as equivalent to an empty string / zero value / empty array / empty object -- I am the least sure about the future implications of this as specs change and extensions grow
- Implementing other primitive type (null, int, string, bool) conversions somewhat equivalent to Javascript -- Such that a string representation of a number in a field that is specified as an int could be processed, etc.
This would probably involve wrapping a lot of the JSON library interfaces with getters that allow for conversion, as well as changing a lot of various code to take more care not to assume that all expected object keys will exist.
I think these changes are logical for the following reasons:
- Its reasonably logical, and other implementations are likely to end up working this way, even by accident, given how weak typed languages and JSON libraries tend to work.
- Element seems to work this way in practice already.
- Malformed messages are being generated and do exist (Also see: matrix-org/synapse#8445)
- I believe the proposed behavior has a very low chance of breaking on any future valid messages, requiring a very bad change to the spec, like having a value that can either be null or not-present with different behavior.
Some related matrix.org stuff about clients being forced to deal with un-trusted event data:
- https://github.com/matrix-org/matrix-spec-proposals/blob/v1.2/proposals/2801-untrusted-event-data.md
As far as I can find, Matrix does not specify how clients should respond to out-of-spec event data, but it seems logical to stick to the behavior which most closely matches the most popular implementation, and to avoid cases where any room events can get dropped despite all of the critical info being present and able to be interpreted, and especially in cases where events get dropped just because some optional data is invalid.
There is one part of the spec that specifies behavior in the case that a value is "If not present, null, or empty", despite specifying it as a string, which supports the idea that clients should be tolerant in this way, however.
The current behaviour has protected us from multiple security vulnerabilities, that Element fell for, because they didn't sanitize their data. That's why we are conservative in when we add workarounds to invalid events being sent and tend to add workarounds if a specific form of invalid event is commonly sent.
If there is no popular strict client, every client will need to deal with invalid events. This makes developing for Matrix significantly harder. So imo it is the right thing to do to decline invalid events.
But on the other hands having missing events in your client is not ideal as well. Currently we fix that whenever we find such an issue, but yeah, maybe you are right and we should just allow all invalid events to render...
I'd be interested in seeing info about relevant security vulnerabilities.
However I can't imagine that there would be any security issue created as a result of behavior that is equivalent to a fairly-well-defined mapping from out-of-spec JSON to in-spec event data, considering that the event in either format could have been in the room in the first place.
Also sorry for the big issue ^^ I am getting somewhat familiar with the code but the reach of this is probably far too broad for me to attempt to attempt on my own.
Also sorry for the big issue ^^ I am getting somewhat familiar with the code but the reach of this is probably far too broad for me to attempt to attempt on my own.
Maybe it would be a good idea to just fix the thumbnail_url parsing issue for now to get familiar with the process? :3
However I can't imagine that there would be any security issue created as a result of behavior that is equivalent to a fairly-well-defined mapping from out-of-spec JSON to in-spec event data, considering that the event in either format could have been in the room in the first place.
Well, there was no well defined mapping. Element just parses the events without any verification, which lead to fun bugs where you could make some content only show when replying to the event via Element (and replacing the content you replied with, effectively impersonating you). Similarly Element accepted strings as power levels, which would break the federation of the room between different server implementations. In other clients invalid events make the whole room fail to render. By applying some generic mapping, we are basically pushing the error handling up the stack and higher layers of the application will have to deal with them.
Long term extensible events might help with this, as you can parse only the valid subset of a message. Until then, maybe we could instead define an "InvalidEvent" type, which would in the UI be rendered as clearly being invalid, but could still be inspected via "view source"?
could make some content only show when replying to the event via Element
This sounds very interesting. I think this might be down to accepting semantically invalid messages, rather than just syntactically invalid. Something that is probably possible to accidentally do in the process of making allowances on "required" keys in objects. The general solution of "partial sub-objects are treated as if they weren't even there" should avoid such an issue, otherwise care needs to be taken using common sense knowledge of how to react to data that isn't there
I am definitely not proposing something that would allow a message with a blank or invalid msgtype, for example, to get passed through to the app in some broken form, because the type system should not even allow such a thing.
Ideally, no combination of struct field values will be created that couldn't already have been created using a differently formatted blob of JSON, and that is enough to guarantee that the format corrections cannot introduce any new bugs.
accepted strings as power levels, which would break the federation of the room between different server implementations
If the type got propagated to the actual room state, it is a server issue for sure. If mtxclient is using any JSON data as the direct input to send JSON back, then something like that may be an issue.
pushing the error handling up the stack
I don't think this is true, because all out-of-spec event handling is taken care of completely before it reaches the user application. It is simply the difference between accepting or refusing to pass along a message because it has some out-of-spec format issue, like image dimensions being encoded as a string -- in either case it reaches the app as a concrete type, with the same possible range of values that it always had.
Until then, maybe we could instead define an "InvalidEvent" type
I am unsure about this, for the reason that its not that the event is necessarily invalid. A valid event can be the result of parsing the JSON message, even if it was improperly encoded.
I think I'd prefer, at most, a flag to the application to indicate that some out-of-spec conversion was made to show a visual indicator could be useful for developers, or people who might need to be aware that "This event's encoding is not quite right and may not display/interact the same, or even not be visible at all, in other Matrix clients".
However, any event may appear totally different in another client due to differences in extensions -- but then, maybe another flag could warn the application that there was unknown extension data present in the event.
Maybe it would be a good idea to just fix the thumbnail_url parsing issue for now to get familiar with the process? :3
Incremental improvement sounds like a plan. I think its pretty agreeable that a null value in an optional field is a pretty clear-cut candidate to be handled.