Internal buffering disrupts format-specific deserialization features
mitsuhiko opened this issue ยท 17 comments
Was not sure if I should file this against serde_json
or here. As far as I can tell the bug is more likely in serde itself. While implementing #1179 I noticed that non string keys are not supported for flattening. However the same behavior happens if buffering happens for internally tagged enums:
Example that shows the problem:
#[macro_use]
extern crate serde_derive;
extern crate serde;
extern crate serde_json;
use std::collections::HashMap;
#[derive(Deserialize, Debug)]
#[serde(tag = "type")]
enum E {
M(HashMap<u32, ()>),
}
fn main() {
let j = r#" {"type": "M", "1": null} "#;
println!("{}", serde_json::from_str::<E>(j).unwrap_err());
}
This currently fails with this error:
invalid type: string "1", expected u32
However a HashMap<u32, ()>
is otherwise entirely permissible by serde_json
.
There are some other JSON-specific assumptions baked into Content
too. For example the representation of Serde enums happens to be what serde_json uses but not necessarily shared by other formats.
I suspect we will need a format-specific optional API on Deserializer
for buffering data in a way that is consistent for that format. The default implementation would basically be equivalent to today's Content
but formats could override it to use their own representation. Even serde_json may want to switch and use something like serde_json::Value
for buffering rather than Content
.
We should start by identifying all the ways Content
is used in our generated code and generalize those into an API that can be provided by the deserializer.
This is complicated by associated type defaults being unstable rust-lang/rust#29661. I thought we might want a format-specific buffer type as a Deserializer associated type with the default being Content
, but we will need to find a different way.
Here is a backdoor way to add an associated type.
struct Content;
trait Deserializer {
// Suppose we want something like this but associated type defaults are unstable.
type Buffer = Content;
}
trait Deserializer {
// Instead write it as an associated function that forwards to a generic callback.
fn deserialize_with_buffering<F>(callback: F) -> F::Value
where
F: DeserializeWithBuffer,
{
callback.run::<Content>()
}
}
trait DeserializeWithBuffer {
type Value;
fn run<Buffer>(self) -> Self::Value;
}
Is there any update on this?
Customizing the buffer type by Deserializer could provide a powerful alternative to #1327. The key constraint being satisfied in that approach (as compared to just stashing state in some thread local) was to accurately expose state during deserialization of untagged, internally tagged, and flattened members.
If a Deserializer can replace the buffer type, a library could provide a stateful Deserializer built on TLS by wrapping any existing Deserializer and wrapping that Deserializer's buffer type to correctly track the state.
Let's give this another attempt as associated type defaults (rust-lang/rust#29661) approaches stabilization. I believe the workaround without associated type defaults in #1354 could have been made to work but would have been too complicated to maintain.
Is it possible that this issue explains the behavior described in nox/serde_urlencoded#66? In that case, the author is trying to deserialize the key-value pair bar=123
into a variant B { bar: i32 }
in an untagged enum. It seems like this should work. Deserialization fails because "data did not match any variant of untagged enum Q2". When using a similar enum Q1
where the variant is B { bar: String }
, it works as expected, with bar
= "123"
.
Yeah that looks like a consequence of this issue. #[serde(with = "serde_with::rust::display_fromstr")]
on the field (https://docs.rs/serde_with/1.4.0/serde_with/rust/display_fromstr/index.html) would work around it in that case but obviously is not ideal.
I have the following seemingly simple usage of flatten
with an enum in the Inner
struct. Is this crash also related to this issue? I see most failures with flatten
are reffered to here, but haven't seen this particular case yet.
use serde::{Deserialize, Serialize};
#[derive(Serialize, Deserialize)]
enum E {
A,
B,
}
#[derive(Serialize, Deserialize)]
struct Inner {
e: E,
}
#[derive(Serialize, Deserialize)]
struct Outer {
#[serde(flatten)]
inner: Inner,
}
fn main() {
let yaml = "e: !A";
// Works fine:
let _inner: Inner = serde_yaml::from_str(&yaml).unwrap();
// Fails at runtime with:
// e: untagged and internally tagged enums do not support enum input
let _outer: Outer = serde_yaml::from_str(&yaml).unwrap();
}
The docs have a sentence saying
"It is supported only within structs that have named fields, and the field to which it is applied must be a struct or map type."
Maybe that can/should be updated to mention that enums anywhere in the flattened struct (also more deeply nested) are not allowed either (at least when deserializing YAML).
@RagnarGrootKoerkamp that sounds like it is unrelated to flattening, does it work when you just try to deserialize that string as Inner
(because it that fails with the same error, then it is unrelated to flattening, otherwise it's a bug in the flatten impl afaik)
@zseri Yes, deserializing to Inner
works fine. Updated the example above.
Should I create a separate issue then?
This might be related: dtolnay/serde-yaml#388
Yaml can treat a value of 42 as a string. This works without flatten, but fails with it.
Stepping on the code in gdb, the interesting transition seems to be:
#0 serde::__private::de::content::{impl#15}::deserialize_string<serde_yaml::error::Error, serde::de::impls::StringVisitor> (self=..., visitor=...)
at /home/espindola/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde-1.0.188/src/private/de.rs:1260
#1 0x000055555556516d in serde::de::impls::{impl#8}::deserialize<serde::__private::de::content::ContentDeserializer<serde_yaml::error::Error>> (deserializer=...)
at /home/espindola/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde-1.0.188/src/de/impls.rs:582
#2 0x00005555555681f0 in serde::de::{impl#5}::deserialize<alloc::string::String, serde::__private::de::content::ContentDeserializer<serde_yaml::error::Error>> (
deserializer=...) at /home/espindola/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde-1.0.188/src/de/mod.rs:794
#3 0x0000555555571b68 in serde::__private::de::{impl#10}::next_value_seed<serde_yaml::error::Error, core::marker::PhantomData<alloc::string::String>> (
self=0x7fffffffd340, seed=...) at /home/espindola/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde-1.0.188/src/private/de.rs:2814
#4 0x00005555555716c6 in serde::de::MapAccess::next_value<serde::__private::de::FlatStructAccess<serde_yaml::error::Error>, alloc::string::String> (
self=0x7fffffffd340) at /home/espindola/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde-1.0.188/src/de/mod.rs:1865
#5 0x0000555555563867 in serde_yaml::_::{impl#0}::deserialize::{impl#2}::visit_map<serde::__private::de::FlatStructAccess<serde_yaml::error::Error>> (__map=...)
at src/main.rs:3
#6 0x0000555555571e31 in serde::__private::de::{impl#8}::deserialize_struct<serde_yaml::error::Error, serde_yaml::_::{impl#0}::deserialize::__Visitor> (self=...,
fields=&[&str](size=1) = {...}, visitor=...) at /home/espindola/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde-1.0.188/src/private/de.rs:2674
#7 0x0000555555563300 in serde_yaml::_::{impl#0}::deserialize<serde::__private::de::FlatMapDeserializer<serde_yaml::error::Error>> (__deserializer=...)
at src/main.rs:3
#8 0x0000555555564d80 in serde_yaml::_::{impl#0}::deserialize::{impl#2}::visit_map<&mut serde_yaml::de::MapAccess> (__map=0x7fffffffd6d0) at src/main.rs:8
We go from using deserialize_string from serde_yaml, to using one from serde, which doesn't have the required logic.
Let's give this another attempt as associated type defaults (rust-lang/rust#29661) approaches stabilization. I believe the workaround without associated type defaults in #1354 could have been made to work but would have been too complicated to maintain.
It seems like rust-lang/rust#29661 still has a way to go, so in case #![feature(return_position_impl_trait_in_trait)]
is stabilized earlier, perhaps something like the following could also work:
pub trait Deserializer<'de>: Sized {
/* ... */
fn deserialize_buffered(self) ->
Result<impl Clone + Deserializer<'de>, Self::Error>
{
crate::__private::de::Content::deserialize(self)
}
}
Are there other options that could perhaps be used to provide this feature earlier, since several serde-supporting formats have to battle buffering-based issues until then?
@dtolnay Since return impl trait in traits will now be stabilised soon, do you think that using it to solve this issue would be an avenue worth pursuing?
Any update on this matter? Anything we can do to assist?
I'll repeat the last comment:
Any update on this matter? Anything we can do to assist?