Reduce stack size required for deserialization [Help]
We have a method `deserialize` that is a wrapper over the `serde_cbor::de::from_slice_with_scratch` function. The code is written for an MCU environment, hence we are using `no_std` APIs. The method is:
```rust
pub fn deserialize<'a, T: Deserialize<'a>>(slice: &'a [u8]) -> Result<T, SerdeError> {
    info!("stack HWM before serde call: %d", Task::current().unwrap().get_stack_high_water_mark());
    let value = serde_cbor::de::from_slice_with_scratch(slice, &mut [])
        .map_err(|_| SerdeError::DeserializationFailed);
    info!("stack HWM after serde call: %d", Task::current().unwrap().get_stack_high_water_mark());
    value
}
```
We have added log lines to understand the stack requirement. We receive commands over BLE and deserialize the data. As per the current architecture, we do three levels of deserialization.
The following are the logs for the first command sent over BLE:

```
[15:28:59.842] : [elf::utils::serde_module] stack HWM before serde call: 2852
[15:28:59.844] : [elf::utils::serde_module] stack HWM after serde call: 1562
[15:28:59.845] : [elf::utils::serde_module] stack HWM before serde call: 1562
[15:28:59.847] : [elf::utils::serde_module] stack HWM after serde call: 1470
[15:28:59.850] : [elf::utils::serde_module] stack HWM before serde call: 2905
[15:28:59.851] : [elf::utils::serde_module] stack HWM after serde call: 768
```
The following are the logs for the second command sent over the same BLE connection:

```
[15:29:21.260] : [elf::utils::serde_module] stack HWM before serde call: 1470
[15:29:21.262] : [elf::utils::serde_module] stack HWM after serde call: 1470
[15:29:21.264] : [elf::utils::serde_module] stack HWM before serde call: 1470
[15:29:21.265] : [elf::utils::serde_module] stack HWM after serde call: 1470
[15:29:21.266] : [elf::utils::serde_module] stack HWM before serde call: 768
[15:29:21.268] : [elf::utils::serde_module] stack HWM after serde call: 768
```
My query: Is there a way to reduce stack consumption?
We are working with the nRF52840 SoC and FreeRTOS as our OS.
Pasting the high water mark definition here:

> The minimum amount of remaining stack space that was available to the task since the task started executing – that is, the amount of stack that remained unused when the task stack was at its greatest (deepest) value.

In other words, a lower HWM after a call means the call pushed the task stack deeper; for example, the first call above grew peak stack usage by about 1290 bytes (2852 − 1562).
@pyfisch
Do you have any data around the memory and CPU consumption during serialization and deserialization?
> My query: Is there a way to reduce stack consumption?
I don't think so, if you use serde-cbor: Serde uses recursion to do deserialization, so every nested structure adds at least one more frame to the stack.
Since CBOR is really simple, you can write your own serializer/deserializer and achieve lower resource usage. In that case I would avoid recursion and store intermediate results on the heap.
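A rough, untested sketch of what that could look like (definite-length items only; tags, floats, and indefinite lengths are omitted, and all the names here are made up): an explicit `Vec` on the heap replaces the recursive call stack.

```rust
// Iterative CBOR walker sketch: nesting depth is tracked in a heap
// Vec instead of recursive calls. Returns None on malformed or
// unsupported input.
extern crate alloc;
use alloc::vec::Vec;

enum CborEvent<'a> {
    Uint(u64),
    NegInt(u64),    // encodes -1 - n
    Bytes(&'a [u8]),
    Text(&'a [u8]), // UTF-8 not validated in this sketch
    ArrayStart(u64),
    MapStart(u64),
}

// Reads the argument that follows an initial byte (RFC 7049, section 2).
fn read_arg(data: &[u8], pos: &mut usize, info: u8) -> Option<u64> {
    let n = match info {
        0..=23 => return Some(u64::from(info)),
        24 => 1,
        25 => 2,
        26 => 4,
        27 => 8,
        _ => return None, // indefinite lengths not handled here
    };
    let bytes = data.get(*pos..*pos + n)?;
    *pos += n;
    Some(bytes.iter().fold(0u64, |acc, &b| (acc << 8) | u64::from(b)))
}

// Walks one CBOR item (and all of its children), emitting events.
// `pending` holds, per open container, how many child items remain.
fn walk<'a>(data: &'a [u8], mut emit: impl FnMut(CborEvent<'a>)) -> Option<()> {
    let mut pos = 0;
    let mut pending: Vec<u64> = Vec::new(); // the explicit "call stack"
    loop {
        let initial = *data.get(pos)?;
        pos += 1;
        let (major, info) = (initial >> 5, initial & 0x1f);
        let arg = read_arg(data, &mut pos, info)?;
        let mut children = 0u64;
        match major {
            0 => emit(CborEvent::Uint(arg)),
            1 => emit(CborEvent::NegInt(arg)),
            2 | 3 => {
                let end = pos.checked_add(arg as usize)?;
                let s = data.get(pos..end)?;
                pos = end;
                emit(if major == 2 { CborEvent::Bytes(s) } else { CborEvent::Text(s) });
            }
            4 => {
                emit(CborEvent::ArrayStart(arg));
                children = arg;
            }
            5 => {
                emit(CborEvent::MapStart(arg));
                children = arg.checked_mul(2)?; // keys and values
            }
            _ => return None, // tags / simple values / floats omitted
        }
        if children > 0 {
            pending.push(children);
        } else {
            // The item just finished was a leaf (or an empty container);
            // close every container that it completes.
            while let Some(remaining) = pending.last_mut() {
                *remaining -= 1;
                if *remaining > 0 {
                    break;
                }
                pending.pop();
            }
            if pending.is_empty() {
                return Some(());
            }
        }
    }
}
```

The `pending` vector plays the role that nested visitor calls play in serde, so nesting depth is bounded by heap space rather than stack space.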
> Do you have any data around the memory and CPU consumption during serialization and deserialization?
No, I don't collect such data for microcontrollers.
@wildarch added no-std support to this crate, maybe he can give you better advice on how to use serde on microcontrollers.
Serde has a little-known (and slightly hidden) feature where it can deserialize straight into the target struct. For this you can use `Deserialize::deserialize_in_place`; it is hidden from the docs, but you can see it here. I was in a similar situation with nested objects and it worked like a charm for me.
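Roughly, a call looks like this (untested and from memory; `Command` and `decode_into` are placeholder names, and the empty scratch buffer matches your wrapper above):

```rust
use serde::Deserialize;

// Placeholder type. With derive, the deserialize_in_place impl is
// generated when serde_derive's "deserialize_in_place" feature is on.
#[derive(Deserialize, Default)]
struct Command {
    id: u32,
    payload_len: u16,
}

fn decode_into(buf: &[u8], target: &mut Command) -> Result<(), serde_cbor::error::Error> {
    let mut de = serde_cbor::de::Deserializer::from_slice_with_scratch(buf, &mut []);
    // Writes the fields straight into `target` instead of building a
    // new Command on the stack and returning it by value.
    serde::de::Deserialize::deserialize_in_place(&mut de, target)?;
    de.end()
}
```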
Hope this helps 😄
@wildarch Thanks for the quick response. I really appreciate you helping me out. I have a few more questions:

1. I have enabled the "deserialize_in_place" feature for serde_derive. Do I need to enable the "alloc" feature on serde for it to work?
   `serde_derive = { version = "1.0.103", default-features = false, features = ["deserialize_in_place"] }`
2. Does it have any adverse impact on the heap if we turn on the alloc feature?
3. Can you please share a sample usage for the same?
I used the existing code from serde_cbor to test this feature, but I didn't observe any difference in stack consumption. My cargo dependencies:

```toml
serde = { version = "1.0", default-features = false, features = ["derive", "alloc"] }
serde_derive = { version = "1.0.103", default-features = false, features = ["deserialize_in_place"] }
serde_cbor = { version = "0.10.2", default-features = false }
```
```rust
pub fn deserialize_in_place<'a, T: Deserialize<'a>>(slice: &'a [u8], mut place: T) -> Result<T, Error> {
    let mut deserializer = Deserializer::from_slice_with_scratch(slice, &mut []);
    de::Deserialize::deserialize_in_place(&mut deserializer, &mut place)?;
    deserializer.end()?;
    Ok(place)
}
```
Unfortunately the code I wrote using this is proprietary and I've since left the company, so I'll have to answer this off the top of my head 😅.

- I don't think you need the alloc feature. You can try to compile without it and see if it works?
- You can check whether intermediate stack objects are being created by deserializing into a global variable; that way you shouldn't see any large stack allocations. Also, try not to pass your `T` by value into `deserialize_in_place`, that might help too (see the sketch after this list).
- Make sure that all types and layers of your struct support `deserialize_in_place`, as serde will fall back to the normal `deserialize` call for types that don't implement it. If you `#[derive(Deserialize)]` on all your structs and don't use any non-core base types, you should be fine.
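On the second point: your wrapper above takes `place: T` by value, which moves a whole `T` through the wrapper's stack frame on the way in and out. A borrowing variant (untested; `deserialize_into` is a made-up name) avoids that and lets `place` point at a long-lived or global target:

```rust
use serde::de::Deserialize;
use serde_cbor::de::Deserializer;
use serde_cbor::error::Error;

// Writes into `place` through a reference, so no value of type T is
// moved through this frame. If `place` refers to a global, guard the
// access with a mutex or critical section as appropriate.
pub fn deserialize_into<'a, T: Deserialize<'a>>(slice: &'a [u8], place: &mut T) -> Result<(), Error> {
    let mut deserializer = Deserializer::from_slice_with_scratch(slice, &mut []);
    Deserialize::deserialize_in_place(&mut deserializer, place)?;
    deserializer.end()
}
```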
@PraneshAnubhav this might be a dumb question -- the docs for `from_slice_with_scratch` say:

> Users should generally prefer to use from_slice or from_mut_slice over this function, as decoding may fail when the scratch buffer turns out to be too small.
>
> A realistic use case for this method would be decoding in a no_std environment from an immutable slice that is too large to copy.

https://docs.rs/serde_cbor/0.10.2/serde_cbor/de/fn.from_slice_with_scratch.html

Yet you are passing an empty slice `&mut []` as the scratch buffer:

```rust
info!("stack HWM before serde call: %d", Task::current().unwrap().get_stack_high_water_mark());
let value = serde_cbor::de::from_slice_with_scratch(slice, &mut [])
```
Doesn't that seem harmful, since you are giving the implementation no scratch space at all to use?
Did you try either making a heap allocation for the scratch space, or using some global mutable memory, perhaps thread-local or behind a lock if you cannot use the heap at all? I imagine that providing scratch space may reduce stack pressure.
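For example, something along these lines (a sketch assuming a global allocator is configured so heap allocation is available; `SCRATCH_LEN` and the function name are made up, and the size is something you would tune to your payloads):

```rust
extern crate alloc;
use alloc::vec;
use serde::de::Deserialize;
use serde_cbor::error::Error;

// Hypothetical size: large enough for the longest string you expect;
// per the docs, decoding may fail if the scratch turns out too small.
const SCRATCH_LEN: usize = 256;

pub fn deserialize_with_heap_scratch<'a, T: Deserialize<'a>>(slice: &'a [u8]) -> Result<T, Error> {
    // The scratch buffer lives on the heap, so it adds nothing to the
    // task stack, unlike a local [u8; N] array would.
    let mut scratch = vec![0u8; SCRATCH_LEN];
    serde_cbor::de::from_slice_with_scratch(slice, &mut scratch)
}
```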