Code size unnecessarily grows per each interface
Closed this issue · 5 comments
Hi, as mentioned last week, I decided to transition from usbd-hid to this crate. I was able to conveniently recreate the functionality that I previously implemented using usbd-hid, which now is:
UsbHidClassBuilder::new()
.add_interface(KeyboardInterface::default_config())
.add_interface(MouseInterface::default_config())
.add_interface(ConsumerInterface::default_config())
.build(bus)
For reference, my implementation that used usbd-hid was basically creating 3 separate instances of usbd_hid::HIDClass with different settings and reports, then passing all of these to usb_dev.poll(&[...])
.
Comparing my binary size before and after the transition I noticed that the size increased by around 5 kB for a release build with opt-level = "s"
and lto = "fat"
). I can see that usbd-hid is much simpler and lacks implementation of some features, like GET/SET_IDLE and GET_REPORT, but 5 kB is a huge difference when having 64 kB of flash and already using over 40 kB.
Problem details
I did some investigation and have the following findings:
usbd-human-interface-device/src
takes around 3.5 kBusbd-hid-0.6.1/src
takes 982 B- some code of usbd-hid might have gone "into my application source code", but the difference is at most 200 B (my code with usbd-hid was 200 B larger)
- a lot of code size comes from increased
fmt
size (1.8 kB before, 3.5 kB after) - usbd-human-interface-device introduced
str/mod.rs
taking 370 B (two functions:core::str::slice_error_fail_rt
andcore::str::slice_error_fail
) - changes to the sizes of the rest of modules are minimal
Now for some details here are some screenshots from the reports generated with elf-size-analyze:
As you can see, there are two separate problems. One is all the formatting code, I tried removing all the calls to log::info!
, etc. but this made no difference so all the logging seems to be optimized out as unused, so my guess is that it is due to panics in the code, but I will try to investigate it further.
Code duplication
The second issue is that there is a lot of code duplication due to usage of generics. Each interface has separate implementation of InterfaceClass
and these functions are not optimized out.
Some of these functions are just "trampolines" e.g.
08005be8 <<usbd_human_interface_device::device::keyboard::BootKeyboardInterface<B> as usbd_human_interface_device::interface::InterfaceClass>::get_idle::h0e5b58f60dacd33f>:
8005be8: b580 push {r7, lr}
8005bea: af00 add r7, sp, #0
8005bec: f7fe fbca bl 8004384 <<usbd_human_interface_device::interface::raw::RawInterface<B> as usbd_human_interface_device::interface::InterfaceClass>::get_idle::h3398285bc300a45b>
8005bf0: bd80 pop {r7, pc}
08005c4c <<usbd_human_interface_device::device::consumer::ConsumerControlInterface<B> as usbd_human_interface_device::interface::InterfaceClass>::reset::h9cc9a8850225cee5>:
8005c4c: b580 push {r7, lr}
8005c4e: af00 add r7, sp, #0
8005c50: f7fe fb6c bl 800432c <<usbd_human_interface_device::interface::raw::RawInterface<B> as usbd_human_interface_device::interface::InterfaceClass>::reset::hed6d4de382068139>
8005c54: bd80 pop {r7, pc}
08005c56 <<usbd_human_interface_device::device::consumer::ConsumerControlInterface<B> as usbd_human_interface_device::interface::InterfaceClass>::set_report::h14ad00db97731e9b>:
8005c56: b580 push {r7, lr}
8005c58: af00 add r7, sp, #0
8005c5a: f7fe f9fb bl 8004054 <<usbd_human_interface_device::interface::raw::RawInterface<B> as usbd_human_interface_device::interface::InterfaceClass>::set_report::h81c003fa6383d9b7>
8005c5e: bd80 pop {r7, pc}
or simple inlined ones:
08005bfa <<usbd_human_interface_device::device::keyboard::BootKeyboardInterface<B> as usbd_human_interface_device::interface::InterfaceClass>::get_protocol::h1da62ffbb86f55b1>:
8005bfa: 21ff movs r1, #255 ; 0xff
8005bfc: 31c8 adds r1, #200 ; 0xc8
8005bfe: 5c40 ldrb r0, [r0, r1]
8005c00: 4770 bx lr
08005c90 <<usbd_human_interface_device::device::consumer::ConsumerControlInterface<B> as usbd_human_interface_device::interface::InterfaceClass>::get_protocol::hb7313df2838bd99a>:
8005c90: 21ff movs r1, #255 ; 0xff
8005c92: 31c8 adds r1, #200 ; 0xc8
8005c94: 5c40 ldrb r0, [r0, r1]
8005c96: 4770 bx lr
This is really problematic when using more than one HID interface because each interface will add these 154 bytes of code size while it just uses the exact code from its self.inner
interface with a different report type and some sugar on top (write/read_report, default_config).
Size optimization attempts
I made an attempt add a level of indirection to avoid implementations of InterfaceClass for all of these interfaces, and instead use AsInterfaceClass
which just returns reference to self.inner
with type &dyn InterfaceClass
. This is of course trading runtime performance for code size, but I believe that it's better to aim for size on no_std size-constrained embedded systems, and the runtime cost of this additional indirection should be minimal.
A PoC of the changes can be seen here. It additionally contains get_string
because I struggled to get the lifetimes for this method correct. Anyway, here is the result:
While this in fact eliminated the 3 implementations, it just swiped the code size into other places making the whole binary even bigger (by 2.2 kB). The change in usbd-human-interface-device/src
itself is only by ~0.2 kB.
As you can see the size is not split into RawInterface
and ManagedInterface
, and it looks like a lot of code size is for implememntations of PackedStruct::pack
. The problem is that ManagedInterface
only modifies 2 methods of RawInterface
so there is still a lot of unnecessary code size penalty for using it.
But as I said, this was only 0,2 kB in usbd-human-interface-device/src
and the rest of code size seems to go to fmt
(3.5 kB -> 4.9 kB, some Debug
formatting and hex formatting) and into packed_struct-0.10.0/src
(0 -> 344 B).
I wonder if PackedStruct
is a good choice here. Looking at dissassembly the packing implementation seems quite complicated with a lot of conditional branching, while packing of the HID reports should be pretty simple. Maybe it would be better to try re-using gen_hid_descriptor from usbd-hid-macros. Under the hood it seems to use serde with ssmarshal.
I will continue to investigate the problem. Hopefully it would be possible to make the code size comparable to usbd-hid, as I really like usbd-human-interface-device because it is more feature-complete and quite convenient to use, but the code size increase is a no-go for me as I already struggle with binary size. @dlkj do you have some ideas how to solve the size issue?
Thanks for lookin at this. So far I'd not considered the size aspect of the library. You make some good points and I'll have a good look at this as well.
I agree that dropping PackedStruct
is probably the way forward. It was very convenient during development but ultimately not worth the costs you have demonstrated.
Will have a look when I find some spare time.
I've had a quick look. I think the wins are:
- Removing panics (although you can get similar savings by switching to
panic-halt
orpanic-reset
) - Swapping
packed_struct
for something simpler - Easing off on the generics and wrapped classes
- Hunting down anything else that uses
core::fmt
The current code base was written to be right and easy to use. I'm pretty happy with the public interfaces, but I'm forming a plan around simplifying the internals. Your comments about the code sizes are another reason to look at the general implementation again.
I've made some good savings in the examples by using opt-level = "z"
and moving to panic-reset
. The logging seems to be cleanly removed by the compiler when disabled.
I've been using cargo-bloat
to dig into this. It seems to concur with your script.
I am currently using panic-probe and it looks like there is no difference when I switch to panic-halt
, so panic-reset
probably also wouldn't change much. I also tried opt-level = "z"
but the difference isn't huge, and I have some tight loop that might be taking advantage from a bit of vectorization.
I know cargo-bloat
and I also use it, but it's easier for me to diff code size changes using my script with the tree-view base on symbols' source files. In fact I wanted to implement code size diffing in my script, but I've never yet found time for that.
I feel like the way to go is to first replace packed_struct
. I suspect that all the error handling that packed_struct
might be doing may also have some impact on core::fmt
size. I started some work to replace it with usbd-hid-macros
on my branch. The changes are very much WIP, but I'm linking it here just so that you know, and to avoid potential work duplication if you decide to do something in this direction. I hope that I'll be able to have some results to compare tomorrow.
Btw, I did quick USB compliance tests using USB3CV tool and it looks like the tests passed. On the other hand, usbd-hid
has some failures as it doesn't implement some requests like GET/SET_IDLE and GET_REPORT). But I didn't spend much time on this, so I might have missed something important, I may do some more testing in the future, but in general I don't think this is very important unless someone specifically needs to pass these tests.
Looks like I found the cause of fmt
impact on size and was wrong about packed_struct
.
First things first, I finished the transition from packed_struct
and a PoC version looks like this, but it turned out to be even worse :/ But just in case, the code will be on poc/usbd-hid-macros.
Then I did more investigation of the fmt
calls in the binary itself and eventually I've found that it was fault in my implementation. I accidentally called .unwrap()
on UsbHidError
somewhere in my code without doing .map_err(drop)
(or something similar) first. This pulled all the formatting machinery.
So now the remaining problem is the code size duplication when using multiple interfaces, I'll give an update when I find a way to avoid it (probably need to use &dyn InterfaceClass
somewhere instead of generics).