dlkj/usbd-human-interface-device

Code size unnecessarily grows per each interface

Closed this issue · 5 comments

Hi, as mentioned last week, I decided to transition from usbd-hid to this crate. I was able to conveniently recreate the functionality that I previously implemented using usbd-hid, which now is:

UsbHidClassBuilder::new()
    .add_interface(KeyboardInterface::default_config())
    .add_interface(MouseInterface::default_config())
    .add_interface(ConsumerInterface::default_config())
    .build(bus)

For reference, my implementation that used usbd-hid was basically creating 3 separate instances of usbd_hid::HIDClass with different settings and reports, then passing all of these to usb_dev.poll(&[...]).

Comparing my binary size before and after the transition I noticed that the size increased by around 5 kB for a release build with opt-level = "s" and lto = "fat"). I can see that usbd-hid is much simpler and lacks implementation of some features, like GET/SET_IDLE and GET_REPORT, but 5 kB is a huge difference when having 64 kB of flash and already using over 40 kB.

Problem details

I did some investigation and have the following findings:

  • usbd-human-interface-device/src takes around 3.5 kB
  • usbd-hid-0.6.1/src takes 982 B
  • some code of usbd-hid might have gone "into my application source code", but the difference is at most 200 B (my code with usbd-hid was 200 B larger)
  • a lot of code size comes from increased fmt size (1.8 kB before, 3.5 kB after)
  • usbd-human-interface-device introduced str/mod.rs taking 370 B (two functions: core::str::slice_error_fail_rt and core::str::slice_error_fail)
  • changes to the sizes of the rest of modules are minimal

Now for some details here are some screenshots from the reports generated with elf-size-analyze:

  • usbd-human-interface-device:
    v1_usbd-human-interface-device
  • usbd-hid:
    v1_usbd-hid
  • comparison of fmt size
    v1_fmt

As you can see, there are two separate problems. One is all the formatting code, I tried removing all the calls to log::info!, etc. but this made no difference so all the logging seems to be optimized out as unused, so my guess is that it is due to panics in the code, but I will try to investigate it further.

Code duplication

The second issue is that there is a lot of code duplication due to usage of generics. Each interface has separate implementation of InterfaceClass and these functions are not optimized out.

Some of these functions are just "trampolines" e.g.

08005be8 <<usbd_human_interface_device::device::keyboard::BootKeyboardInterface<B> as usbd_human_interface_device::interface::InterfaceClass>::get_idle::h0e5b58f60dacd33f>:
 8005be8:	b580      	push	{r7, lr}
 8005bea:	af00      	add	r7, sp, #0
 8005bec:	f7fe fbca 	bl	8004384 <<usbd_human_interface_device::interface::raw::RawInterface<B> as usbd_human_interface_device::interface::InterfaceClass>::get_idle::h3398285bc300a45b>
 8005bf0:	bd80      	pop	{r7, pc}

08005c4c <<usbd_human_interface_device::device::consumer::ConsumerControlInterface<B> as usbd_human_interface_device::interface::InterfaceClass>::reset::h9cc9a8850225cee5>:
 8005c4c:	b580      	push	{r7, lr}
 8005c4e:	af00      	add	r7, sp, #0
 8005c50:	f7fe fb6c 	bl	800432c <<usbd_human_interface_device::interface::raw::RawInterface<B> as usbd_human_interface_device::interface::InterfaceClass>::reset::hed6d4de382068139>
 8005c54:	bd80      	pop	{r7, pc}

08005c56 <<usbd_human_interface_device::device::consumer::ConsumerControlInterface<B> as usbd_human_interface_device::interface::InterfaceClass>::set_report::h14ad00db97731e9b>:
 8005c56:	b580      	push	{r7, lr}
 8005c58:	af00      	add	r7, sp, #0
 8005c5a:	f7fe f9fb 	bl	8004054 <<usbd_human_interface_device::interface::raw::RawInterface<B> as usbd_human_interface_device::interface::InterfaceClass>::set_report::h81c003fa6383d9b7>
 8005c5e:	bd80      	pop	{r7, pc}

or simple inlined ones:

08005bfa <<usbd_human_interface_device::device::keyboard::BootKeyboardInterface<B> as usbd_human_interface_device::interface::InterfaceClass>::get_protocol::h1da62ffbb86f55b1>:
 8005bfa:	21ff      	movs	r1, #255	; 0xff
 8005bfc:	31c8      	adds	r1, #200	; 0xc8
 8005bfe:	5c40      	ldrb	r0, [r0, r1]
 8005c00:	4770      	bx	lr

08005c90 <<usbd_human_interface_device::device::consumer::ConsumerControlInterface<B> as usbd_human_interface_device::interface::InterfaceClass>::get_protocol::hb7313df2838bd99a>:
 8005c90:	21ff      	movs	r1, #255	; 0xff
 8005c92:	31c8      	adds	r1, #200	; 0xc8
 8005c94:	5c40      	ldrb	r0, [r0, r1]
 8005c96:	4770      	bx	lr

This is really problematic when using more than one HID interface because each interface will add these 154 bytes of code size while it just uses the exact code from its self.inner interface with a different report type and some sugar on top (write/read_report, default_config).

Size optimization attempts

I made an attempt add a level of indirection to avoid implementations of InterfaceClass for all of these interfaces, and instead use AsInterfaceClass which just returns reference to self.inner with type &dyn InterfaceClass. This is of course trading runtime performance for code size, but I believe that it's better to aim for size on no_std size-constrained embedded systems, and the runtime cost of this additional indirection should be minimal.

A PoC of the changes can be seen here. It additionally contains get_string because I struggled to get the lifetimes for this method correct. Anyway, here is the result:
v2_usbd-human-interface-device
While this in fact eliminated the 3 implementations, it just swiped the code size into other places making the whole binary even bigger (by 2.2 kB). The change in usbd-human-interface-device/src itself is only by ~0.2 kB.

As you can see the size is not split into RawInterface and ManagedInterface, and it looks like a lot of code size is for implememntations of PackedStruct::pack. The problem is that ManagedInterface only modifies 2 methods of RawInterface so there is still a lot of unnecessary code size penalty for using it.

But as I said, this was only 0,2 kB in usbd-human-interface-device/src and the rest of code size seems to go to fmt (3.5 kB -> 4.9 kB, some Debug formatting and hex formatting) and into packed_struct-0.10.0/src (0 -> 344 B).

I wonder if PackedStruct is a good choice here. Looking at dissassembly the packing implementation seems quite complicated with a lot of conditional branching, while packing of the HID reports should be pretty simple. Maybe it would be better to try re-using gen_hid_descriptor from usbd-hid-macros. Under the hood it seems to use serde with ssmarshal.

I will continue to investigate the problem. Hopefully it would be possible to make the code size comparable to usbd-hid, as I really like usbd-human-interface-device because it is more feature-complete and quite convenient to use, but the code size increase is a no-go for me as I already struggle with binary size. @dlkj do you have some ideas how to solve the size issue?

dlkj commented

Thanks for lookin at this. So far I'd not considered the size aspect of the library. You make some good points and I'll have a good look at this as well.

I agree that dropping PackedStruct is probably the way forward. It was very convenient during development but ultimately not worth the costs you have demonstrated.

Will have a look when I find some spare time.

dlkj commented

I've had a quick look. I think the wins are:

  • Removing panics (although you can get similar savings by switching to panic-halt or panic-reset)
  • Swapping packed_struct for something simpler
  • Easing off on the generics and wrapped classes
  • Hunting down anything else that uses core::fmt

The current code base was written to be right and easy to use. I'm pretty happy with the public interfaces, but I'm forming a plan around simplifying the internals. Your comments about the code sizes are another reason to look at the general implementation again.

I've made some good savings in the examples by using opt-level = "z" and moving to panic-reset. The logging seems to be cleanly removed by the compiler when disabled.

I've been using cargo-bloat to dig into this. It seems to concur with your script.

I am currently using panic-probe and it looks like there is no difference when I switch to panic-halt, so panic-reset probably also wouldn't change much. I also tried opt-level = "z" but the difference isn't huge, and I have some tight loop that might be taking advantage from a bit of vectorization.

I know cargo-bloat and I also use it, but it's easier for me to diff code size changes using my script with the tree-view base on symbols' source files. In fact I wanted to implement code size diffing in my script, but I've never yet found time for that.

I feel like the way to go is to first replace packed_struct. I suspect that all the error handling that packed_struct might be doing may also have some impact on core::fmt size. I started some work to replace it with usbd-hid-macros on my branch. The changes are very much WIP, but I'm linking it here just so that you know, and to avoid potential work duplication if you decide to do something in this direction. I hope that I'll be able to have some results to compare tomorrow.

Btw, I did quick USB compliance tests using USB3CV tool and it looks like the tests passed. On the other hand, usbd-hid has some failures as it doesn't implement some requests like GET/SET_IDLE and GET_REPORT). But I didn't spend much time on this, so I might have missed something important, I may do some more testing in the future, but in general I don't think this is very important unless someone specifically needs to pass these tests.

Looks like I found the cause of fmt impact on size and was wrong about packed_struct.

First things first, I finished the transition from packed_struct and a PoC version looks like this, but it turned out to be even worse :/ But just in case, the code will be on poc/usbd-hid-macros.

Then I did more investigation of the fmt calls in the binary itself and eventually I've found that it was fault in my implementation. I accidentally called .unwrap() on UsbHidError somewhere in my code without doing .map_err(drop) (or something similar) first. This pulled all the formatting machinery.

So now the remaining problem is the code size duplication when using multiple interfaces, I'll give an update when I find a way to avoid it (probably need to use &dyn InterfaceClass somewhere instead of generics).

dlkj commented

Closing this due to #76 and the improvments made in v0.4.1. All buffers sizes are now defined at compile time using type level generics.