How I started working on network device drivers
lukego opened this issue · 7 comments
This is a personal anecdote / war story about how I came to start writing ethernet device drivers. This is naturally a bit self-indulgent, and I apologise for that in advance, but it might help to provide some useful historical context for Snabb too.
It all started around 2011 at a tiny startup company called Teclo Networks when we were building the product that would become known as Sandvine TCP Accelerator. We had just written an extremely customized TCP/IP stack from scratch and deployed this on a single server to optimize all the 3G internet traffic for a whole country. (Juho Snellman has told that whole story.) The next problem on the agenda was to revisit our I/O interfaces and make sure they would work well for future deployments.
Our initial product release used 10G NICs from Myricom in the "Sniffer10G" firmware mode. These NICs and their drivers were absolutely excellent and a joy to use. (The software interface was so simple that we didn't ever care it was proprietary.) On the other hand we had some new requirements in the pipeline and some concerns about the future:
- We should support high traffic rates spread across many 1G ethernet ports, but Myricom only supported 10G.
- We should always deploy a hardware bypass function so that our appliances are resilient to hardware failures even when deployed inline. We were already using Napatech optical bypass adapters but these would not work for 1G copper deployments.
- We should have a simple and future-proof solution. Myricom seemed to be struggling commercially and were changing their licensing policies. We were especially concerned about potentially having to include per-device license files in our support and deployment routines (one more thing that could go wrong.)
We surveyed the available hardware and decided the best option would be Silicom network cards. These were available for 1G and 10G, in both passive-optical and active-copper bypass configurations, and they had excellent port density to put our PCIe capacity to use. The only problem was that they used Intel ethernet controller chips and so we would have to find a replacement for the Myricom Sniffer10G software library that we had been happily using in the past.
So what software library should we use for I/O?
The idea of writing our own drivers did not even cross our minds at this point, even though I had written an Intel HDAudio driver in Forth for the OLPC XO firmware only a couple of years earlier. Instead we started looking for off-the-shelf solutions... DPDK was not open source and we were too small to properly engage with Intel, ntop.org PF_RING DNA had awkward licensing (the price was fine but having to manage the license files was not), and the various Linux kernel interfaces for high-speed memory-mapped I/O didn't perform well enough. (We didn't consider netmap... I suspect that it didn't exist yet.)
While we evaluated these ideas we stumbled upon the Intel datasheets and started to have ideas for better ways to do things ourselves. I remember reading them on the bus to work and having ideas for hacks like packetblaster.
Then at some point we just did it. I hacked a proof-of-concept to load a Linux kernel driver and then "perform a lobotomy" by disabling interrupts so that it would become completely passive. Then I poked the descriptor ring registers to point at a block of reserved physical memory and started driving DMA in poll-mode from our userspace process. (This was inspired by Luca Deri's work.) I showed that to Juho, half joking, and he very quickly whipped up a production-ready version that fit all of our requirements perfectly. So that became our standard I/O option going forward and we replaced the old Myricom cards with Silicom/Intel ones. It felt like a great hack!
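For readers who have not seen a poll-mode driver before, a toy Python model of the idea may help. It is only a sketch and not the code from the proof-of-concept: `ToyNic`, `PollModeDriver`, `make_ring` and the `done` flag are hypothetical names, plain Python objects stand in for DMA memory and device registers, and the head/tail convention is a simplified assumption rather than a copy of any datasheet.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RxDescriptor:
    buffer: bytearray      # stands in for a DMA buffer in reserved memory
    length: int = 0        # number of bytes the device wrote
    done: bool = False     # set by the device when the buffer is filled


def make_ring(slots: int, bufsize: int) -> List[RxDescriptor]:
    """Allocate a receive ring of empty descriptors (hypothetical helper)."""
    return [RxDescriptor(bytearray(bufsize)) for _ in range(slots)]


class ToyNic:
    """Simulated device: fills descriptors between head and tail."""

    def __init__(self, ring: List[RxDescriptor]):
        self.ring = ring
        self.head = 0      # next descriptor the device will fill
        self.tail = 0      # written by the driver; the device stops here

    def receive(self, frame: bytes) -> bool:
        if self.head == self.tail:
            return False   # no descriptors available: the frame is dropped
        desc = self.ring[self.head]
        n = min(len(frame), len(desc.buffer))
        desc.buffer[:n] = frame[:n]
        desc.length = n
        desc.done = True
        self.head = (self.head + 1) % len(self.ring)
        return True


class PollModeDriver:
    """No interrupts: the application calls poll() whenever it wants packets."""

    def __init__(self, nic: ToyNic):
        self.nic = nic
        self.next = 0                      # next descriptor expected to complete
        nic.tail = len(nic.ring) - 1       # hand all but one slot to the device

    def poll(self) -> List[bytes]:
        packets = []
        ring = self.nic.ring
        while ring[self.next].done:
            desc = ring[self.next]
            packets.append(bytes(desc.buffer[:desc.length]))
            desc.done = False
            desc.length = 0
            self.nic.tail = self.next      # give the slot back to the device
            self.next = (self.next + 1) % len(ring)
        return packets
```

With a four-slot ring the simulated device can accept three frames before it has to drop, and a single `poll()` call returns them in order and makes the slots available again. This model copies each payload out of the ring for simplicity.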
Later when I started working on Snabb the obvious first step was to call Silicom and order 20x10G ports (at mate's rates -- thanks!) and write a whole new driver from scratch and see what interesting things we could do with it. And the rest is history-in-the-making as we do lots of nice hacking in the Snabb community using custom drivers as the bottom level of our foundation.
(I'm sure that I am misremembering this and should be crediting more clever hacks to Juho, Tobias Rittweiler, Ties Stuij, Christophe Rhodes, and Sean Hinde. Likely I didn't really reserve memory at boot on the PoC but simply picked some at random. It was a long time ago now!)
Nostalgia mode... I found this old correspondence from 2011:
@lukego: I reflected that it seems a bit odd to be so impressed with Myricom. As far as I know their hardware is no better than anybody else's and the only major difference is that they wrote a device driver in user-space. I've written some device drivers and they are not really that hard. Shouldn't proper hackers like us be able to whip up our own drivers for e.g. the DL380's Broadcom chips or Intel's 10G chips and get the same performance as Myricom?
@jsnell: Implementing our own fully userspace 10G driver seems like madness to me, but I could be overestimating the amount of stupid crap that you'll end up having to deal with :-)
Could be the jury is still out on this ;-)
I also considered writing a driver for the Intel 82599 for similar reasons around 2012 (?), but I was scared by the perceived complexity after I found some documents on the proprietary version of DPDK.
DPDK was running on bare metal back then (and that still shows in parts of the API today). I was like "nope, don't want to write a whole OS". Turns out that approach is neither necessary nor useful and user space drivers are easy.
Anyways, thanks for Snabb! I've learned a lot about user space drivers and LuaJIT from your code :)
"Some new requirements in the pipeline and some concerns about the future" is perhaps understating things a bit. By far the best opportunity we had in the sales pipeline needed 4 pairs of 1G copper cards + bypass, and we did not think they'd accept some Rube Goldberg network integration with multiple switches for one machine. So it was more like corporate life or death.
My recollection is that the timeline was something like:
- Mid-January you wrote the initial POC. Took about a week from first code to two NICs ping-ponging an initial packet that was manually inserted to the descriptor ring.
- End of January a slightly modified version of that POC was installed as an alternate IO backend in the product, but only capable of passing traffic through unmodified.
- End of February we had a fully productionized version with all kinds of niceties like zero-copy even for packets we buffered for indefinite amounts of time. (And even today not doing a zero copy on packet retention would cost that product about 20% performance; I know that for Snabb you can treat memory copies as basically free, but for whatever reason that's not the case here.)
- Sometime in March it was installed for a live trial with half the operator's traffic.
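The zero-copy retention mentioned in the timeline can be illustrated with a small Python sketch. It shows one common approach, assumed here for illustration and not taken from the product: when a packet has to be kept, its buffer is detached from the ring and a fresh buffer from a pool takes its place, so the payload is never copied. `BufferPool`, `RetainingReceiver`, `take` and `done` are hypothetical names.

```python
from typing import List


class BufferPool:
    """Fixed set of packet buffers shared by the ring and retained packets."""

    def __init__(self, count: int, size: int):
        self.free: List[bytearray] = [bytearray(size) for _ in range(count)]

    def alloc(self) -> bytearray:
        if not self.free:
            raise MemoryError("buffer pool exhausted")
        return self.free.pop()

    def release(self, buf: bytearray) -> None:
        self.free.append(buf)


class RetainingReceiver:
    """Keeps received packets without copying their payload."""

    def __init__(self, pool: BufferPool, slots: int):
        self.pool = pool
        # Each ring slot owns one buffer that the device would fill.
        self.ring: List[bytearray] = [pool.alloc() for _ in range(slots)]

    def take(self, slot: int, length: int) -> memoryview:
        """Detach the filled buffer and re-arm the slot with a fresh one."""
        buf = self.ring[slot]
        self.ring[slot] = self.pool.alloc()
        # The view refers to the same memory the device wrote into.
        return memoryview(buf)[:length]

    def done(self, packet: memoryview) -> None:
        """Return a retained packet's buffer to the pool."""
        buf = packet.obj          # underlying bytearray of the view
        packet.release()
        self.pool.release(buf)
```

The trade-off in this sketch is that retained packets hold on to pool buffers for as long as they are kept, so the pool has to be sized for the ring plus the worst-case number of buffered packets.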
One reason it ended up being such a smooth process is that the stub kernel driver could still be used to support ifconfig, ethtool, libnl, etc. So all of the existing O&M infrastructure just worked with no changes. It was almost a classic data plane vs. control plane split, except with the roles inverted :) Exporting the NICs as normal kernel interfaces, just ones that do no traffic, is to my mind still the right way to do things.
And what about the life or death situation? Well, the sale happened, which was nice. But it turns out the customer had a habit of paying invoices months or years late. And we were on a strict runway: there was an exact date at which we'd need to start irrevocably winding down the company. In the end we were paid with a couple of weeks of runway left. If writing, integrating and testing those user-space drivers from scratch had taken any more than the ~6 weeks it did, it would not have been a happy ending.
So in retrospect that project was probably madness; the schedule was far too tight, we just didn't realize it and got lucky. And there was a lot of stupid crap over the years that could be traced to this decision, just like predicted. (Like the hilarious time we crashed all of a WiMax operator's base stations due to a MAC stripping bug in the userspace driver, or the mystery of the processes that ran at half speed if and only if they had direct /dev/mem mappings both above and below address 0x400000000, or depending on hardware features we really would have been better off without just because we could). But in all honesty the density of stupid crap was roughly the same even after that system was switched to using DPDK.
@emmericp You're welcome! I love what you are doing with ixy now too :). I'd love a universe where we compete to reduce line count rather than to absorb new vendor code the fastest :).
I know you are billing it as educational but ixy looks like the beginnings of something that we in Snabb-land could contribute to and potentially use as a vehicle to share drivers with other projects. Our trouble with existing drivers isn't that they are written in C - that's no obstacle for us - but that they want to dictate our core data structures, go wild with feature creep, act as a vector for vendors to inject code into our applications without meaningful review, etc.
But we are not in a hurry to retire our built-in Lua drivers either and in these early days it may be best to have multiple "competing" approaches to feed each other with ideas.
@jsnell The D-I-Y pitch could be "Writing drivers is much less crazy than running thinly-capitalized startup companies, and people do that all the time."
It's really hard to assess the buy-vs-build trade offs. Do the complex frameworks really get you off the hook with hard problems or does it just mean you will be debugging them in production with a large unfamiliar codebase? If it were the former then it would probably be worth holding your nose and drinking the kool-aid.
I reported one extremely nasty bug on DPDK that I reckon would have been beyond my capabilities to find in that gigantic codebase in a production setting. It was memory corruption in an external process triggered by a very rare race condition ultimately due to a missing memory barrier. The sort of thing that might happen once per month and could have been anywhere (e.g. in the vswitch, in QEMU, in the kernel inside a VM, etc.)
I spent an easter weekend tracking that down even in the tiny Snabb code base and with a controlled test environment. I wondered if using DPDK would have saved me that pain but, nope, it would have amplified it because they had exactly the same bug in a much more complex codebase.
I hope I saved somebody some grey hairs by reporting that one before it reached production :).
EDIT: I misremembered the details of that bug. It was a hang rather than a memory corruption. The principle is the same though :).
"I was scared by the perceived complexity after I found some documents on the proprietary version of DPDK."
This is such a huge barrier, right? People see complex driver code and say "wow, hardware is complex!" But the main source of the visible complexity is the software frameworks that the drivers have to interface with e.g. Linux kernel and DPDK. This is a little known secret and has been for decades already.
I was influenced by contributing to Openfirmware on OLPC. There you always have a simple Forth driver in the firmware (used mostly for hardware debugging) and a complex driver in Linux (used by general applications.) The firmware drivers were nice because you write them and then they are done. The kernel drivers looked more like an endless task of keeping in sync with upstream, debugging "is it hardware or is it software?" problems in a complex environment, etc.
The hardware was sometimes funny to program too :). My firmware audio driver needed to play the startup jingle when you press the power button. I wanted to do this without special interrupt handlers but the hardware could only loop a sample and not automatically stop when done. I couldn't poll because the CPU was needed for loading the kernel etc. I think that I ultimately grumbled about interrupt-centric hardware and did some polling for completion on a generic timer interrupt :).
I didn't want to plug my projects in your thread. It's really only meant for education because that allows us to avoid implementing lots of boring stuff and keep the driver and example apps simple. Example apps like DPDK's l2fwd are just horrible if you only want to see how it works.
I'm also trying to find students to re-implement my ixgbe driver in some other languages; I'm currently thinking of OCaml (because of MirageOS), Rust, some JVM language and maybe Python.
Regarding Snabb drivers: I've wondered how Snabb would work with DPDK as a backend.
Yes, that sounds like a horrible abomination ;)
I have a student doing a bachelor's thesis implementing a proof of concept for a "DPDKDeviceApp" and wrappers for the Intel apps. The goal is to run lwaftr with DPDK drivers without modifications to lwaftr, though that is probably out of scope/too much work for the thesis.