C++ low-latency Order Book Joshua McKenzie joshua.mckenzie@gmail.com This was written as an exercise to test out a variety of techniques in an ultra-low-latency environment. I limited myself to 5 days from clean slate to full implementation, so some concessions were made on the implementation - exception handling being the primary one. Goals: Write human readable view of in-memory book every 10 messages Write machine-readable series of mid-quotes On each trade message, write total qty traded at most recent trade price Gracefully handle various garbage messages How to generate sample data and get performance data out of the software: OrderGenerator.pl should generate a stream of orders that you can use as input. An easy reference for looking at performance metrics: ./run_test.sh Or alternatively: ./OrderGenerator.pl > Orders.txt make profile ./OrderBookProcessor Orders.txt > test.out 2>&1 less -n test.out Development and testing environment: 2.7GHz, quad-core i7 920, HT enabled Ubuntu 13.10 3.8.0-19-generic #30-Ubuntu SMP gcc version 4.7.3 (Ubuntu/Linaro 4.7.3-1ubuntu1) Ubuntu clang version 3.2-1~exp9ubuntu1 boost 1.54.0 perl v5.14.2 Data Structure choices and Performance Aspects: Meta data structure: std::map of doubly-linked lists of orders, one per side std::unordered_map of all orders to get memory address of underlying order object Due to needint to keep sorted order on the book, I went with a std::map, rbtree implementation for the following performance profiles: Insert: logN price + hash order id Modify: logN to 2logN price + hash order id, depending on if price changed Delete: logN + hash Trade: 2logN price + variable hashing order id, based on # of orders blown through The data scope of the logN operation is limited to the price levels rather than being wide-open to all order_id's which should limit the scope of the logN cost we have to pay on lookup. Originally I considered a vector with integer-based indexing on price with pre-reserved memory, however calculating midpoint on each update and checking for a crossed book would represent a worst-case O(n) on *every single order*, which was clearly unacceptable. Changing from the vector to the map added about 40% latency or so to orderAdd and the other types which I wasn't too thrilled about, but it's necessary. Better logN on insert and staying in sorted order than paying NlogN on every insert just to get O(1) on trade message, though the thought crossed my mind as your knowledge about market conditions is often-times based on getting a fill before the market- data update hits. The map is of ulonglong price values. On inbound messages, I parse and *= 100 the value to avoid double and/or float comparisons in the order book and get single integer comparison on the map. While this only supports prices up to 92.2 quadrillion, I chose to go that route to bypass whole part / fract part Fixed class creation and comparison, as well as to support the initial design of integer based vector offset lookup mentioned above. This class design would need to be augmented to support a Fixed class with a functor comparison object to allow the std::map to sort and compare via that key to have full double price support at the top of the price range. It would be as easy as plugging the Fixed class into the OrderBookMap and templatizing the price value throughout the OrderBook in order to change over to that class. In a production environment, I'd ask my client if it was acceptable for them to max at 92.2 quadrillion in price, and if so, go on with things. The map has doubly-linked lists of pointers to orders hanging off them. Rather than paying a O(n) worst-case to traverse that and finding an order, I created a hash for the orders that should average to O(1). I didn't do analysis on the default hash bucketing that took place, but integer hashing on order_id with a monotonically increasing list of #'s should be pretty evenly distributed. I didn't use a standard container here as I wanted to be able to hash directly to a memory address and ask the container to delete by that address but still retain time priority. I'd be interested to see how shared_ptr's behaved from boost instead of raw pointers - as it is now the pointer maintenance inside the OrderBook represents both a high risk and a point of high mental tax on future work on the code-base. I went with boost::spirit::qi parsing as it's demonstrably faster than atoi and atof in all the benchmarks I was pulling up (up to 5x faster). strtok still reigns supreme, though it means just about everything in there is non-const thanks to underlying buffer modification. Logging is done in a separate thread, since fprintf to stderr was taking a pretty painful amount of time on just printing the book. Printing a midquote every order means we need fast IO every order... downside? Crashing means lost logging state. So don't crash. ;) I didn't use exception handling in this exercise due to time constraints on implementation. The Logger burns a core. Given the various calls median < 1 usec, any amount of sleeping at usec resolution could get our logging very very backed up. As to why I burned a core instead of using a conditional variable - those things can take a long time to get scheduled by the kernel - up to 10 usec or so from what I can recall off the top of my head. Certainly an order of magnitude longer than the time horizons I'm looking at working with. The book printing process is pretty mangled at this point, sprintf'ing into temporary buffers and then finally logging it. fprintf(stderr ...) on every message / sub-message for book-level printing was taking 1+ ms every 10 orders. I managed to get it down to ~ 150 usec per print, depending on how deep the book is. That could be further optimized (char* on heap, logger deletes, variadic template arguments into logger instead w/different behavior per data type, etc) but for now, I left it as is as I ran out of time. It DOES make the printBook function very ugly however. ** 2 points of interest: 1) The OrderGenerator.pl should be parameterized. Right now it's hard-coded and can produce some very odd books 2) The program doesn't run with CLANG as there's apparently a known bug with 11x threads and clang that causes an exception on thread instantiation with binding a function. Compile w/clang for clean errors, switch to gcc for release. 2.A) UPDATE: Apparently they've fixed it since I wrote this. Will be interesting to do some benchmarking w/CLANG once I'm not on a VM and can do some bare-metal testing 3) RDTSC on an Oracle VM will give you incorrect results as they virtualize calls to the TSC. 3+ usec on most operations in this book instead of the 300-600 nanosecond on bare metal. Performance statistics: MidQuote print went from 3100 nanoseconds down to 679 median by implementing threaded logger_. Note: io on critical path is bad (except network / necessity). Book print is still very very heavy on perf... sitting at 150-200 usec or so, once per 10 orders when the book is at its heaviest. Not too happy about that, but optimized down from 1.3ms on printf based solution... The rollowing metrics are in nanoseconds addOrder: samples: 137872 median: 484 95th: 658 99th: 1005 99.99th: 7106 modifyOrder: samples: 33944 median: 539 95th: 773 99th: 1014 99.99th: 6339 removeOrder: samples: 31645 median: 547 95th: 705 99th: 886 99.99th: 7279 trade: samples: 105995 median: 1252 95th: 1650 99th: 2238 99.99th: 10186 Makefile: make release - build the release without performance metric gathering or debug symbols make profile - build the release with rdtsc performance metric gathering but no debug symbols make debug - build with asserts available (though not widely used), debug printing that was compiled out, book updates on every valid message