/OrderBook

C++ low-latency in-memory order book

Primary LanguageC++

C++ low-latency Order Book
Joshua McKenzie
joshua.mckenzie@gmail.com

This was written as an exercise to test out a variety of techniques in
an ultra-low-latency environment.  I limited myself to 5 days from
clean slate to full implementation, so some concessions were made on
the implementation - exception handling being the primary one.

Goals:
   Write human readable view of in-memory book every 10 messages
   Write machine-readable series of mid-quotes
   On each trade message, write total qty traded at most recent trade price
   Gracefully handle various garbage messages

How to generate sample data and get performance data out of the software:
   OrderGenerator.pl should generate a stream of orders that you can
      use as input.
   An easy reference for looking at performance metrics:
      ./run_test.sh
   Or alternatively:
      ./OrderGenerator.pl > Orders.txt
      make profile
      ./OrderBookProcessor Orders.txt > test.out 2>&1
      less -n test.out

Development and testing environment:
   2.7GHz, quad-core i7 920, HT enabled
   Ubuntu 13.10
   3.8.0-19-generic #30-Ubuntu SMP
   gcc version 4.7.3 (Ubuntu/Linaro 4.7.3-1ubuntu1)
   Ubuntu clang version 3.2-1~exp9ubuntu1
   boost 1.54.0
   perl v5.14.2

Data Structure choices and Performance Aspects:
   Meta data structure:
      std::map of doubly-linked lists of orders, one per side
      std::unordered_map of all orders to get memory address of
         underlying order object
   Due to needint to keep sorted order on the book, I went with a
      std::map, rbtree implementation for the following performance
      profiles:
         Insert: logN price + hash order id
         Modify: logN to 2logN price + hash order id, depending on if
            price changed
         Delete: logN + hash
         Trade: 2logN price + variable hashing order id, based on # of
            orders blown through
      The data scope of the logN operation is limited to the price
         levels rather than being wide-open to all order_id's which
         should limit the scope of the logN cost we have to pay on
         lookup.
   Originally I considered a vector with integer-based indexing on
      price with pre-reserved memory, however calculating midpoint on
      each update and checking for a crossed book would represent a
      worst-case O(n) on *every single order*, which was clearly
      unacceptable.  Changing from the vector to the map added about 40%
      latency or so to orderAdd and the other types which I wasn't too
      thrilled about, but it's necessary.  Better logN on insert and
      staying in sorted order than paying NlogN on every insert just to
      get O(1) on trade message, though the thought crossed my mind as
      your knowledge about market conditions is often-times based on
      getting a fill before the market- data update hits.
   The map is of ulonglong price values.  On inbound messages, I parse
      and *= 100 the value to avoid double and/or float comparisons in
      the order book and get single integer comparison on the map.  While
      this only supports prices up to 92.2 quadrillion, I chose to go
      that route to bypass whole part / fract part Fixed class creation
      and comparison, as well as to support the initial design of integer
      based vector offset lookup mentioned above.  This class design
      would need to be augmented to support a Fixed class with a functor
      comparison object to allow the std::map to sort and compare via
      that key to have full double price support at the top of the price
      range.  It would be as easy as plugging the Fixed class into the
      OrderBookMap and templatizing the price value throughout the
      OrderBook in order to change over to that class.  In a production
      environment, I'd ask my client if it was acceptable for them to max
      at 92.2 quadrillion in price, and if so, go on with things.
   The map has doubly-linked lists of pointers to orders hanging off
      them.  Rather than paying a O(n) worst-case to traverse that and
      finding an order, I created a hash for the orders that should
      average to O(1).  I didn't do analysis on the default hash
      bucketing that took place, but integer hashing on order_id with a
      monotonically increasing list of #'s should be pretty evenly
      distributed.  I didn't use a standard container here as I wanted to
      be able to hash directly to a memory address and ask the container
      to delete by that address but still retain time priority.
   I'd be interested to see how shared_ptr's behaved from boost
      instead of raw pointers - as it is now the pointer maintenance
      inside the OrderBook represents both a high risk and a point of
      high mental tax on future work on the code-base.
   I went with boost::spirit::qi parsing as it's demonstrably faster
      than atoi and atof in all the benchmarks I was pulling up (up to 5x
      faster).
   strtok still reigns supreme, though it means just about everything
      in there is non-const thanks to underlying buffer modification.
   Logging is done in a separate thread, since fprintf to stderr was
      taking a pretty painful amount of time on just printing the book.
      Printing a midquote every order means we need fast IO every
      order... downside?  Crashing means lost logging state.  So don't
      crash.  ;) I didn't use exception handling in this exercise due to
      time constraints on implementation.
   The Logger burns a core.  Given the various calls median < 1 usec,
      any amount of sleeping at usec resolution could get our logging
      very very backed up.
   As to why I burned a core instead of using a conditional variable -
      those things can take a long time to get scheduled by the kernel -
      up to 10 usec or so from what I can recall off the top of my head.
      Certainly an order of magnitude longer than the time horizons I'm
      looking at working with.
   The book printing process is pretty mangled at this point,
      sprintf'ing into temporary buffers and then finally logging it.
      fprintf(stderr ...) on every message / sub-message for book-level
      printing was taking 1+ ms every 10 orders.  I managed to get it
      down to ~ 150 usec per print, depending on how deep the book is.
      That could be further optimized (char* on heap, logger deletes,
      variadic template arguments into logger instead w/different
      behavior per data type, etc) but for now, I left it as is as I ran
      out of time.  It DOES make the printBook function very ugly
      however.
   ** 2 points of interest:
      1) The OrderGenerator.pl should be parameterized.  Right now
         it's hard-coded and can produce some very odd books
      2) The program doesn't run with CLANG as there's apparently a
         known bug with 11x threads and clang that causes an exception on
         thread instantiation with binding a function.  Compile w/clang
         for clean errors, switch to gcc for release.
      2.A) UPDATE: Apparently they've fixed it since I wrote this.
         Will be interesting to do some benchmarking w/CLANG once I'm not
         on a VM and can do some bare-metal testing
      3) RDTSC on an Oracle VM will give you incorrect results as they
         virtualize calls to the TSC.  3+ usec on most operations in this
         book instead of the 300-600 nanosecond on bare metal.

Performance statistics:
   MidQuote print went from 3100 nanoseconds down to 679 median by
      implementing threaded logger_.  Note: io on critical path is bad
      (except network / necessity).
   Book print is still very very heavy on perf... sitting at 150-200
      usec or so, once per 10 orders when the book is at its heaviest.
      Not too happy about that, but optimized down from 1.3ms on printf
      based solution...

   The rollowing metrics are in nanoseconds
   addOrder:
      samples: 137872
      median:  484
      95th:    658
      99th:    1005
      99.99th: 7106
   modifyOrder:
      samples: 33944
      median:  539
      95th:    773
      99th:    1014
      99.99th: 6339
   removeOrder:
      samples: 31645
      median:  547
      95th:    705
      99th:    886
      99.99th: 7279
   trade:
      samples: 105995
      median:  1252
      95th:    1650
      99th:    2238
      99.99th: 10186

Makefile:
   make release - build the release without performance metric
      gathering or debug symbols
   make profile - build the release with rdtsc performance metric
      gathering but no debug symbols
   make debug - build with asserts available (though not widely used),
      debug printing that was compiled out, book updates on every valid
      message