google/stenographer

Stenographer Performance Issue

DominoTree opened this issue · 6 comments

Been running into a wall trying to get Stenographer to write more than 500-700 MiB/s to disk while monitoring 40Gb/s of traffic off of four bonded 10GbE interfaces.

We are running with 20 threads on a 44-core Xeon machine with 256GB of RAM and 20 raw 2TB SATA3 disks spread across three PCIe x8 controllers.

We have used ext3 and XFS with similar results, and have adjusted basically every tunable we can find, including trying both the cfq and deadline I/O schedulers.

Any thoughts? The machines don't seem to be disk I/O-bound, CPU-bound, or even memory-bound. Pasted some strace stats below.

$ strace -cp 5034 -f

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 92.81  402.716075        8396     47966           clock_nanosleep
  3.75   16.284029         326     49875           poll
  1.96    8.508364      146696        58           restart_syscall
  1.48    6.404961         476     13457           io_submit
  0.00    0.018308           0    178853           clock_gettime
  0.00    0.004958           0     50392           io_getevents
  0.00    0.000030           0       134           getsockopt
  0.00    0.000000           0       134           write
------ ----------- ----------- --------- --------- ----------------
100.00  433.936725                340869           total

Hey, interesting workload! I'd specifically try the following (all flags to be added to the "Flags" list in the config, which are passed directly to stenotype; there's a sketch of what that config might look like after this list):

  1. --preallocate_file_mb=4096 : makes sure that async IO is actually async... for some filesystems/kernels, if the file isn't preallocated the async IO will only happen sequentially
  2. --aiops=512 : defaults to 128, the max number of async writes to allow at once... if this makes a difference you can go up even higher
  3. higher blocksize, fewer blocks... right now the default is 2048 blocks of 1MB apiece, for 2G per thread. You could try for example 512 4MB blocks (--blocks=512 --blocksize_kb=4096) and see what happens.
  4. IRQ balancing, to balance your IRQs across processors (https://linux.die.net/man/1/irqbalance)
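For concreteness, here's a rough sketch of what the config could look like with suggestions 1-3 dropped into the Flags list. This assumes the standard /etc/stenographer/config layout from the repo's README; the paths, interface name, and port are placeholders, in practice you'd have one Threads entry per disk (20 in your case), and you'd want to add each flag on its own first rather than all at once:

{
  "Threads": [
    { "PacketsDirectory": "/disk0/packets",
      "IndexDirectory": "/disk0/index",
      "MaxDirectoryFiles": 30000,
      "DiskFreePercentage": 10
    }
  ],
  "StenotypePath": "/usr/bin/stenotype",
  "Interface": "bond0",
  "Port": 1234,
  "Flags": [
    "--preallocate_file_mb=4096",
    "--aiops=512",
    "--blocks=512",
    "--blocksize_kb=4096"
  ],
  "CertPath": "/etc/stenographer/certs"
}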

Really interested to see if any of those help you out... I expect the first is most likely to actually help, and I would recommend trying each in isolation to see what's up. It would also be good to get the MB/s going to each disk... I wonder if the AF_PACKET load balancing is overprovisioning a single thread/disk.
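One quick way to watch per-disk throughput (assuming the sysstat package is installed) is extended iostat output in MB/s, sampled every few seconds:

$ iostat -xm 5

If the wMB/s column is heavily skewed toward one or two of the 20 disks while the rest sit mostly idle, that would point at the load-balancing theory above.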

I'd be interested in the PCIe bus in general, as well... you've got 4x10Gb coming in and 3x8-lane coming out... for output, 3x8-lane maxes out around 192Gbps (at roughly 8Gbps per PCIe 3.0 lane), which should be fine, but is it using the same controller as the 4x10Gbps input? 80Gbps across one controller should still be manageable (40 in, 40 out) but probably stresses the bus more than it's used to.

Oh, I also noticed you said ext3... I'd specifically try ext4 to get extents. I'm not familiar with XFS; it might have them as well? Also, note that #1 is especially important for XFS, see the comment on preallocate_file_mb and XFS in https://github.com/google/stenographer/blob/a12106bc615a8fb2761829f3d099dbfc3f641950/INSTALL.md

Wow, thanks for this! --preallocate_file_mb=4096 seems to have gotten us to 2.5GiB/s throughput and I don't seem to be dropping packets. I'm gonna let this run overnight and see what happens. Right now I'm running off of a build of the master branch from this afternoon.

We'll probably end up switching to ext4 just because it's what our other boxes run on.

Not entirely sure about where the 4x10Gb incoming ends up - if this seems to have any issues I'll dig in more - that's where I was gonna look next.

Thanks again, this is great!

Going to mark this closed for now, but if you're willing to provide a final update, I'd love to hear how your throughput is after the above changes. Also, feel free to reach out with any further questions or to reopen this issue if your required throughput is not yet attained.

Quick ping on this... If you've kicked the tires more, I'd love to hear your current performance numbers 😊

Haven't done anything further past the --preallocate_file_mb=4096 setting, but we're sustaining nearly 3GiB/s write speeds on each machine, and queries still return reasonably quickly (for result sets that aren't huge, anyway).

Currently working on an automated system to pull pcaps out of Stenographer based on IDS alerts from Suricata.
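Roughly, for each alert it'll run a stenoread query along these lines (the hosts, port, time window, and output filename here are made-up placeholders from a hypothetical Suricata alert; the -w flag just gets passed through to tcpdump to write the pcap):

$ stenoread 'host 10.1.2.3 and host 192.0.2.45 and port 443 and after 5m ago' -w alert-1234.pcap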