logstash-plugins/logstash-codec-netflow

Ingesting IPFIX Flows From YAF Oddly Slow

Zalgo2462 opened this issue · 8 comments

I've been benchmarking the netflow codec for a tool that stitches IPFIX/ netflow data together into bi-directional sessions. The codec seems to run well when ingesting data from SonicWall and Cisco devices. However, the codec runs oddly slow when ingesting flows produced by YAF.

Test Machine Specifications:

  • DigitalOcean CPU Optimized Instance
  • 32GB of RAM
  • 16 vCPUs
  • 16 udp workers
  • 64MB UDP buffer (rmem_max)
  • LS_JAVA_OPTS=-Xms8g -Xmx8g
  • Null logstash output
  • logstash-codec-netflow version 4.1.0
  • logstash version 6.2.4
  • XPack Monitoring

Known benchmarks:

  • netflow_bench_cisco_asr.py: 18200 flows/ second
  • netflow_bench_cisco_asa.py: 11500 flows/ second

First, I ran yaf with the following command: yaf --uniflow --in test.pcap --out my.logstash.ip --ipfix-port 2055 --ipfix udp and saw that I was only processing 3000 flows/ second.

I did some packet captures and saw that about one fifth of the data was templates. In order to make a better comparison to known benchmarks, I created a similar script (had to change file extension to txt).

With the script, the codec manages 4000 flows/ second.

Running yaf with --silk doesn't seem to dramatically change the performance characteristics. I didn't create a test script for silk style flows, but when running the tool, adding --silk resulted in 200 less flows/ second on average.

I know it is not IPFIX specific since I created a similar test script (can be provided) for flows based on SonicWall's IPFIX implementation. The SonicWall IPFIX benchmark managed 17500 flows/ second.

I know it is not template-processing related since the benchmark only sends the template once.

The YAF flows encode timestamps as flowStartMilliseconds which need a simple conversion to iso8601, while the sonicwall flows only include systemInit relative timestamps, so they don't need a conversion. I thought this may be the cause, but netflow v9 also requires the timestamp conversion on the cisco data sets. The cisco data sets perform fine.

The YAF flows do have a subTemplateList but that should be efficiently handled since the field is labeled :skip in the definitions file.

I thought the data size of each flow may affect the speed at which the flows are processed. However, the YAF benchmark bundles 25 flows per packet with an overall packet size of 1415 bytes, while the cisco asr benchmark bundles 21 flows per packet with an overall packet size of 1392 bytes. The cisco asr actually sends larger flows (~66 bytes) to logstash than YAF does (~56 bytes). Yet, YAF is processed much slower.

The YAF flows do include an enterprise field, but the field flowAttributes is well accounted for in the definitions file.

I've looked through the code and I can't work out what is causing the slow down. However, I'm not too good at ruby to be honest.

The script also includes the wireshark descriptions of the template and data packets.

After more benchmarking, it seems as if processing YAF flows doesn't scale vertically.

The following was captured using Digital Ocean CPU-Optimized Droplets while running MongoDB, Logstash, Elasticsearch, and Kibana. Elasticsearch and Kibana were needed in order to collect the data.

The system was benchmarked with the scripts here as well as with two yaf commands run directly on a sample pcap file.

The system caps out at around 13000 flows/ second due to the MongoDB output plugin and database in the pipeline.

flows_ sec vs workers

codec version RAM -xmx -xms GB udp buffer MB vCPU Workers CPU Optimized Mongo YAF flows /sec sonicWALL IPFIX flows/ sec sonicWALL v9 flows/ sec Cisco ASA flows/ sec Cisco ASR flows/ sec yaf sample.pcap --no-tombstone --no-stats --silk flows/ sec yaf sample.pcap --no-tombstone --no-stats flows/ sec
4.1.0 32 8 64 16 16 1 1 4000 13250 11500 11500 13500 2850 3000
4.1.0 16 6 64 8 8 1 1 3600 10000 6750 11100 11750 2500 2800
4.1.0 16 6 64 6* 6 1 1 3500 8700 5600 10000 12900 2500 2650
4.1.0 16 6 64 4* 4 1 1 3100 6200 3750 6800 9200 2100 2350
4.1.0 16 6 64 2* 2 1 1 2350 3200 1900 3600 4750 1550 1650

* CPU limited via docker cpu scheduler on 8 core machine

EDIT: Added more details for each benchmark

YAF flows /sec sonicWALL IPFIX flows/ sec sonicWALL v9 flows/ sec Cisco ASA flows/ sec Cisco ASR flows/ sec
Flows/ packet 25 5 5 14 21
Packet Size 1415 285 289 1452 1392
Rough Avg Flow Size 56.6 57 57.8 103.7142857 66.28571429

The YAF flows use flowStartMilliseconds (152) and flowEndMilliseconds (153) fields (the Cisco flows don't use them) so take a slightly other path in the code:

case k.to_s
when /^flow(?:Start|End)Seconds$/
event[@target][k.to_s] = LogStash::Timestamp.at(v.snapshot).to_iso8601
when /^flow(?:Start|End)(Milli|Micro|Nano)seconds$/
case $1
when 'Milli'
event[@target][k.to_s] = LogStash::Timestamp.at(v.snapshot.to_f / 1_000).to_iso8601
when 'Micro', 'Nano'
# For now we'll stick to assuming ntp timestamps,
# Netscaler implementation may be buggy though:
# https://bugs.wireshark.org/bugzilla/show_bug.cgi?id=11047
# This only affects the fraction though
ntp_seconds = (v.snapshot >> 32) & 0xFFFFFFFF
ntp_fraction = (v.snapshot & 0xFFFFFFFF).to_f / 2**32
event[@target][k.to_s] = LogStash::Timestamp.at(Time.utc(1900,1,1).to_i + ntp_seconds, ntp_fraction * 1000000).to_iso8601
end

Thanks for looking at this!

A very similar code path is taken for the Cisco v9 however.

when SWITCHED
millis = flowset.uptime - v
seconds = flowset.unix_sec - (millis / 1000)
# v9 did away with the nanosecs field
micros = 1000000 - (millis % 1000)
event[@target][k.to_s] = LogStash::Timestamp.at(seconds, micros).to_iso8601

In the YAF IPFIX case, we have a division and a timestamp conversion (line 388). In the netflow case, we have more arithmetic operations and a timestamp conversion. Nonetheless, the netflow data scales well.

I've created a new benchmark script that compares the two. https://github.com/logstash-plugins/logstash-codec-netflow/blob/master/spec/codecs/benchmarks/flowStartMilliseconds.rb

From that benchmark it seems v.snapshot is solely responsible for the difference in performance.
Other differences like .to_f, or the regex matches on the field names don't seem to contribute to the difference.
It runs each above (simplified) snippet 2 million times,

implementation result
IPFIX timestamp snippet with v.snapshot 4.7 sec
IPFIX timestamp snippet without v.snapshot 2.4 sec
Netflow9 timestamp snippet original 2.7 sec

Giving it some more thought, the gains are only 2 seconds for 1 million flows.
Like you already suggested this can't be causing of the performance difference.

Circling back to this issue... perhaps the performance discrepancy can be explained by the huge number of templates YAF sends? Every time a tpl comes in, we use a mutex to synchronise access to it. Which also applies to data packets because they need access to the template. This makes it a non-parallellizable operation, perhaps explaining your flat scaling curve.

I don't think the number of templates can be the issue. When we run the benchmark script, we see similarly poor performance as we do when running YAF outright. Yet, the benchmark script only sends the template once.

Hmm yes good point!