Ingesting IPFIX Flows From YAF Oddly Slow

I've been benchmarking the netflow codec for a tool that stitches IPFIX/ netflow data together into bi-directional sessions. The codec seems to run well when ingesting data from SonicWall and Cisco devices. However, the codec runs oddly slow when ingesting flows produced by YAF.

Test Machine Specifications:

DigitalOcean CPU Optimized Instance
32GB of RAM
16 vCPUs
16 udp workers
64MB UDP buffer (rmem_max)
LS_JAVA_OPTS=-Xms8g -Xmx8g
Null logstash output
logstash-codec-netflow version 4.1.0
logstash version 6.2.4
XPack Monitoring

Known benchmarks:

netflow_bench_cisco_asr.py: 18200 flows/ second
netflow_bench_cisco_asa.py: 11500 flows/ second

First, I ran yaf with the following command: yaf --uniflow --in test.pcap --out my.logstash.ip --ipfix-port 2055 --ipfix udp and saw that I was only processing 3000 flows/ second.

I did some packet captures and saw that about one fifth of the data was templates. In order to make a better comparison to known benchmarks, I created a similar script (had to change file extension to txt).

With the script, the codec manages 4000 flows/ second.

Running yaf with --silk doesn't seem to dramatically change the performance characteristics. I didn't create a test script for silk style flows, but when running the tool, adding --silk resulted in 200 less flows/ second on average.

I know it is not IPFIX specific since I created a similar test script (can be provided) for flows based on SonicWall's IPFIX implementation. The SonicWall IPFIX benchmark managed 17500 flows/ second.

I know it is not template-processing related since the benchmark only sends the template once.

The YAF flows encode timestamps as flowStartMilliseconds which need a simple conversion to iso8601, while the sonicwall flows only include systemInit relative timestamps, so they don't need a conversion. I thought this may be the cause, but netflow v9 also requires the timestamp conversion on the cisco data sets. The cisco data sets perform fine.

The YAF flows do have a subTemplateList but that should be efficiently handled since the field is labeled :skip in the definitions file.

I thought the data size of each flow may affect the speed at which the flows are processed. However, the YAF benchmark bundles 25 flows per packet with an overall packet size of 1415 bytes, while the cisco asr benchmark bundles 21 flows per packet with an overall packet size of 1392 bytes. The cisco asr actually sends larger flows (~66 bytes) to logstash than YAF does (~56 bytes). Yet, YAF is processed much slower.

The YAF flows do include an enterprise field, but the field flowAttributes is well accounted for in the definitions file.

I've looked through the code and I can't work out what is causing the slow down. However, I'm not too good at ruby to be honest.

The script also includes the wireshark descriptions of the template and data packets.

After more benchmarking, it seems as if processing YAF flows doesn't scale vertically.

The following was captured using Digital Ocean CPU-Optimized Droplets while running MongoDB, Logstash, Elasticsearch, and Kibana. Elasticsearch and Kibana were needed in order to collect the data.

The system was benchmarked with the scripts here as well as with two yaf commands run directly on a sample pcap file.

The system caps out at around 13000 flows/ second due to the MongoDB output plugin and database in the pipeline.

codec version	RAM	-xmx -xms GB	udp buffer MB	vCPU	Workers	CPU Optimized	Mongo	YAF flows /sec	sonicWALL IPFIX flows/ sec	sonicWALL v9 flows/ sec	Cisco ASA flows/ sec	Cisco ASR flows/ sec	yaf sample.pcap --no-tombstone --no-stats --silk flows/ sec	yaf sample.pcap --no-tombstone --no-stats flows/ sec
4.1.0	32	8	64	16	16	1	1	4000	13250	11500	11500	13500	2850	3000
4.1.0	16	6	64	8	8	1	1	3600	10000	6750	11100	11750	2500	2800
4.1.0	16	6	64	6*	6	1	1	3500	8700	5600	10000	12900	2500	2650
4.1.0	16	6	64	4*	4	1	1	3100	6200	3750	6800	9200	2100	2350
4.1.0	16	6	64	2*	2	1	1	2350	3200	1900	3600	4750	1550	1650

* CPU limited via docker cpu scheduler on 8 core machine

EDIT: Added more details for each benchmark

	YAF flows /sec	sonicWALL IPFIX flows/ sec	sonicWALL v9 flows/ sec	Cisco ASA flows/ sec	Cisco ASR flows/ sec
Flows/ packet	25	5	5	14	21
Packet Size	1415	285	289	1452	1392
Rough Avg Flow Size	56.6	57	57.8	103.7142857	66.28571429

The YAF flows use flowStartMilliseconds (152) and flowEndMilliseconds (153) fields (the Cisco flows don't use them) so take a slightly other path in the code:

logstash-codec-netflow/lib/logstash/codecs/netflow.rb

Lines 382 to 397 in f131bcf

    
           case k.to_s 
        
           when /^flow(?:Start|End)Seconds$/ 
        
             event[@target][k.to_s] = LogStash::Timestamp.at(v.snapshot).to_iso8601 
        
           when /^flow(?:Start|End)(Milli|Micro|Nano)seconds$/ 
        
             case $1 
        
             when 'Milli' 
        
               event[@target][k.to_s] = LogStash::Timestamp.at(v.snapshot.to_f / 1_000).to_iso8601 
        
             when 'Micro', 'Nano' 
        
               # For now we'll stick to assuming ntp timestamps, 
        
               # Netscaler implementation may be buggy though: 
        
               # https://bugs.wireshark.org/bugzilla/show_bug.cgi?id=11047 
        
               # This only affects the fraction though 
        
               ntp_seconds = (v.snapshot >> 32) & 0xFFFFFFFF 
        
               ntp_fraction = (v.snapshot & 0xFFFFFFFF).to_f / 2**32 
        
               event[@target][k.to_s] = LogStash::Timestamp.at(Time.utc(1900,1,1).to_i + ntp_seconds, ntp_fraction * 1000000).to_iso8601 
        
             end

Thanks for looking at this!

A very similar code path is taken for the Cisco v9 however.

logstash-codec-netflow/lib/logstash/codecs/netflow.rb

Lines 297 to 302 in f131bcf

    
           when SWITCHED 
        
             millis = flowset.uptime - v 
        
             seconds = flowset.unix_sec - (millis / 1000) 
        
             # v9 did away with the nanosecs field 
        
             micros = 1000000 - (millis % 1000) 
        
             event[@target][k.to_s] = LogStash::Timestamp.at(seconds, micros).to_iso8601

In the YAF IPFIX case, we have a division and a timestamp conversion (line 388). In the netflow case, we have more arithmetic operations and a timestamp conversion. Nonetheless, the netflow data scales well.

I've created a new benchmark script that compares the two. https://github.com/logstash-plugins/logstash-codec-netflow/blob/master/spec/codecs/benchmarks/flowStartMilliseconds.rb

From that benchmark it seems v.snapshot is solely responsible for the difference in performance.
Other differences like .to_f, or the regex matches on the field names don't seem to contribute to the difference.
It runs each above (simplified) snippet 2 million times,

implementation	result
IPFIX timestamp snippet with `v.snapshot`	4.7 sec
IPFIX timestamp snippet without `v.snapshot`	2.4 sec
Netflow9 timestamp snippet original	2.7 sec

Giving it some more thought, the gains are only 2 seconds for 1 million flows.
Like you already suggested this can't be causing of the performance difference.

Circling back to this issue... perhaps the performance discrepancy can be explained by the huge number of templates YAF sends? Every time a tpl comes in, we use a mutex to synchronise access to it. Which also applies to data packets because they need access to the template. This makes it a non-parallellizable operation, perhaps explaining your flat scaling curve.

I don't think the number of templates can be the issue. When we run the benchmark script, we see similarly poor performance as we do when running YAF outright. Yet, the benchmark script only sends the template once.

Hmm yes good point!

	case k.to_s
	when /^flow(?:Start\|End)Seconds$/
	event[@target][k.to_s] = LogStash::Timestamp.at(v.snapshot).to_iso8601
	when /^flow(?:Start\|End)(Milli\|Micro\|Nano)seconds$/
	case $1
	when 'Milli'
	event[@target][k.to_s] = LogStash::Timestamp.at(v.snapshot.to_f / 1_000).to_iso8601
	when 'Micro', 'Nano'
	# For now we'll stick to assuming ntp timestamps,
	# Netscaler implementation may be buggy though:
	# https://bugs.wireshark.org/bugzilla/show_bug.cgi?id=11047
	# This only affects the fraction though
	ntp_seconds = (v.snapshot >> 32) & 0xFFFFFFFF
	ntp_fraction = (v.snapshot & 0xFFFFFFFF).to_f / 2**32
	event[@target][k.to_s] = LogStash::Timestamp.at(Time.utc(1900,1,1).to_i + ntp_seconds, ntp_fraction * 1000000).to_iso8601
	end

	when SWITCHED
	millis = flowset.uptime - v
	seconds = flowset.unix_sec - (millis / 1000)
	# v9 did away with the nanosecs field
	micros = 1000000 - (millis % 1000)
	event[@target][k.to_s] = LogStash::Timestamp.at(seconds, micros).to_iso8601