logstash-plugins/logstash-codec-netflow

Memory leak in Netflow::TemplateRegistry when cache_save_path parameter is not provided

123joshuawu opened this issue · 0 comments

Logstash information:

Version: 7.16.3

JVM (e.g. java -version):

openjdk version "1.8.0_342"
OpenJDK Runtime Environment (build 1.8.0_342-8u342-b07-0ubuntu1~18.04-b07)
OpenJDK 64-Bit Server VM (build 25.342-b07, mixed mode)

OS version (uname -a if on a Unix-like system):

Description of the problem including expected versus actual behavior:

TLDR: if cache_save_path is not provided, Netflow::TemplateRegistry does not call do_cleanup which is in charge of cleaning up the Vash memory caches.

In our testing, logstash heap memory usage would continually increase until it would crash with an out of memory exception. This would happen around every four hours in our environment.

Comparing heap dumps within those four hours, we noticed the memory usage of an object grow over 4x. (Right side is baseline, left is dump from oom crash)
Screen Shot 2023-03-20 at 10 11 38 AM

Opening up the object, we can determine the class name from the metadata.
Screen Shot 2023-03-19 at 2 44 33 PM

We trace this back to the corresponding source code:

class TemplateRegistry
##
# @param logger [Logger]
# @param ttl [Integer]
# @param file_path [String] (optional)
def initialize(logger, ttl, file_path=nil)
@logger = logger
@ttl = Integer(ttl)
@file_path = file_path
@mutex = Mutex.new
@bindata_struct_cache = Vash.new
@bindata_spec_cache = Vash.new
do_load unless file_path.nil?
end

In the heap dump screenshot var2 and var5 correspond with the two instances of Vash used in the TemplateRegistry. From our testing, the memory usage of these two objects were continuously growing.

Looking at the Vash implementation, we can see that it requires a manual cleanup call in order to release memory.
https://gist.github.com/joshaven/184837

The Vash object will forget any answer that is requested after the specified
TTL. It is a good idea to manually clean things up from time to time because
it is possible that you'll cache data but never again access it and therefor
it will stay in memory after the TTL has expired. To clean up the Vash object,
call the method: cleanup!

In TemplateRegistry, the cleanup call for both Vash objects are made in the TemplateRegistry::do_cleanup method.

##
# @see `TemplateRegistry#cleanup`
# @api private
def do_cleanup!
@bindata_spec_cache.cleanup!
@bindata_struct_cache.cleanup!
end

do_cleanup is then only ever called in do_persist

def do_persist
return if file_path.nil?
logger.debug? and logger.debug('Writing templates to template cache', :file_path => file_path)
fail('Template Cache not writable') if File.exists?(file_path) && !File.writable?(file_path)
do_cleanup!
templates_cache = @bindata_spec_cache
File.open(file_path, 'w') do |file|
file.write(templates_cache.to_json)
end
rescue Exception => e
logger.error('Template Cache could not be saved', :file_path => file_path, :exception => e.message)
end

However, note that on line 644, if file_path is not provided, then the do_persist function exits early, hence skipping the call to do_cleanup.

file_path can then be traced back to the cache_save_path setting in the initialization of the TemplateRegistry.

@netflow_templates = TemplateRegistry.new(logger, @cache_ttl, @cache_save_path && "#{@cache_save_path}/netflow_templates.cache")
@ipfix_templates = TemplateRegistry.new(logger, @cache_ttl, @cache_save_path && "#{@cache_save_path}/ipfix_templates.cache")

Thus we can see that this situation happens when a value is not provided for cache_save_path, setting file_path to nil by default causing do_cleanup to always get skipped.

Steps to reproduce:

Provide logs (if relevant):