Azure/azure-storage-ruby

Ever growing set of Faraday objects

Closed this issue · 9 comments

Faraday instances built through HttpClient are never garbage collected as a reference is always retained (unless reset_agents! is explicitly called)

def agents(uri)
  key = uri.to_s
  @agents ||= {}
  unless @agents.key?(key)
    @agents[key] = build_http(uri)
  end
  @agents[key]
end

@agents should be restricted to a set number of keys to prevent memory leaks.

We are employing the following workaround:

# frozen_string_literal: true

require "azure/storage"
require "azure/storage/core/http_client"

module AzureStorageMemoryLeakFix
  # https://github.com/Azure/azure-storage-ruby/blob/df9c2cba646cb4685b93c4cb6462e17efa70529b/common/lib/azure/storage/common/core/http_client.rb#L32-L39
  # Azure caches a Faraday instance for every URI that is accessed,
  # leading to a memory leak.
  #
  # This keeps track of the URIs that are cached and limits it to
  # only MAX_FARADAY_INSTANCES instances.

  MAX_FARADAY_INSTANCES = 10

  def agents(uri)
    super.tap do
      clean_agents(uri)
    end
  end

  private

  def clean_agents(uri)
    key = uri.to_s # @agents is keyed by uri.to_s, so track the same string
    @cached_agents ||= []
    @cached_agents.delete(key) # avoid duplicate entries for repeated URIs
    @cached_agents.unshift(key)

    return if @cached_agents.size <= MAX_FARADAY_INSTANCES

    to_remove = @cached_agents.pop
    @agents.delete(to_remove)
  end
end

Azure::Storage::Core::HttpClient.prepend AzureStorageMemoryLeakFix

Thanks for reporting this issue.
It is very difficult for the SDK to know when to garbage collect the agents. However, instead of asking customers to use a workaround, we can expose an API that allows customers to call reset_agents! on demand.
This is marked as an enhancement of the SDK and logged in our backlog.
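For illustration only, once reset_agents! is public, usage might look roughly like this. The blob service setup follows the azure-storage-blob README, but exactly where reset_agents! would be surfaced is an assumption here, not a decided API:

require "azure/storage/blob"

# Sketch only: assumes the blob service object ends up exposing
# reset_agents! publicly (where it is surfaced is not decided yet).
blob_client = Azure::Storage::Blob::BlobService.create(
  storage_account_name: "myaccount",
  storage_access_key: "mykey"
)

%w[report-1.csv report-2.csv].each do |name|
  _blob, content = blob_client.get_blob("mycontainer", name)
  File.binwrite(name, content)
end

# Drop the cached Faraday instances once the batch is finished.
blob_client.reset_agents!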

Can I ask the benefit of keeping a separate Faraday instance for every URI?
Would it make more sense to keep a Faraday instance per host (which can be re-used for any paths on that host)?

As it stands, a user downloading 100 files will end up with 100 Faraday instances. Even with a public reset_agents! method, a user would have to find this issue to know that they should call it after every file (or once they are done downloading).

Would you mind explaining why it is necessary to cache Faraday instances at all?

Having done more digging into this, this is more severe than it first seems.
(I'm well out of my depth here, so I may have misunderstood some of this, but I can demonstrate it empirically.)

Faraday calls out to OpenSSL and loads the default system certificate store:

https://github.com/lostisland/faraday/blob/54c5c2e96c99d1f72aa1adc9d4634b14e52dedbf/lib/faraday/adapter/httpclient.rb#L112

Under the hood, this calls out to X509_STORE_set_default_paths. According to the docs, that function is misnamed and actually loads certificates:

X509_STORE_set_default_paths() is somewhat misnamed, in that it does not set what default paths should be used for loading certificates. Instead, it loads certificates into the X509_STORE from the hardcoded default paths.

These stores end up taking a significant amount of memory. While we hold Faraday references in this cache, we see memory usage grow very quickly (on the order of MB per request).
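A rough way to reproduce the cost outside of the SDK (the helper and the RSS measurement are mine, and this assumes a Unix-like system where ps reports resident set size):

# Not SDK code: a small reproduction of the per-connection cert store cost.
# Each iteration mimics a new connection loading the default system
# certificates, and the array keeps references like the agents cache does.
require "openssl"

def rss_kb
  # Unix-specific: resident set size of the current process in KB.
  `ps -o rss= -p #{Process.pid}`.to_i
end

puts "before: #{rss_kb} KB"

stores = Array.new(100) do
  OpenSSL::X509::Store.new.tap(&:set_default_paths)
end

puts "after 100 stores: #{rss_kb} KB (#{stores.size} stores retained)"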

Maintaining a connection pool is generally a good approach for an HTTP client, and the current implementation is a crude one that does not consider upper limits. Pooling allows connections to be reused, since acquiring a new connection from the system is a comparatively expensive operation. It is also the approach taken in Rails and other market-leading competitors' SDKs.
The ideal approach is to use at least an LRU cache to maintain the connection pool rather than a plain queue, since a queue gives no guarantee that the popped entry is the least recently used one.
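For illustration only, an LRU-bounded variant of the agents cache could look roughly like this (the module and constant names are made up; it relies on Ruby hashes preserving insertion order and on being prepended into the class that defines build_http):

# Rough sketch of an LRU-bounded agents cache (names are illustrative).
# Relies on Ruby Hash preserving insertion order: re-inserting a key on
# each access keeps the most recently used entries at the end.
module LruAgentsCache
  MAX_AGENTS = 10

  def agents(uri)
    key = uri.to_s
    @agents ||= {}

    if @agents.key?(key)
      # Touch the entry so it becomes the most recently used.
      @agents[key] = @agents.delete(key)
    else
      @agents[key] = build_http(uri)
      # Evict the least recently used entries once over the limit.
      @agents.shift while @agents.size > MAX_AGENTS
    end

    @agents[key]
  end
end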

BTW, loading certs for each connection seems to be a bug in Faraday; could you also open an issue in their repo to track this?

Maintaining a connection pool is generally a good approach for an HTTP client, and the current implementation is a crude one that does not consider upper limits. Pooling allows connections to be reused, since acquiring a new connection from the system is a comparatively expensive operation.

This makes sense - thank you - and confirms that it is an optimisation only (there's no requirement to keep connections)

It is also the approach taken in Rails and other market-leading competitors' SDKs.

I'm not aware of an HTTP connection pool in Rails, but I guess the equivalent would be Rails' ActiveRecord::ConnectionAdapters::ConnectionPool. The significant difference is that Rails maintains a pool of connections per host, while the current implementation here appears to be per URI (i.e. equivalent to adding a new connection to the pool for each query).

I feel a better approach might be to maintain a connection for each host, rather than for each URI?
This would allow us to reuse each connection more often, giving the intended speed improvement and reducing the number of objects we hold references to. Given the number of hosts we're likely to be connecting to here, we could probably keep the existing naive implementation?

Something along the lines of:

def agents(uri)
  uri = URI.parse(uri) unless uri.is_a? URI
  key = uri.host
  @agents ||= {}
  unless @agents.key?(key)
    @agents[key] = build_http(uri)
  end
  @agents[key]
end

Is this something you'd consider a PR for?

loading certs for each connection seems to be a bug in Faraday; could you also open an issue in their repo to track this?

I feel like this may be correct behaviour - I'd expect Faraday::Connection to be self-contained.
I'll open an issue there regardless for discussion - it is a rather large thing to be loading each time.

^ Faraday issue opened for discussion 🙂

I guess another option would be to use the cert_store option in Faraday to make sure we only have a single certificate store that is shared between all of azure-storage-ruby's connections?
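Roughly, that could mean building one store up front and handing it to Faraday's ssl options wherever build_http constructs a connection. The constant name and URL below are only illustrative, and whether the adapter the SDK pins actually honours the option would need verifying:

# Illustrative only: one certificate store shared by every connection.
require "faraday"
require "openssl"

SHARED_CERT_STORE = OpenSSL::X509::Store.new.tap(&:set_default_paths)

# Wherever build_http creates the Faraday connection, pass the shared store
# through the ssl options instead of letting each adapter load its own.
conn = Faraday.new(url: "https://myaccount.blob.core.windows.net",
                   ssl: { cert_store: SHARED_CERT_STORE }) do |f|
  f.adapter Faraday.default_adapter
end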

I feel a better approach might be to maintain a connection for each host, rather than for each URI?

This is actually a good solution at the moment, and we welcome you to do a PR for this code change.

I guess another option would be to use the cert_store option in Faraday to make sure we only have a single certificate store that is shared between all of azure-storage-ruby's connections?

This only resolves the overhead introduced by cert loading, not the unlimited number of connections. It is a mitigation rather than a solution, but it is still worth optimizing.

Thanks for the PR; we released this in azure-storage-common 2.0.1.