keikoproj/instance-manager

Memory leak in pagination of DescribeInstanceTypes

eytan-avisror opened this issue · 4 comments

We make the call DescribeInstanceTypes on every reconcile, but cache the result for up to 24hrs.
The payload of this call is very large, usually comes back in 3-4 full pages, and hardly ever changes over time (only when new instance types are added or existing instance types are modified).

It seems our caching library aws-sdk-go-cache is leaking the paginated calls, and in this case causing a memory leak (because the call always paginates and has a massive payload).

The caching library creates a cache per operation, e.g. ec2.DescribeInstanceTypes gets its own cache, and stores the data with request.Params as the key and response.Body as the value, meaning that a call with input X will always have output Y.

Since, with pagination, subsequent calls reference NextToken to access the next pages, and NextToken is always a random string, we keep leaking these pages in the cache until the pod eventually gets OOMKilled.
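To make the leak concrete, here is a minimal, illustrative sketch (not the library's actual code) of a per-operation cache keyed by serialized request params, and the entries left behind by two refresh cycles of a 3-page call:

```go
package main

import (
	"fmt"
	"time"
)

// entry is a cached response body plus the time it was stored.
type entry struct {
	body     string
	storedAt time.Time
}

// opCache mimics a per-operation cache keyed by serialized request params.
// It illustrates the keying scheme only; it is not aws-sdk-go-cache itself.
type opCache map[string]entry

func (c opCache) put(paramsKey, body string) { c[paramsKey] = entry{body, time.Now()} }

func main() {
	cache := opCache{}

	// First refresh cycle: AWS returns page tokens x and y.
	cache.put(`{"NextToken":null}`, `page 1, NextToken: x`)
	cache.put(`{"NextToken":"x"}`, `page 2, NextToken: y`)
	cache.put(`{"NextToken":"y"}`, `page 3`)

	// Next refresh cycle: AWS hands out fresh tokens a and b, so only the
	// first-page key repeats; the old token-keyed pages linger.
	cache.put(`{"NextToken":null}`, `page 1, NextToken: a`)
	cache.put(`{"NextToken":"a"}`, `page 2, NextToken: b`)
	cache.put(`{"NextToken":"b"}`, `page 3`)

	fmt.Println("entries:", len(cache)) // 5, not 3 - two stale pages leaked
}
```

Only the first-page key repeats across cycles; the token-keyed pages from the previous cycle are never requested again, so nothing ever overwrites them.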

A fix for pagination in the caching library is unlikely, since we cannot uniquify the parameters of paginated calls.

This leaves the following options:

  • Check in a file with the payload - we would have to re-validate every release whether anything has changed.
  • Make the call at controller load time - this would force all users to make this call regardless of whether they are using the eks provisioner or another one, which may be unnecessary.
  • Find a way to avoid pagination when making the call by using filters (see the sketch after this list).
  • Come up with an innovative fix for the pagination problem in https://github.com/keikoproj/aws-sdk-go-cache.
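For the filtering option, here is a hedged sketch of what a narrowed call could look like with the v1 SDK. The "current-generation" filter and whether any filter combination actually keeps the response within a single page are assumptions that would need testing:

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	sess := session.Must(session.NewSession())
	svc := ec2.New(sess)

	out, err := svc.DescribeInstanceTypes(&ec2.DescribeInstanceTypesInput{
		// Only current-generation instance types; older families are dropped
		// from the payload, which shrinks the number of pages.
		Filters: []*ec2.Filter{
			{
				Name:   aws.String("current-generation"),
				Values: aws.StringSlice([]string{"true"}),
			},
		},
		MaxResults: aws.Int64(100), // largest page size the API allows
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("got %d instance types, NextToken set: %v\n",
		len(out.InstanceTypes), out.NextToken != nil)
}
```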

Either way, we should also reduce the cache MaxSize to avoid blowing up the memory. Currently it's set to 5000 items with 500 pruned when it reaches the max; we should probably go with a much smaller number, e.g. 500/50 or even 300/30. Since MaxSize applies per operation cache and TTLs already apply, these numbers should be appropriate (should test).
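As a purely hypothetical illustration (assuming the per-operation cache is backed by an LRU cache like github.com/karlseguin/ccache, which is only a guess based on the matching 5000/500 defaults), tightening the limits would look roughly like:

```go
package main

import "github.com/karlseguin/ccache"

func main() {
	// Cap each per-operation cache at 500 items and prune 50 when the limit
	// is hit, instead of the 5000/500 defaults. Assumed backend, not
	// confirmed from the aws-sdk-go-cache source.
	c := ccache.New(ccache.Configure().MaxSize(500).ItemsToPrune(50))
	defer c.Stop()
}
```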

(Two screenshots attached, 2021-03-29.)

Does this sound like a bug in general for https://github.com/keikoproj/aws-sdk-go-cache in that it doesn't properly cache paginated requests?


That's exactly what it is. The problem is that it's more of a design flaw, and I cannot seem to find a solution for making paginated calls not leak, since the NextToken provided by AWS is always random.

A potential solution, although it requires development, could be to switch to aws-sdk-go-v2, which allows for "middleware", so you could write your own cache handler (middleware) for this specific problem.

https://aws.github.io/aws-sdk-go-v2/docs/middleware/
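A rough sketch of what such a middleware hook could look like in aws-sdk-go-v2. The middleware name and the caching logic are placeholders; only the registration mechanics follow the SDK's documented API:

```go
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
	"github.com/aws/smithy-go/middleware"
)

// cachingMiddleware is a placeholder Initialize-step middleware; a real
// implementation would look up or populate a cache around next.HandleInitialize.
var cachingMiddleware = middleware.InitializeMiddlewareFunc("DescribeInstanceTypesCache",
	func(ctx context.Context, in middleware.InitializeInput, next middleware.InitializeHandler) (
		middleware.InitializeOutput, middleware.Metadata, error,
	) {
		// e.g. return a cached output here instead of calling next, or store
		// the output after the call returns.
		return next.HandleInitialize(ctx, in)
	})

func main() {
	cfg, err := config.LoadDefaultConfig(context.TODO(),
		config.WithAPIOptions([]func(*middleware.Stack) error{
			func(stack *middleware.Stack) error {
				return stack.Initialize.Add(cachingMiddleware, middleware.Before)
			},
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	_ = ec2.NewFromConfig(cfg) // client whose calls pass through the middleware
}
```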

OK, it seems we were able to come up with a reasonable solution here for a fix in the caching library.
We misunderstood the problem a bit: the caching part worked just fine - we could still access the extra pages via the cache, since the payload of the first call has the NextToken that leads you to the next cached page - the problem is actually with the TTL.

Whenever the TTL expires, the library only overwrites an item when the same key is set again. But after the first item expires and we get a new set of pages (with new tokens), the older pages are leaked and nothing can prune them, because their keys are never used again.

For example, let's say the TTL is 1 minute and we have the following cache (request params as key | response as value):

         { }          |   { NextToken: x }  |  # First page
  { NextToken: x }    |   { NextToken: y }  |  # Second page
  { NextToken: y }    |   { NextToken: z }  |  # Third page

After the TTL expires and a new call is made, the first page will be replaced, but extra pages will be added:

         { }          |   { NextToken: a }  |  # First page
  { NextToken: a }    |   { NextToken: b }  |  # Second page
  { NextToken: b }    |   { NextToken: c }  |  # Third page
  { NextToken: x }    |   { NextToken: y }  |  # Second page of previous call (leaked)
  { NextToken: y }    |   { NextToken: z }  |  # Third page of previous call (leaked)

Every TTL interval this keeps growing with extra pages that are never cleaned up.
So a good solution is actually a background delete of items with an expired TTL - if we add a goroutine that does this cleanup, we should not end up in situations where we blow up the memory.
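A minimal sketch of that background cleanup, using illustrative types rather than the library's actual ones:

```go
package main

import (
	"sync"
	"time"
)

// item is a cached value plus its expiry time.
type item struct {
	value     interface{}
	expiresAt time.Time
}

// expiringCache is an illustrative stand-in for a per-operation cache.
type expiringCache struct {
	mu    sync.Mutex
	items map[string]item
}

// startJanitor launches a goroutine that periodically deletes expired items,
// so pages keyed by NextTokens that will never be requested again get pruned.
func (c *expiringCache) startJanitor(interval time.Duration, stop <-chan struct{}) {
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				now := time.Now()
				c.mu.Lock()
				for k, it := range c.items {
					if now.After(it.expiresAt) {
						delete(c.items, k)
					}
				}
				c.mu.Unlock()
			case <-stop:
				return
			}
		}
	}()
}

func main() {
	c := &expiringCache{items: map[string]item{}}
	stop := make(chan struct{})
	c.startJanitor(time.Minute, stop) // sweep expired entries every minute
	// ... use the cache ...
	close(stop)
}
```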