Exception raised when running too many get requests
binarytemple opened this issue · 6 comments
Guessing that k8s is spawning the gke-gcloud-auth-plugin
command every time a get request is issued.
I'm running something like this:
with_services_added =
  ingress_list
  |> Enum.map(fn i ->
    Task.async(fn ->
      operation =
        K8s.Client.get("v1", "Service", namespace: i.namespace, name: i.service_name)

      {:ok, service} = K8s.Client.run(conn, operation)

      i
      |> Map.put(:selector, "|#{inspect(service["spec"]["selector"])}|")
      |> Map.put(:project, service["metadata"]["annotations"]["gitlab.binarytemple.com/projectname"])
    end)
  end)
  |> Enum.map(&Task.await/1)

CSV.encode(with_services_added, headers: true, delimiter: "") |> Enum.each(&IO.puts/1)
And receiving the following error message:
10:40:12.830 [error] GenServer #PID<27964.2401.0> terminating
** (stop) :emfile
:erlang.open_port({:spawn_executable, '/Users/bryanh/brew/share/google-cloud-sdk/bin/gke-gcloud-auth-plugin'}, [{:env, []}, :use_stdio, :exit_status, :binary, :hide, {:args, []}])
(elixir 1.14.3) lib/system.ex:1071: System.do_cmd/3
lib/k8s/conn/auth/exec.ex:87: K8s.Conn.Auth.Exec.generate_token/1
lib/k8s/conn/auth/exec.ex:69: K8s.Conn.RequestOptions.K8s.Conn.Auth.Exec.generate/1
lib/k8s/conn.ex:361: K8s.Conn.RequestOptions.K8s.Conn.generate/1
lib/k8s/discovery/driver/http.ex:58: K8s.Discovery.Driver.HTTP.http_get/2
lib/k8s/discovery/driver/http.ex:17: K8s.Discovery.Driver.HTTP.resources/3
#cell:bq4txyso2sm2jmyzennytxcbnv4rdczy:10: (file)
Last message: {:DOWN, #Reference<27964.1503083199.3897032710.249364>, :process, #PID<27964.2400.0>, {:emfile, [{:erlang, :open_port, [erl_child_setup: failed with error 32 on line 281
{:spawn_executable, '/Users/bryanh/brew/share/google-cloud-sdk/bin/gke-gcloud-auth-plugin'}, [{:env, []}, :use_stdio, :exit_status, :binary, :hide, {:args, []}]], [error_info: %{module: :erl_erts_errors}]}, {System, :do_cmd, 3, [file: 'lib/system.ex', line: 1071]}, {K8s.Conn.Auth.Exec, :generate_token, 1, [file: 'lib/k8s/conn/auth/exec.ex', line: 87]}, {K8s.Conn.RequestOptions.K8s.Conn.Auth.Exec, :generate, 1, [file: 'lib/k8s/conn/auth/exec.ex', line: 69]}, {K8s.Conn.RequestOptions.K8s.Conn, :generate, 1, [file: 'lib/k8s/conn.ex', line: 361]}, {K8s.Discovery.Driver.HTTP, :http_get, 2, [file: 'lib/k8s/discovery/driver/http.ex', line: 58]}, {K8s.Discovery.Driver.HTTP, :resources, 3, [file: 'lib/k8s/discovery/driver/http.ex', line: 17]}, {:elixir_eval, :__FILE__, 1, [file: '#cell:bq4txyso2sm2jmyzennytxcbnv4rdczy', line: 10]}]}}
There are 104 requests being fired off in quick succession/concurrently there, which I might expect to cause HTTP client failures, but what I suspect is happening is that 104 instances of the gke-gcloud-auth-plugin executable are being spawned.
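As a caller-side stopgap (not a fix for the library), capping concurrency bounds how many exec-plugin processes are alive at once. A minimal sketch using `Task.async_stream`, assuming the same `conn` and `ingress_list` as in the snippet above:

```elixir
# Workaround sketch: at most `max_concurrency` requests (and thus at most
# that many gke-gcloud-auth-plugin spawns) run at any given moment.
# `conn` and `ingress_list` are assumed from the snippet above.
with_services_added =
  ingress_list
  |> Task.async_stream(
    fn i ->
      operation =
        K8s.Client.get("v1", "Service", namespace: i.namespace, name: i.service_name)

      {:ok, service} = K8s.Client.run(conn, operation)

      i
      |> Map.put(:selector, "|#{inspect(service["spec"]["selector"])}|")
      |> Map.put(:project, service["metadata"]["annotations"]["gitlab.binarytemple.com/projectname"])
    end,
    max_concurrency: 8,
    timeout: 30_000
  )
  |> Enum.map(fn {:ok, row} -> row end)
```

This only masks the underlying problem (tokens are still regenerated per request), but it keeps the open-port count under the file-descriptor limit.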
Yeah, the Exec auth strategy doesn't do any caching, so it ends up spawning quite a lot of processes to generate tokens. We're running into a similar situation with generating tokens for EKS clusters.
FWIW, both the v1beta1 and v1 response have an (optional) expiration timestamp. However, it doesn't look like there's a really ergonomic way to maintain that state between calls.
I'm happy to add that functionality but would want a bit of guidance on how we'd want to do that and what would be acceptable before getting too far (e.g. ETS, GenServer state, etc).
Hey, thanks for the reply. I'm a long-time Erlang/Elixir dev. People get hung up on pure functions and the like, but what I typically see used for this kind of thing is an ETS table. Yes, that's global state, impure, etc., but Erlang is a pragmatic platform first and a functional one second. You might be tempted to take a pointcut approach and add a caching library, but honestly, for something like this, a public ETS table to store the tokens, read from whenever one is needed or invalidated, would do a fine job. If you pointed me at the appropriate place in the code path, I could whip up a PR for you next weekend if you liked.
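To make the ETS idea concrete, here is a minimal sketch of a public token table keyed by cluster, storing the token alongside its expiry. The module and function names (`TokenCache`, `fetch/2`) are illustrative only, not part of k8s's API:

```elixir
defmodule TokenCache do
  @moduledoc """
  Hypothetical sketch of the ETS approach: a public table holding
  {cluster, token, expires_at} rows. Any process can read; a stale or
  missing entry triggers regeneration via the supplied function.
  """

  @table :k8s_exec_tokens

  def init do
    # :public + read_concurrency so any caller process can read cheaply.
    :ets.new(@table, [:set, :public, :named_table, read_concurrency: true])
  end

  @doc "Return a cached token, or run `generate_fun` and cache its result."
  def fetch(cluster, generate_fun) do
    now = System.system_time(:second)

    case :ets.lookup(@table, cluster) do
      [{^cluster, token, expires_at}] when expires_at > now ->
        token

      _ ->
        # generate_fun is assumed to return {token, ttl_seconds},
        # e.g. by shelling out to the exec plugin once.
        {token, ttl} = generate_fun.()
        :ets.insert(@table, {cluster, token, now + ttl})
        token
    end
  end
end
```

One caveat of the bare-ETS version: two processes that race on an expired entry will both shell out; a single-writer process (as discussed below in the thread) avoids that.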
I would like to suggest a GenServer per K8s.Connection with a DynamicSupervisor for this first. We can always add an ETS table if the token read throughput is too slow.
Agents are best used when there's a single piece of state that hardly ever changes and we don't need specialized access. That doesn't fit here, since we will have token expiry at the very least.
GenServer is the default to reach for when we have state and need more. Additionally, K8s already spins up a dynamic supervisor and a GenServer per connection for adapters here. Following that, we can add a DynamicSupervisor for Auth to use. Then, we can start a process during AuthProvider.create/2. Using that process, we get a single writer even if the K8s.Connection is shared across processes. Additionally, we can schedule an early refresh (with jitter) of the token in a Task that sends the result back. Assuming the command being called is quick (within some configurable expected shell/exec time), no requests to Kubernetes should see the latency hit of shelling out.
Looking at the code, K8s.Conn.Auth.Azure and K8s.Conn.Auth.Certificate would also benefit from following this pattern, using HTTP and file IO respectively. K8s.Conn.Auth.Certificate can use this improvement the most, since certificates do expire and many cloud providers are issuing shorter and shorter certificate lifetimes. For high-security clusters, CA expiry times can get pretty extreme.
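The single-writer-with-early-refresh idea above can be sketched as a small GenServer. All names here (`AuthTokenServer`, `token/1`, the `{token, ttl_seconds}` contract of `generate_fun`) are hypothetical, not k8s's actual API:

```elixir
defmodule AuthTokenServer do
  @moduledoc """
  Sketch of a per-connection token server: one process is the single
  writer, and it refreshes the token early (with jitter) so callers
  rarely pay the exec/HTTP latency.
  """
  use GenServer

  # `generate_fun` returns {token, ttl_seconds}, e.g. by shelling out
  # to an exec plugin or making an HTTP call for Azure-style auth.
  def start_link(generate_fun),
    do: GenServer.start_link(__MODULE__, generate_fun)

  def token(pid), do: GenServer.call(pid, :token)

  @impl true
  def init(generate_fun) do
    {token, ttl} = generate_fun.()
    schedule_refresh(ttl)
    {:ok, %{fun: generate_fun, token: token}}
  end

  @impl true
  def handle_call(:token, _from, state), do: {:reply, state.token, state}

  @impl true
  def handle_info(:refresh, state) do
    {token, ttl} = state.fun.()
    schedule_refresh(ttl)
    {:noreply, %{state | token: token}}
  end

  # Refresh at 70-80% of the TTL; the random component is the jitter,
  # so many servers don't all regenerate at the same instant.
  defp schedule_refresh(ttl_seconds) do
    ttl_ms = ttl_seconds * 1_000
    delay = div(ttl_ms * 7, 10) + :rand.uniform(div(ttl_ms, 10) + 1)
    Process.send_after(self(), :refresh, delay)
  end
end
```

In a real implementation the refresh would likely run in a `Task` that sends the result back (as suggested above) so a slow exec call never blocks readers; it's done inline here to keep the sketch short.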
I have noticed this as well. I fully agree with @elliottneilclark's approach. We can refresh the token "automatically" in the GenServer some time before it expires. IMO it might also be enough to just check the expiry timestamp when using the token. If it is expired, it needs to be renewed. Yes, this one request will take longer. But I wonder if we ever even reach that point as these auth providers are mostly used locally. But you tell me.
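The check-on-use alternative is even simpler to sketch: no timer, just compare against the stored expiry inside the call, so only the one unlucky caller after expiry waits for regeneration. Again, all names here are hypothetical:

```elixir
defmodule LazyTokenServer do
  @moduledoc """
  Sketch of the check-expiry-on-use variant: the expiry timestamp is
  checked when the token is requested; a stale token is regenerated
  inline, making that single request slower.
  """
  use GenServer

  # `generate_fun` returns {token, ttl_seconds}.
  def start_link(generate_fun),
    do: GenServer.start_link(__MODULE__, generate_fun)

  def token(pid), do: GenServer.call(pid, :token)

  @impl true
  def init(generate_fun),
    do: {:ok, %{fun: generate_fun, token: nil, expires_at: 0}}

  @impl true
  def handle_call(:token, _from, state) do
    now = System.system_time(:second)

    state =
      if state.expires_at > now do
        state
      else
        {token, ttl} = state.fun.()
        %{state | token: token, expires_at: now + ttl}
      end

    {:reply, state.token, state}
  end
end
```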
That being said, is anybody willing to implement this?
We will. I'll start with the exec and put it up for review.
#302 is my start at this. There's still a lot for me or @JTarasovic to do before the entire feature is complete.