kubernetes-sigs/aws-iam-authenticator

heptio-authenticator occasionally returns Unauthorized when used in a Kubernetes Go client

gprasad84 opened this issue · 16 comments

Issue:

I have a sample HTTP server that listens on /. When / is called, the handler authenticates with the EKS cluster using heptio-authenticator-aws and returns to the user the number of pods running in that cluster. For authentication I use the Kubernetes Go client.
Also note that for every request on / I create a new Kubernetes config, so there is no config reuse between requests.

I observed that after ~15 minutes the EKS authentication fails with an Unauthorized error, which is unexpected: since I generate a new Kubernetes config object for every request, a fresh token should be generated for every request as well.
Even more confusing, the request to / immediately after the unauthorized one succeeds again.

Below are a snippet of my code and the corresponding logs.

  1. First request: success
    curl -v -s localhost:9000/
    log output:
There are 2 pods in the cluster
  2. Second request: success
    curl -v -s localhost:9000/
    log output:
There are 2 pods in the cluster
  3. Third request after ~15 min: failed
    curl -v -s localhost:9000/
    log output:
2018/08/10 14:10:27 http: panic serving [::1]:63678: Unauthorized
goroutine 11 [running]:
  4. Fourth request: success
    curl -v -s localhost:9000/
    log output:
There are 2 pods in the cluster
package main

import (
  "fmt"
  "log"
  "net/http"

  metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
  "k8s.io/client-go/kubernetes"
  restclient "k8s.io/client-go/rest"
  capi "k8s.io/client-go/tools/clientcmd/api"
)

var endpoint = "REDACTED"

// getClientSet builds a fresh rest.Config (and Clientset) on every call,
// using heptio-authenticator-aws as the exec credential provider.
func getClientSet() *kubernetes.Clientset {
  awsenv := capi.ExecEnvVar{
    Name:  "AWS_PROFILE",
    Value: "dv",
  }
  config := &restclient.Config{
    Host: endpoint,
    TLSClientConfig: restclient.TLSClientConfig{
      CAFile: "REDACTED",
    },
    ExecProvider: &capi.ExecConfig{
      Command:    "heptio-authenticator-aws",
      Args:       []string{"token", "-i", "dev-v1"},
      APIVersion: "client.authentication.k8s.io/v1alpha1",
      Env:        []capi.ExecEnvVar{awsenv},
    },
  }
  clientset, err := kubernetes.NewForConfig(config)
  if err != nil {
    panic(err.Error())
  }
  return clientset
}

// getClusterCredentials handles every request to / by creating a new
// clientset and listing the pods in the "local" namespace.
func getClusterCredentials(w http.ResponseWriter, r *http.Request) {
  clientset := getClientSet()
  pods, err := clientset.CoreV1().Pods("local").List(metav1.ListOptions{})
  if err != nil {
    panic(err.Error())
  }
  fmt.Printf("There are %d pods in the cluster\n", len(pods.Items))
  fmt.Fprintf(w, "There are %d pods in the cluster\n", len(pods.Items))
}

func main() {
  http.HandleFunc("/", getClusterCredentials)
  log.Fatal(http.ListenAndServe(":9000", nil))
}

I believe my app is experiencing something similar. Some noteworthy points from what I'm observing:

  • IAM role authentication doesn't exhibit this flakiness; only IAM user authentication does
  • We also observe a ~15-minute window after which authentication fails
  • We also observe that one or two calls after an authentication failure, requests succeed again

Let me know if there's anything else that would be helpful to provide, although the code here seems to capture what we observe quite well!

Yes, very strange. For now my workaround is to retry the request when I get an Unauthorized response, and it has been working alright so far.

// Retry the List call once if it fails with an Unauthorized error
// (this snippet assumes the "strings" import).
var done bool
for i := 0; i < 2; i++ {
  pods, err = clientset.CoreV1().Pods("local").List(metav1.ListOptions{})
  if err != nil {
    if strings.Contains(err.Error(), "Unauthorized") {
      fmt.Printf("%v", err.Error())
      continue
    }
    break
  }
  done = true
  break
}
if !done {
  panic(err.Error())
}

Since you are implementing the client, you need to try again once you get unauthorized. You can see client-go do it here: https://github.com/kubernetes/kubernetes/pull/59495/files#diff-0861e1cd492e078057f9ec7524d9d6ffR182.

@nckturner Not sure I quite follow, what part of the client is "implemented" here? Seems like it's using client-go.

Ah sorry. Not paying attention. I find it interesting that you see a difference between role and user authentication... Let me see if I observe this as well.

Thanks! @gprasad84 could you report whether your case reproduces the issue with user, role, or both types of authn? That would be a helpful data point as well.

@nckturner just wondering if you had a chance to look into this and/or were able to observe this. Any help would be greatly appreciated!

@wesleyk Not yet, hoping to get to it by the end of the week!

Are there any updates on this issue?

@nckturner ^ any updates to provide?

@wesleyk @sagikazarmark Sorry for the lag on this issue, I was out for some time. After revisiting this, let me explain what I think you are seeing. The token that is generated initially is cached by client-go in a global cache keyed by the exec config, and it stays there until it expires after 15 minutes. Once expired, it is still sent to the API server, which responds with a 401. Client-go refreshes the credential at that point but still returns the 401, which is the Unauthorized you are seeing. On the next invocation, the newly refreshed credential is used without error. This is by design. Since your implementation is a web server, the cache persists for the lifetime of the server, which is why creating a new config per request does not help. This doesn't explain the differences between users and roles, though. The repro that I have is currently using a role.
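
To make that sequence concrete, here is a minimal, self-contained sketch of the behavior described above. It is not client-go's actual code; the cache key, names, and durations are illustrative assumptions. The token lives in a process-global cache keyed by the exec config, an expired entry is still sent once, the resulting 401 triggers the refresh, and only the next call uses the fresh token, which is why a brand-new rest.Config per request still hits the same cache entry.

package main

import (
  "fmt"
  "sync"
  "time"
)

// cachedToken models a credential produced by the exec plugin plus the time
// at which the client considers it expired (~15 minutes after issuance).
type cachedToken struct {
  token   string
  expires time.Time
}

// tokenCache is process-global and keyed by the exec config, so building a
// new rest.Config per request still maps to the same cache entry.
var (
  mu         sync.Mutex
  tokenCache = map[string]*cachedToken{}
)

// runExecPlugin stands in for invoking heptio-authenticator-aws.
func runExecPlugin(execConfigKey string) *cachedToken {
  return &cachedToken{
    token:   fmt.Sprintf("token-%d", time.Now().UnixNano()),
    expires: time.Now().Add(15 * time.Minute),
  }
}

// doRequest mimics the lazy-refresh behavior: an expired cached token is still
// sent once, the 401 triggers a refresh, and only the *next* call succeeds.
func doRequest(execConfigKey string) error {
  mu.Lock()
  tok, ok := tokenCache[execConfigKey]
  if !ok {
    tok = runExecPlugin(execConfigKey)
    tokenCache[execConfigKey] = tok
  }
  mu.Unlock()

  if time.Now().After(tok.expires) {
    // The API server rejects the stale token; refresh it for next time,
    // but surface the 401 to the caller anyway.
    mu.Lock()
    tokenCache[execConfigKey] = runExecPlugin(execConfigKey)
    mu.Unlock()
    return fmt.Errorf("Unauthorized")
  }
  return nil // token accepted
}

func main() {
  key := `{"command":"heptio-authenticator-aws","args":["token","-i","dev-v1"]}`
  fmt.Println(doRequest(key)) // <nil>: fresh token
  // Simulate the token aging past its 15-minute lifetime.
  tokenCache[key].expires = time.Now().Add(-time.Minute)
  fmt.Println(doRequest(key)) // Unauthorized: stale token sent, then refreshed
  fmt.Println(doRequest(key)) // <nil>: refreshed token used
}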

There is another scenario (in EKS) where the token lasts for 21 minutes instead of 15 minutes, due to caching on the API server side. Could this explain the differences you were seeing between users and roles?

ah-ha! We don't use role-based authentication on our server-side, so that would explain the difference in behavior on our end.

Thanks for the help here! It would be nice if client-go recognized that the token has expired and eagerly refreshed it. Is the idea that client-go is agnostic to the expiration time?

No, client-go does recognize the expiration time; we just don't pass it along. I'll take a stab at a PR for it though!
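
For context, "passing it along" happens at the exec-plugin boundary: the plugin writes an ExecCredential to stdout, and its status can carry an expirationTimestamp that client-go uses to refresh the cached token proactively instead of waiting for a 401. The sketch below is not the authenticator's actual code; the field names follow the client.authentication.k8s.io/v1alpha1 ExecCredential shape, while the token value and expiry are placeholders.

package main

import (
  "encoding/json"
  "os"
  "time"
)

// execCredential mirrors the shape of the ExecCredential object that an exec
// plugin writes to stdout. Only the fields relevant here are included.
type execCredential struct {
  APIVersion string `json:"apiVersion"`
  Kind       string `json:"kind"`
  Status     struct {
    Token               string `json:"token"`
    ExpirationTimestamp string `json:"expirationTimestamp,omitempty"`
  } `json:"status"`
}

func main() {
  cred := execCredential{
    APIVersion: "client.authentication.k8s.io/v1alpha1",
    Kind:       "ExecCredential",
  }
  cred.Status.Token = "k8s-aws-v1.EXAMPLE" // placeholder for the presigned token
  // Including expirationTimestamp lets the client expire its cached copy and
  // re-run the plugin before sending a stale token to the API server
  // (here set just inside the token's ~15-minute lifetime).
  cred.Status.ExpirationTimestamp = time.Now().Add(14 * time.Minute).UTC().Format(time.RFC3339)

  _ = json.NewEncoder(os.Stdout).Encode(cred)
}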

This is still happening for me. I get it when using a role, not a user. It happens fairly intermittently, and retrying after a few seconds seems to work. The specific error is 403 Forbidden "could not get token: AccessDenied: Access denied". Any help would be appreciated, because this is currently making our test builds flaky.

@nakulpathak3 are you on v0.4.0-alpha.1? We've stopped seeing any sort of flaky auth errors since updating to the release with @nckturner's change

Yeah, I just realized that we're not, so I'm trying to upgrade right now to see if it fixes it. I'll report back. Thanks for getting back so quickly! @wesleyk