mspnp/aks-baseline

Flux CrashLoop and Cannot execute Kubectl Logs

scaswell-hirez opened this issue · 16 comments

Bug:
Flux fails to deploy properly

Expected Result:
Flux launches correctly and deploys the remainder of /cluster-manifests as described

Actual Result:
When deploying flux as described in https://github.com/mspnp/aks-secure-baseline/blob/main/06-gitops.md, it crashes on startup and enters a CrashLoopBackOff.

There is no indication as to why this is happening, and any attempt to retrieve logs from the pod with kubectl logs results in:

Error from server (InternalError): Internal error occurred: Authorization error (user=masterclient, verb=get, resource=nodes, subresource=proxy)

The same error occurs when trying to retrieve logs from the healthy memcached pod.

The same error appears in the live logs screen in the Azure portal.

Since flux is responsible for deploying everything else in /cluster-manifests this is something of a hard stop regarding the published baseline directions.

Reproduction:
Follow the steps to cleanly deploy the aks-baseline as written. The flux deployment fails every time.

Can you get the logs for flux from Log Analytics directly and let me know what the cause of the crash backoff is?

I don't know what's happening with the internal error on your proxy call, but you should still be able to access the logs via a direct container log query in Log Analytics. That'll help you triage the flux deployment crash. We haven't seen that one before, so it will be interesting to know what situation you're in that might be causing it.

So far, no. The only logs available in Log Analytics are from gatekeeper. I can only assume this is because flux is crashing before it even emits a log message.

Here is my Log Analytics query

let startTimestamp = ago(24h);
KubePodInventory
| where TimeGenerated > startTimestamp
| project ContainerID, PodName=Name
| distinct ContainerID, PodName
| join
(
    ContainerLog
    | where TimeGenerated > startTimestamp
)
on ContainerID
// At this point, before the next pipe, columns from both tables are available to project. Because both
// tables have a Name column, we aliased the column we actually want as PodName above.
| project TimeGenerated, PodName, LogEntry, LogEntrySource
| order by TimeGenerated desc
| where PodName contains "flux"

I'm open to any suggestions on how to diagnose what's happening with flux.

In my latest attempt the error message has changed when attempting to retrieve logs:

$> kubectl logs -p -n cluster-baseline-settings flux-58bc97776f-s7qwn

Error from server: Get "https://10.240.0.8:10250/containerLogs/cluster-baseline-settings/flux-58bc97776f-s7qwn/flux?previous=true": write unix @->/tunnel-uds/socket: write: broken pipe

If it's crashing before logs get emitted, then you can check KubeEvents as well to see if the crash shows up in there. If you kubectl describe (or maybe even kubectl get) the flux-58bc97776f-s7qwn pod, I would imagine you should see those events from that view as well.
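
For reference, something like this from your jump box should surface those events (just a sketch; the pod name is the one from your error output above and will change on each redeploy):

$> kubectl describe pod -n cluster-baseline-settings flux-58bc97776f-s7qwn
$> kubectl get events -n cluster-baseline-settings --sort-by=.lastTimestamp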

Last State: Terminated
Reason: Error
Exit Code: 1
Started: Tue, 16 Nov 2021 13:06:18 -0500
Finished: Tue, 16 Nov 2021 13:06:48 -0500

Unfortunately this doesn't seem very helpful in determining why it's crashing.

Name:         flux-58bc97776f-gjzxx
Namespace:    cluster-baseline-settings
Priority:     0
Node:         aks-npuser01-28140417-vmss000000/10.240.0.8
Start Time:   Tue, 16 Nov 2021 12:59:03 -0500
Labels:       app.kubernetes.io/name=flux
              pod-template-hash=58bc97776f
Annotations:  prometheus.io/port: 3031
Status:       Running
IP:           10.240.0.28
IPs:
  IP:           10.240.0.28
Controlled By:  ReplicaSet/flux-58bc97776f
Containers:
  flux:
    Container ID:  containerd://232cbfa5a1e46b322cc2bbf133fa08ef3efe46f94aed69a5a53169cf7ba57187
    Image:         acraksz34lbvktfjgby.azurecr.io/fluxcd/flux:1.21.1
    Image ID:      acraksz34lbvktfjgby.azurecr.io/fluxcd/flux@sha256:9d0879f2f1fd033051c0f02048b012e406277e198586462e66faafe862ff09da
    Port:          3030/TCP
    Host Port:     0/TCP
    Args:
      --git-url=https://github.com/scaswell-hirez/aks-secure-baseline-single-region.git
      --git-branch=main
      --git-path=cluster-manifests
      --git-readonly
      --sync-state=secret
      --listen-metrics=:3031
      --git-timeout=5m
      --registry-disable-scanning=true
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 16 Nov 2021 13:06:18 -0500
      Finished:     Tue, 16 Nov 2021 13:06:48 -0500
    Ready:          False
    Restart Count:  7
    Requests:
      cpu:        50m
      memory:     64Mi
    Liveness:     http-get http://:3030/api/flux/v6/identity.pub delay=5s timeout=5s period=10s #success=1 #failure=3
    Readiness:    http-get http://:3030/api/flux/v6/identity.pub delay=5s timeout=5s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /etc/fluxd/ssh from git-key (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qdx5d (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  git-key:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  flux-git-deploy
    Optional:    false
  kube-api-access-qdx5d:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              agentpool=npuser01
                             kubernetes.io/os=linux
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  8m5s  default-scheduler  Successfully assigned cluster-baseline-settings/flux-58bc97776f-gjzxx to aks-npuser01-28140417-vmss000000

You can try to see if there are logs using the container ID 232cbfa5a1e46b322cc2bbf133fa08ef3efe46f94aed69a5a53169cf7ba57187 -- I know your query tried to pick them up via a fuzzy name match.

Sorry you're having issues with this. We deployed this end-to-end last Thursday, so we know it works functionally, but maybe you're running into some case that we didn't test. That's why I appreciate you helping narrow down what's happening in your cluster, in case we can improve things in the walkthrough.

No results. I've redeployed flux again so the ContainerID has changed from above. This ran for the last 24 hours.

ContainerLog 
| where ContainerID == "81385a3adef9c4ea350495391e5ea4f6eaa8ab944bf44f032eab0bc90d928f37"
| order by TimeGenerated desc 
| project TimeGenerated, ContainerID, LogEntry

I appreciate the help in trying to track this down. I'm hoping to be able to get this implemented for a large upcoming project.

No logs, bummer. Are there any unhealthy pods in kube-system by chance? Just trying to figure out why you are having problems communicating with the cluster (kubectl logs) and why you're not getting content in your Log Analytics workspace. It seems like some network traffic is being blocked. You can also check your firewall logs to see if there is any unexpected Deny traffic in there.
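
If it helps, a quick spot check from the jump box could look like this (just a sketch; look for anything not Running/Ready, especially the monitoring agent and tunnel pods):

$> kubectl get pods -n kube-system -o wide
$> kubectl get nodes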

Solved it!! Per step 3 in https://github.com/mspnp/aks-secure-baseline/blob/main/05-aks-cluster.md

Deploy the cluster ARM template.
โ— By default, this deployment will allow unrestricted access to your cluster's API Server. 
You can limit access to the API Server to a set of well-known IP addresses 
(i.e., a jump box subnet (connected to by Azure Bastion), build agents, or any other networks
you'll administer the cluster from) by setting the clusterAuthorizedIPRanges parameter in all 
deployment options.

I included a list of IPs to restrict traffic to our office. Apparently this also blocks traffic from inside the cluster to the API Server, which makes sense upon review. I opened it up in an effort to determine what was going on, and all of a sudden everything started working. By authorizing 10.240.0.0/16, the address space reserved for the cluster nodes, I can restrict public access to the API Server while still allowing the cluster to communicate.
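
For anyone else who hits this, here's a rough sketch of how to inspect and adjust the authorized ranges after deployment (the resource group and cluster names below are placeholders, not the ones from this walkthrough):

$> az aks show -g <resource-group> -n <cluster-name> --query apiServerAccessProfile.authorizedIpRanges
$> az aks update -g <resource-group> -n <cluster-name> --api-server-authorized-ip-ranges 10.240.0.0/16,<office-cidr>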

I apologize. I had forgotten I had included this optional step in my setup.

Whoa, that's really good to know. I don't have much experience with the nuance of that setting. Typically I deploy private clusters (like AKS Baseline for regulated), which doesn't even support setting that value. TIL as well :) I wonder if we could improve the directions around that optional setting to make it clear about what should also be included.

In this case I've opted to include the entire address space for the spoke, because the restriction also seems to block getting logs and other necessary commands. I suggest something like this as a starting point to update the step:

โ— By default, this deployment will allow unrestricted access to your cluster's API Server.
You can limit access to the API Server to a set of well-known IP addresses
(i.e., a jump box subnet (connected to by Azure Bastion), build agents, or any other networks
you'll administer the cluster from) by setting the clusterAuthorizedIPRanges parameter in all
deployment options. Be aware this will also restrict access from inside the cluster to the API Server.
Be sure to authorize any cluster address spaces that require API access. This implementation
requires 10.240.0.0/16 to have access to the API server.

Thanks for the suggestion, again, @scaswell-hirez. Quality of life improvements along the way make it easier for the next person that needs to use this content. Much appreciated.

Update: apparently just opening 10.240.0.0/16 isn't enough for flux. Apologies, I spoke too soon. I'm not sure what range needs to be authorized to make flux work yet, but further testing has revealed that authorizing all of the following still results in flux failing to launch:

10.240.0.0/16
10.200.0.0/24
172.16.0.0/16
172.18.0.0/16

I haven't found any other CIDRs in the project to open up and test. I've also noticed that enabling authorized IPs after deploying flux will cause kured and flux to fail after a certain amount of time.

Did you get a chance to view the text that I just added? You need to make sure that you include all your public IPs (from your egress Firewall). That's the part that we didn't have called out in this walkthrough's optional step.
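
In case it helps the next person, here's a rough sketch of pulling the egress firewall's public IP and adding it to the authorized ranges (the resource names and placeholders below are illustrative, not the exact ones from this deployment):

$> FW_IP=$(az network public-ip show -g <hub-resource-group> -n <firewall-public-ip-name> --query ipAddress -o tsv)
$> az aks update -g <cluster-resource-group> -n <cluster-name> --api-server-authorized-ip-ranges ${FW_IP}/32,<office-cidr>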

I just tested it out and yes, only including the firewall's public IP allowed flux to launch as expected. Thanks!

Thanks for helping make this content better, @scaswell-hirez! ^5