microsoft/Docker-Provider

Agent cannot onboard

Closed this issue · 17 comments

Hello,

I created this issue last week in the OMS-Agent-for-Linux repo, but I think it belongs here. Sorry for the cross-repo issue.

This issue is about the Azure Monitor for containers agent running on Azure Arc-enabled Kubernetes clusters, as described on this page.

The agents fail to onboard with the following log message:
checkAgentOnboardingStatus giving up checking agent onboarding status after 31 secs

This is a VMware TKG 2.2 cluster running on AWS VMC.

Note that on a Kubernetes cluster created via Cluster API on Azure, the agents onboard successfully and start collecting logs and metrics.

Complete logs are below:

Defaulted container "ama-logs" out of: ama-logs, ama-logs-prometheus
customResourceId:/subscriptions/68b52005-df94-4f64-9e5a-b4f6a58b88a2/resourceGroups/landing-zone-azure-arc/providers/Microsoft.Kubernetes/ConnectedClusters/ams1-cncr-prod
customRegion:westeurope
****************Start Config Processing********************
****************Start NPM & subnet ip usage integrations Config Processing********************
config::integrations::Successfully substituted the placeholders for integrations into /etc/opt/microsoft/docker-cimprov/telegraf.conf file for DaemonSet
config::integrations::Successfully substituted the integrations placeholders into /etc/opt/microsoft/docker-cimprov/telegraf.conf file for DaemonSet
Making curl request to oms endpint with domain: opinsights.azure.com
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl request to oms endpoint succeeded.
MARINER VERSION="2.0.20230526"
Azure mdsd: 1.26.1-build.master.97
telegraf 1.26.0-2.cm2
DOCKER_CIMPROV_VERSION=18.0.1-0
Fluent Bit v2.0.9
Git commit:
fluentd 1.14.6
****************Start Config Processing********************
Both stdout & stderr log collection are turned off for namespaces: '*_kube-system_*.log'
****************End Config Processing********************
config::Starting to substitute the placeholders in fluent-bit.conf file for log collection
config::Successfully substituted the placeholders in fluent-bit.conf file
config::Starting to substitute the placeholders in fluent-bit-common.conf file for log collection
config::Successfully substituted the placeholders in fluent-bit-common.conf file
****************Start Agent Integrations Config Processing********************
****************Start Prometheus Config Processing********************
config::No configmap mounted for prometheus custom config, using defaults
****************End Prometheus Config Processing********************
****************Start MDM Metrics Config Processing********************
****************End MDM Metrics Config Processing********************
****************Start Metric Collection Settings Processing********************
****************End Metric Collection Settings Processing********************
MUTE_PROM_SIDECAR = false
Making wget request to cadvisor endpoint with port 10250
Using port 10250
Making curl request to cadvisor endpoint /pods with port 10250 to get the configured container runtime on kubelet
configured container runtime on kubelet is : containerd
set caps for ruby process to read container env from proc
ams1-cncr-prod-md-0-k5sfq-5d4cd4f46-4sfx9
*** setting up oneagent in legacy auth mode ***
setting mdsd workspaceid & key for workspace:19cb4dda-53b1-40ee-9bcd-924da2f568a6
starting mdsd in main container...
setting up cronjob for ci agent log rotation
*** starting fluentd v1 in daemonset
starting fluent-bit and setting telegraf conf file for daemonset
using fluentbitconf file: fluent-bit.conf for fluent-bit
since container run time is containerd update the container log fluentbit Parser to cri from docker
nodename: ams1-cncr-prod-md-0-k5sfq-5d4cd4f46-4sfx9
replacing nodename in telegraf config
checking for listener on tcp #25226 and waiting for 45 secs if not..
File Doesnt Exist. Creating file...
Fluent Bit v2.0.9
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

Routing container logs thru v2 route...
waitforlisteneronTCPport found listener on port:25226 in 2 secs
checking for listener on tcp #25228 and waiting for 120 secs if not..
waitforlisteneronTCPport found listener on port:25228 in 19 secs
2023-07-17T14:36:44Z I! Loading config file: /etc/opt/microsoft/docker-cimprov/telegraf.conf
checkAgentOnboardingStatus giving up checking agent onboarding status after 31 secs
startup script took: 66 seconds

pfrcks commented

@raftAtGit can you run the troubleshooting script mentioned here?

@pfrcks I cannot. This is a stripped-down Ubuntu distro; there are not even sudo or apt-get commands to install the prerequisites.
Any suggestions?

This issue is stale because it has been open 7 days with no activity. Remove stale label or comment or this will be closed in 5 days.

/Remove stale label

pfrcks commented

@raftAtGit can you collect logs using this script: https://github.com/microsoft/Docker-Provider/blob/ci_prod/scripts/troubleshoot/LogCollection/README.md

You can do this from a separate VM where you can install the prerequisites and copy the cluster kubeconfig.

AKSInsights-logs.1691574215.DESKTOP-AOHJ94O.zip
(extracted original tgz file and zipped since GitHub does not accept tgz files)

Below is the output of the script.

Preparing for log collection...
Prerequisites check complete!
Saving cluster information...
cluster info saved to Tool.log
Collecting logs from ama-logs-4vscr...
Defaulted container "ama-logs" out of: ama-logs, ama-logs-prometheus
Collecting the following logs from ama-logs-4vscr:
/var/opt/microsoft/docker-cimprov/log | Containers ama-logs, ama-logs-prometheus
/var/opt/microsoft/linuxmonagent/log | Containers ama-logs, ama-logs-prometheus
/etc/mdsd.d/config-cache/configchunks/ | Data Collection Rule Config
Collecting the following logs from ama-logs-4vscr:
/etc/fluent/container.conf | Containers ama-logs, ama-logs-prometheus
Collecting the following logs from ama-logs-4vscr:
/etc/opt/microsoft/docker-cimprov/fluent-bit.conf | Containers ama-logs, ama-logs-prometheus
/etc/opt/microsoft/docker-cimprov/telegraf.conf | Containers ama-logs, ama-logs-prometheus
Complete log collection from ama-logs-4vscr!
Windows agent pod does not exist, skipping log collection for windows agent pod
Collecting logs from ama-logs-rs-587f48c568-4dfk7...
Collecting the following logs from ama-logs-rs-587f48c568-4dfk7:
/var/opt/microsoft/docker-cimprov/log
/var/opt/microsoft/linuxmonagent/log
Collecting the following logs from ama-logs-rs-587f48c568-4dfk7:
/etc/fluent/kube.conf
Collecting the following logs from ama-logs-rs-587f48c568-4dfk7:
/etc/opt/microsoft/docker-cimprov/fluent-bit-rs.conf
/etc/opt/microsoft/docker-cimprov/telegraf-rs.conf
Complete log collection from ama-logs-rs-587f48c568-4dfk7!
Collecting onboarding logs...
Collecting deployment info...
configMap named container-azm-ms-configmap is not found, if you created configMap for ama-logs, please manually save your custom configMap of ama-logs by command: kubectl get configmaps <configMap name> --namespace=kube-system -o yaml > configMap.yaml
configMap named container-azm-ms-aks-k8scluster is not found, if you created configMap for ama-logs, please manually save your custom configMap of ama-logs by command: kubectl get configmaps <configMap name> --namespace=kube-system -o yaml > configMap.yaml
Collecting ama-logs-rs-config configmap...
If syslog collection is enabled please make sure that the node pool image is Nov 2022 or later.        To check current version and upgrade: https://learn.microsoft.com/en-us/azure/aks/node-image-upgrade
Complete onboarding log collection!

Archiving logs...
log files have been written to AKSInsights-logs.1691574215.DESKTOP-AOHJ94O.tgz in current folder

pfrcks commented

From the describe*.txt files we can see that some essential mounts are failing:


  Warning  FailedMount  2m7s  kubelet            MountVolume.SetUp failed for volume "osm-settings-vol-config" : failed to sync configmap cache: timed out waiting for the condition
  Warning  FailedMount  2m7s  kubelet            MountVolume.SetUp failed for volume "ama-logs-secret" : failed to sync secret cache: timed out waiting for the condition
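To pull just the affected volume names out of a describe file, a small filter along these lines may help. This is a minimal sketch: `failed_mount_volumes` is a hypothetical helper name, not part of the troubleshooting scripts, and it assumes the event lines have the same shape as the two warnings above.

```shell
# Hypothetical helper (not part of the log collection script): reads
# `kubectl describe pod` output on stdin and prints each volume named in a
# FailedMount warning, one per line, without duplicates.
failed_mount_volumes() {
  grep 'FailedMount' \
    | sed -n 's/.*failed for volume "\([^"]*\)".*/\1/p' \
    | sort -u
}
```

For example, piping one of the collected describe*.txt files, or the output of `kubectl describe pod ama-logs-4vscr -n kube-system`, into `failed_mount_volumes` should list `ama-logs-secret` and `osm-settings-vol-config` for the events shown. The `kube-system` namespace here is an assumption taken from the log collection script's own messages.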

@raftAtGit can you confirm that you followed the steps as mentioned here for the legacy/non-managed identity scenario?

Also did you first connect the cluster to Azure Arc before trying to install our extension?

> @raftAtGit can you confirm that you followed the steps as mentioned here for the legacy/non-managed identity scenario?

Correct.

> Also did you first connect the cluster to Azure Arc before trying to install our extension?

That is also correct.

This setup just worked on a cluster created with Cluster API on Azure.
It just doesn't work on a VMware TKG 2.2 cluster for some reason.

This issue was closed because it has been stalled for 12 days with no activity.

@raftAtGit, the errors below indicate an issue with the cluster. ama-logs-secret is critical for onboarding the agent. Because it is failing to mount, the agent fails to send data, so you would need to fix the cluster.

Warning FailedMount 2m7s kubelet MountVolume.SetUp failed for volume "osm-settings-vol-config" : failed to sync configmap cache: timed out waiting for the condition
Warning FailedMount 2m7s kubelet MountVolume.SetUp failed for volume "ama-logs-secret" : failed to sync secret cache: timed out waiting for the condition
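
One way to narrow this down is to check whether the object behind the volume exists at all, since a missing secret and a kubelet that cannot sync an existing one call for different fixes. A minimal sketch, assuming the secret has the same name as the volume (`ama-logs-secret`) and lives in `kube-system`, the namespace the log collection script's messages refer to; `check_secret` is a hypothetical helper:

```shell
# Hypothetical helper: report whether a secret exists in a namespace.
# Both defaults are assumptions: the secret name is taken from the volume
# name in the FailedMount event, the namespace from the script output.
check_secret() {
  name="${1:-ama-logs-secret}"
  ns="${2:-kube-system}"
  if kubectl get secret "$name" --namespace "$ns" >/dev/null 2>&1; then
    echo "secret $ns/$name exists"
  else
    echo "secret $ns/$name not found or not readable"
    return 1
  fi
}
```

Calling `check_secret` with no arguments uses the assumed defaults; pass a name and namespace to check something else. If the secret is present, the "failed to sync secret cache" timeout may point to the kubelet on that node rather than to a missing object.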

@ganga1980 what do you mean by fix the cluster?

We simply enable the azuremonitor feature on Azure Arc-enabled Kubernetes as described on this page; we have no control over what actions it takes or what happens in the cluster.

Again, this setup just worked on a cluster created with Cluster API on Azure.
It just doesn't work on a VMware TKG 2.2 cluster for some reason.

Hi @raftAtGit, the errors below indicate that the kubelet is having trouble mounting these volumes. Can you please re-enable the feature and see if it works?

Warning FailedMount 2m7s kubelet MountVolume.SetUp failed for volume "osm-settings-vol-config" : failed to sync configmap cache: timed out waiting for the condition
Warning FailedMount 2m7s kubelet MountVolume.SetUp failed for volume "ama-logs-secret" : failed to sync secret cache: timed out waiting for the condition