replicatedhq/troubleshoot

Possible improvements to Velero collector/analyzer

xavpaice opened this issue · 4 comments

Describe the rationale for the suggested feature.

For the Replicated team, a number of support issues raised are associated with Velero. The information in the Velero analyzer is useful, but not quite complete.

I would like to review the collector/analyzer for Velero, to see what improvements can be made that would have the most impact on our being able to solve support issues faster.

See https://github.com/vmware-tanzu/velero/issues/new?assignees=&labels=&projects=&template=bug_report.md for the kind of information the Velero maintainers themselves ask for.

If we are able to produce a useful support bundle and analysis, there's also an opportunity to discuss adding this to the Velero project as a diagnostic tool to help the maintainers.

First step:

  • ensure the Troubleshoot Velero collector gathers all the info requested in a Velero bug report
  • review support issues to find what info we needed to diagnose them, and what analysis could have helped highlight the issue earlier

Second step:

  • write individual issues for updates to Troubleshoot

Velero already has a velero debug command, which collects a bunch of information.

The definition of done here is to:

  • review the Velero analyzer in Troubleshoot
  • check if there's missing information that would be helpful
  • check that the analyzer is useful to our troubleshooting efforts with support issues
  • produce detailed issues/stories for any changes that we recommend

Current info collected by the velero debug bundle

[gerard@gerard-kurl ~]$ velero debug --backup instance-ggs98
2024/03/12 01:11:39 Collecting velero resources in namespace: velero
2024/03/12 01:11:40 Collecting velero deployment logs in namespace: velero
2024/03/12 01:11:40 Collecting log and information for backup: instance-ggs98
2024/03/12 01:11:41 Generated debug information bundle: /home/gerard/bundle-2024-03-12-01-11-39.tar.gz
[gerard@gerard-kurl ~]$ tar -tzf /home/gerard/bundle-2024-03-12-01-11-39.tar.gz
velero-bundle
velero-bundle/backup_describe_instance-ggs98.txt
velero-bundle/backup_instance-ggs98.log
velero-bundle/kubecapture
velero-bundle/kubecapture/core_v1
velero-bundle/kubecapture/core_v1/velero
velero-bundle/kubecapture/core_v1/velero/node-agent-5k98b
velero-bundle/kubecapture/core_v1/velero/node-agent-5k98b/node-agent
velero-bundle/kubecapture/core_v1/velero/node-agent-5k98b/node-agent/node-agent.log
velero-bundle/kubecapture/core_v1/velero/pods-202403120111.6465.json
velero-bundle/kubecapture/core_v1/velero/velero-854f967b7f-btw9q
velero-bundle/kubecapture/core_v1/velero/velero-854f967b7f-btw9q/replicated-kurl-util
velero-bundle/kubecapture/core_v1/velero/velero-854f967b7f-btw9q/replicated-kurl-util/replicated-kurl-util.log
velero-bundle/kubecapture/core_v1/velero/velero-854f967b7f-btw9q/replicated-local-volume-provider
velero-bundle/kubecapture/core_v1/velero/velero-854f967b7f-btw9q/replicated-local-volume-provider/replicated-local-volume-provider.log
velero-bundle/kubecapture/core_v1/velero/velero-854f967b7f-btw9q/velero
velero-bundle/kubecapture/core_v1/velero/velero-854f967b7f-btw9q/velero/velero.log
velero-bundle/kubecapture/core_v1/velero/velero-854f967b7f-btw9q/velero-velero-plugin-for-aws
velero-bundle/kubecapture/core_v1/velero/velero-854f967b7f-btw9q/velero-velero-plugin-for-aws/velero-velero-plugin-for-aws.log
velero-bundle/kubecapture/core_v1/velero/velero-854f967b7f-btw9q/velero-velero-plugin-for-gcp
velero-bundle/kubecapture/core_v1/velero/velero-854f967b7f-btw9q/velero-velero-plugin-for-gcp/velero-velero-plugin-for-gcp.log
velero-bundle/kubecapture/core_v1/velero/velero-854f967b7f-btw9q/velero-velero-plugin-for-microsoft-azure
velero-bundle/kubecapture/core_v1/velero/velero-854f967b7f-btw9q/velero-velero-plugin-for-microsoft-azure/velero-velero-plugin-for-microsoft-azure.log
velero-bundle/kubecapture/velero.io_v1
velero-bundle/kubecapture/velero.io_v1/velero
velero-bundle/kubecapture/velero.io_v1/velero/backuprepositories-202403120111.2620.json
velero-bundle/kubecapture/velero.io_v1/velero/backups-202403120111.2612.json
velero-bundle/kubecapture/velero.io_v1/velero/backupstoragelocations-202403120111.2617.json
velero-bundle/kubecapture/velero.io_v1/velero/podvolumebackups-202403120111.2621.json
velero-bundle/kubecapture/velero.io_v1/velero/serverstatusrequests-202403120111.2623.json
velero-bundle/version.txt

Data collected are:

  • Velero client and server version
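  • Velero custom resources in the velero namespace (backups, backup repositories, backup storage locations, pod volume backups, server status requests)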
  • Velero deployment logs
  • Logs and describe output for a specific backup (if the --backup flag is provided). The describe output is verbose and includes the resource list
  • Logs and describe output for a specific restore (if the --restore flag is provided). The describe output is verbose and includes the resource list

This data is sufficient to troubleshoot issues related to Velero backup/restore of snapshots.
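For a quick first pass over an extracted velero debug bundle, something like the sketch below could count suspicious lines per log file. It reuses the error|panic|fatal pattern that the Troubleshoot analyzer reports in its log check. The function name scan_velero_logs is hypothetical, and matching case-insensitively (-i) is my assumption, not necessarily what the analyzer does.

```shell
# Hypothetical helper: for every *.log file under an extracted bundle
# directory, print "<matching line count><TAB><path>".
scan_velero_logs() {
  dir="${1:-velero-bundle}"
  # -type f skips the per-container directories that share the log's base name
  find "$dir" -type f -name '*.log' -exec sh -c '
    for f in "$@"; do
      # grep -c prints the number of matching lines (0 when there are none);
      # -i (case-insensitive) is an assumption made for this sketch
      n=$(grep -ciE "error|panic|fatal" "$f")
      printf "%s\t%s\n" "$n" "$f"
    done
  ' sh {} +
}

# Usage, with the bundle from the listing above:
#   tar -xzf bundle-2024-03-12-01-11-39.tar.gz
#   scan_velero_logs velero-bundle | sort -rn
```

Sorting by count puts the noisiest container logs first, which is usually where to start reading.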

Notes on the current Velero analyzer in Troubleshoot

  • the pod log collector must have its name set to velero/logs for the analyzer to find the logs
  • sample checks
Check PASS
Title: At least 1 Backup Repository configured
Message: Found 1 backup repositories configured and 1 Ready

------------
Check PASS
Title: Velero Logs analysis for kind [node-agent*]
Message: Found 1 log files

------------
Check WARN
Title: Velero logs for pod [/tmp/supportbundle3307708783/support-bundle-2024-03-12T05_13_27/velero/logs/velero-854f967b7f-btw9q/velero.log]
Message: Found error|panic|fatal in velero* pod log file(s)

------------
Check PASS
Title: Velero Logs analysis for kind [velero*]
Message: Found 6 log files

------------
Check PASS
Title: Velero Backups
Message: Found 2 backups

------------
Check PASS
Title: At least 1 Backup Storage Location configured
Message: Found 1 backup storage locations configured and 1 Available

------------
Check PASS
Title: Pod Volume Backups
Message: Found 1 pod volume backups

------------
Check PASS
Title: Velero Status
Message: Velero setup is healthy

------------
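For anyone wanting to reproduce the checks above, a minimal spec might look roughly like the sketch below. It follows my reading of the Troubleshoot docs for the logs collector and the velero analyzer, so the field names should be double-checked against the current schema. The file name velero-support-bundle.yaml is arbitrary, and the velero namespace is taken from the velero debug output earlier in this issue.

```shell
# Write a minimal SupportBundle spec. The heredoc delimiter is quoted so the
# shell does not expand anything inside the YAML.
cat > velero-support-bundle.yaml <<'EOF'
apiVersion: troubleshoot.sh/v1beta2
kind: SupportBundle
metadata:
  name: velero
spec:
  collectors:
    - logs:
        namespace: velero
        name: velero/logs   # the analyzer expects pod logs under this name
  analyzers:
    - velero: {}
EOF

# Then, assuming the support-bundle kubectl plugin is installed:
#   kubectl support-bundle ./velero-support-bundle.yaml
```

With the collector named velero/logs, the logs land under velero/logs/<pod>/ in the bundle, which matches the path shown in the WARN check above.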