tinkerbell/charts

Tink not responding to worker's gRPC requests.


I filed this in the tink repo, but maybe it's more k8s-centric?

I just tried the updates from a couple of weeks ago and it seems to be the same issue, but I haven't re-done the gRPC tcpdumps yet...

tinkerbell/tink#780

I have Tink set up with the Helm chart. I use MetalLB instead of kube-vip; Boots uses a Layer 2 address and tink/everything else uses BGP. The tink-server seems to receive the gRPC message, but doesn't seem to respond. I've attached a pcap from the point of view of the tink-server container. There don't seem to be any logs of what's happening in the tink-server container, and the nginx container just shows an HTTP/2 POST to the GetWorkflowContexts endpoint.
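For context on the load-balancer split, this is roughly how the assigned IPs can be checked (a sketch only: the MetalLB resource kinds are the standard v1beta1 ones, but the namespaces are assumptions about my layout rather than copied from my repo):

kubectl get ipaddresspools,l2advertisements,bgpadvertisements -n metallb-system
kubectl get svc -n tink-system -o wide    # shows the external IP handed to boots, the stack/nginx service, etc.

The grpc_authority in the /proc/cmdline output further down should match one of those external IPs.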

Expected Behaviour

I'd expect the workflow to be transmitted to the worker.

Current Behaviour

The machine constantly asks for GetWorkflowContexts and doesn't do anything else. The tink-worker Docker container just keeps running on the target machine and repeats the request once every 5-10 seconds.
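A quick way to poke the endpoint by hand from another machine is grpcurl (a sketch only: it assumes tink-server has gRPC reflection enabled and that the request message takes a worker_id field, both of which are assumptions on my part):

grpcurl -plaintext -d '{"worker_id": "00:25:90:a7:f9:9e"}' 192.168.255.255:42113 proto.WorkflowService/GetWorkflowContexts

If that call hangs or comes back empty in the same way, it at least rules out the worker in the installation environment as the cause.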

Steps to Reproduce (for bugs)

  1. Install the Helm chart into the tink-system namespace
  2. Add hardware/template/workflow/bmc-hardware/bmc-secrets to the tink-system namespace (roughly the commands sketched after this list)
  3. Reboot the target machine via PXE (manually, as Rufio wasn't working the last time I tried, though I haven't retried it in a while)
  4. Check that the tink-worker Docker container stays up
  5. Check the logs for GetWorkflowContexts
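For completeness, the rough shape of steps 1 and 2 (a sketch, not the exact commands from my settings repo; the chart reference, values file, and manifest file names below are illustrative):

  helm upgrade --install tink oci://ghcr.io/tinkerbell/charts/stack \
    --namespace tink-system --create-namespace --values values.yaml
  kubectl apply -n tink-system \
    -f hardware.yaml -f template.yaml -f workflow.yaml \
    -f bmc-machine.yaml -f bmc-secret.yaml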

Your Environment

  • Operating System and version (e.g. Linux, Windows, MacOS): Debian Linux

  • How are you running Tinkerbell? Using Vagrant & VirtualBox, Vagrant & Libvirt, on Packet using Terraform, or give details:
    Home kubernetes cluster.

  • Link to your project or a code example to reproduce issue:
    https://github.com/ClashTheBunny/tink-settings contains my current settings
I'm using https://github.com/ClashTheBunny/tinkerbell-charts as a git submodule so that the file://../../ dependencies reference those charts. I only just tried to update to v0.9.0, which necessitated this change, but I was using the upstream charts when trying the default versions in the Helm chart.
    Here's a pcap using v0.9.0 everywhere, as in the above tink-settings and charts:
    tink-dump.pcap.gz

192.168.32.43 - - [31/Jul/2023:13:48:56 +0000] "GET /vmlinuz-x86_64 HTTP/1.1" 200 11599840 "-" "iPXE/1.0.0+"
192.168.32.43 - - [31/Jul/2023:13:49:14 +0000] "GET /initramfs-x86_64 HTTP/1.1" 200 198301654 "-" "iPXE/1.0.0+"
192.168.32.43 - - [31/Jul/2023:13:49:55 +0000] "POST /proto.WorkflowService/GetWorkflowContexts HTTP/2.0" 200 0 "-" "grpc-go/1.56.2"
192.168.32.43 - - [31/Jul/2023:13:50:01 +0000] "POST /proto.WorkflowService/GetWorkflowContexts HTTP/2.0" 200 0 "-" "grpc-go/1.56.2"
192.168.32.43 - - [31/Jul/2023:13:50:07 +0000] "POST /proto.WorkflowService/GetWorkflowContexts HTTP/2.0" 200 0 "-" "grpc-go/1.56.2"
192.168.32.43 - - [31/Jul/2023:13:50:13 +0000] "POST /proto.WorkflowService/GetWorkflowContexts HTTP/2.0" 200 0 "-" "grpc-go/1.56.2"
192.168.32.43 - - [31/Jul/2023:13:50:19 +0000] "POST /proto.WorkflowService/GetWorkflowContexts HTTP/2.0" 200 0 "-" "grpc-go/1.56.2"
192.168.32.43 - - [31/Jul/2023:13:50:25 +0000] "POST /proto.WorkflowService/GetWorkflowContexts HTTP/2.0" 200 0 "-" "grpc-go/1.56.2"

The worker_id being transmitted can be seen in the tcpdump, and /proc/cmdline corroborates it:

(ns: getty) super0:~# cat /proc/cmdline
ip=dhcp tink_worker_image=quay.io/tinkerbell/tink-worker:v0.9.0 facility=sandbox syslog_host=192.168.32.233 grpc_authority=192.168.255.255:42113 tinkerbell_tls=false worker_id=00:25:90:a7:f9:9e hw_addr=00:25:90:a7:f9:9e modules=loop,squashfs,sd-mod,usb-storage intel_iommu=on iommu=pt initrd=initramfs-x86_64 console=tty0 console=ttyS1,11520
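A quick sanity check that this worker_id actually matches something on the cluster side (a crude grep rather than a precise field path, since the exact layout of the Hardware and Workflow specs varies between CRD versions):

kubectl get hardware,workflows -n tink-system -o yaml | grep -i '00:25:90:a7:f9:9e'

If nothing matched, tink-server would have no workflow context to hand back for that worker, which would look a lot like this symptom.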

The problem was actually that the controller was having trouble parsing some Go templating; the errors were present in the controller's logs.
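For anyone hitting the same symptom, the place to look is the controller rather than tink-server or the worker (a sketch; the deployment name may differ in your chart version):

kubectl logs -n tink-system deploy/tink-controller

The kind of thing that triggers it is a {{ ... }} reference in the Template spec that the controller can't parse or render, for example an unbalanced brace or a key the workflow never defines; while that's broken, the worker just keeps polling GetWorkflowContexts and never gets a context back.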