nginxinc/nginx-service-mesh

nginx-mesh-metrics 0.7.0 TLS handshake error

tanvp112 opened this issue · 14 comments

This component reports the following error in its logs every few seconds:

echo: http: TLS handshake error from :: remote error: tls: bad certificate

The TLS error also causes kubectl api-resources to fail with "unable to retrieve the complete list of server APIs"

Peeking into the kube-apiserver log:

E0105 10:03:45.159812 1 controller.go:116] loading OpenAPI spec for "v1alpha1.metrics.smi-spec.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: error trying to reach service: x509: certificate signed by unknown authority (possibly because of "x509: ECDSA verification failure" while trying to verify candidate authority certificate "NGINX")

It looks like the nginx-mesh-metrics cert cannot be validated by Kubernetes (or is rejected as self-signed?)... is there a workaround for this? Is this related to Spire?

@tanvp112 This has been logged and is going through triage. Thanks for the submission and we'll have an update shortly.

Is this reproducible?

It could be related to the Spire server, but also to this particular deployment of NSM. I suspect there may be a race between the Spire server communicating the CA_BUNDLE to the cluster and creation of the APIService.

Does the APIService have a value set for the .spec.caBundle field?

I ran kubectl describe against nginx-mesh-api and didn't spot any CA bundle specified, and didn't see any specific CA settings in /etc/config/mesh-config.yaml either. If I have looked in the wrong place, please let me know.

This is a stock installation using nginx-meshctl with mTLS turned off, following the installation document.

By the way, it would be GREAT if more info about how SPIRE works in NSM could be published ...

Sorry, I should've been a little more explicit.

kubectl get apiservices v1alpha1.metrics.smi-spec.io -o yaml

There will be a .spec.caBundle field. We suspect this will be unset in your deployment.

If so, we believe you're being affected by a race with Spire - there is current work being designed to close this race and in turn improve both products. The only effective workaround is to remove and redeploy at this point.
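To check the field quickly without reading the full YAML, a jsonpath query along these lines should work (empty output means the field is unset; the APIService name is the one from this thread):

```shell
# Print the start of the caBundle, if any; no output means it is unset.
kubectl get apiservice v1alpha1.metrics.smi-spec.io \
  -o jsonpath='{.spec.caBundle}' | head -c 40
```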

It's our policy to not detail too many aspects of third-party dependencies. This is an overall benefit, as it ensures that our documentation doesn't diverge from the downstream product and we don't erroneously discuss out-of-date features that would lead to greater confusion. Spire documentation is very good and informative, https://spiffe.io/docs/latest/spiffe/overview/.

If there are places where you'd like us to mention more explicitly how to find further information, we'd appreciate the suggestion.

Here's the output:

apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  creationTimestamp: "2021-01-03T07:09:22Z"
  labels:
    app.kubernetes.io/name: nginx-mesh-metrics
    app.kubernetes.io/part-of: nginx-service-mesh
  name: v1alpha1.metrics.smi-spec.io
  resourceVersion: "431984"
  uid: 2497f32a-d5c4-4ac3-89aa-13f1cbd0ad50
spec:
  caBundle: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUJqVENDQVRLZ0F3SUJBZ0lCQURBS0JnZ3Foa2pPUFFRREFqQWRNUXN3Q1FZRFZRUUdFd0pWVXpFT01Bd0cKQTFVRUNoTUZUa2RKVGxnd0hoY05NakV3TVRBek1EY3dPRFU1V2hjTk1qRXdPREF4TURjd09UQTVXakFkTVFzdwpDUVlEVlFRR0V3SlZVekVPTUF3R0ExVUVDaE1GVGtkSlRsZ3dXVEFUQmdjcWhrak9QUUlCQmdncWhrak9QUU1CCkJ3TkNBQVJPczV0NENrb0taYlhxMStCY08xWVFsUERiR2U5aVFxVUs2cC9mMjFFazBBb3lQUFNJRHZ3c3VPekQKeXUxbE0vbnJRNytjaWo3NE94NXRPby8zSUtJYW8yTXdZVEFPQmdOVkhROEJBZjhFQkFNQ0FZWXdEd1lEVlIwVApBUUgvQkFVd0F3RUIvekFkQmdOVkhRNEVGZ1FVRUZjbEZTcnNoZWJUM3VNSUFmRlBlVWJlUElVd0h3WURWUjBSCkJCZ3dGb1lVYzNCcFptWmxPaTh2WlhoaGJYQnNaUzV2Y21jd0NnWUlLb1pJemowRUF3SURTUUF3UmdJaEFQWlQKZWRjRVZEbnlNMHAzZWVkMjFPeElkLzRORGNqUnJvdnptdzBYNld5WEFpRUFqK1VDU3A1MDJMYkIyV3ZoazRBcwpnMVNRenI3bkRFSDdaODgrSkM1dU1OST0KLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQo=
  group: metrics.smi-spec.io
  groupPriorityMinimum: 100
  service:
    name: nginx-mesh-metrics-svc
    namespace: nginx-mesh
    port: 443
  version: v1alpha1
  versionPriority: 100
status:
  conditions:
  - lastTransitionTime: "2021-01-16T02:39:27Z"
    message: all checks passed
    reason: Passed
    status: "True"
    type: Available

The caBundle decodes to:

Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 0 (0x0)
    Signature Algorithm: ecdsa-with-SHA256
        Issuer: C=US, O=NGINX
        Validity
            Not Before: Jan  3 07:08:59 2021 GMT
            Not After : Aug  1 07:09:09 2021 GMT
        Subject: C=US, O=NGINX
        Subject Public Key Info:
            Public Key Algorithm: id-ecPublicKey
                Public-Key: (256 bit)
                pub:
                    04:4e:b3:9b:78:0a:4a:0a:65:b5:ea:d7:e0:5c:3b:
                    56:10:94:f0:db:19:ef:62:42:a5:0a:ea:9f:df:db:
                    51:24:d0:0a:32:3c:f4:88:0e:fc:2c:b8:ec:c3:ca:
                    ed:65:33:f9:eb:43:bf:9c:8a:3e:f8:3b:1e:6d:3a:
                    8f:f7:20:a2:1a
                ASN1 OID: prime256v1
                NIST CURVE: P-256
        X509v3 extensions:
            X509v3 Key Usage: critical
                Digital Signature, Certificate Sign, CRL Sign
            X509v3 Basic Constraints: critical
                CA:TRUE
            X509v3 Subject Key Identifier:
                10:57:25:15:2A:EC:85:E6:D3:DE:E3:08:01:F1:4F:79:46:DE:3C:85
            X509v3 Subject Alternative Name:
                URI:spiffe://example.org
    Signature Algorithm: ecdsa-with-SHA256
         30:46:02:21:00:f6:53:79:d7:04:54:39:f2:33:4a:77:79:e7:
         76:d4:ec:48:77:fe:0d:0d:c8:d1:ae:8b:f3:9b:0d:17:e9:6c:
         97:02:21:00:8f:e5:02:4a:9e:74:d8:b6:c1:d9:6b:e1:93:80:
         2c:83:54:90:ce:be:e7:0c:41:fb:67:cf:3e:24:2e:6e:30:d2

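For reference, a decoded dump like the one above can be produced by piping the caBundle through base64 and openssl (a sketch; assumes openssl is installed locally):

```shell
# Decode the APIService caBundle and print the certificate as text.
kubectl get apiservice v1alpha1.metrics.smi-spec.io \
  -o jsonpath='{.spec.caBundle}' \
  | base64 -d \
  | openssl x509 -noout -text
```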
Please correct me where I am wrong: the error occurs because the SPIRE CA cert that signed the nginx-mesh API server cert is not accessible (or simply not available) for Kubernetes to verify the nginx-mesh API server cert during the HTTPS exchange?

I respect the policy... though you might agree that one of the top issues in running a service mesh is knowing how each component is set up to function with the others. In this case, we probably don't need to know SPIRE from head to toe, but it would be great if the Reference section of the NSM docs could outline exactly how SPIRE is configured by default (by nginx-meshctl) for NSM. For me it's all about time to resolution, a key KPI for a real production run. We want greater confidence in running this mesh in real production.

Thank you for the documentation suggestion. We have scheduled work to greatly improve our docs and will incorporate more concepts and configuration parameters.

We have work scheduled to fix this issue. It is somewhat dependent on some upstream changes we've submitted to Spire, but those may not make their 1.0 release.

It can be encountered if the APIService .spec.caBundle is not set, or if the CA expires. In either case the APIService can be updated with the proper data.

To work around this problem:

  1. Set --mtls-ca-ttl to a large value to reduce how often the problem is encountered (not a production recommendation)
  2. Manually rotate the APIService CA bundle
    i. kubectl -n nginx-mesh get cm spire-bundle -o "jsonpath={.data['bundle\.crt']}" | base64 -i -
    ii. kubectl edit apiservice v1alpha1.metrics.smi-spec.io
    iii. Update the .spec.caBundle with the base64 encoded output from step (i)

Until we can release our fix, this process needs to be repeated at the same interval as the --mtls-ca-ttl.

For the workaround, I followed your instructions to rotate the caBundle and restart the services; unfortunately the error persists. After removing and upgrading to 0.8.0, it works as long as the control plane is not restarted. If it is restarted the error reappears, and rotating the CA bundle doesn't seem to help. A redeploy removes the error.

I look forward to the permanent fix and more documentation (the nats-server...).

This change has been gated on Spire releases. We're incorporating the proper Spire release into our v1.1 release.

We'll be adding more architecture-level documentation in the next few days.

Excellent, looking forward to it!

It's related to SPIRE, and I believe it's caused by the Helm chart (a second apply of upgrade --install).
This is caused by an inconsistency between the Secret (tls.key/tls.crt) and the running spire-server and spire-agent pods.
Delete the spire server and agent pods in order and the problem will go away.

I've experienced this, and some more problems (like the namespace job) with ArgoCD, where I needed to ignore changes on tls.key/tls.crt as well as on the admission webhook CA bundle.

In the case of the namespace, the job fails to apply labels because they already exist. A quick workaround was || exit 0 on the kubectl call, which makes some... assumptions.

Overall the Helm chart needs to be polished and improved a bit; I have a PR in progress that I'll send for review once I'm done with it :)
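The ordered restart mentioned above might look like this (the -l label selectors are my assumption, not verified against the chart; check with kubectl -n nginx-mesh get pods --show-labels first):

```shell
# Restart the SPIRE server first, wait until it's Ready,
# then restart the agents so they re-attest against the new server.
kubectl -n nginx-mesh delete pod -l app=spire-server
kubectl -n nginx-mesh wait --for=condition=Ready pod \
  -l app=spire-server --timeout=120s
kubectl -n nginx-mesh delete pod -l app=spire-agent
```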

Can you provide some steps so that we can reproduce it on our end?

Simple: install the mesh with Helm, then change a value and upgrade it with Helm.

@xavipanda This particular bug was opened well before we had Helm charts. What you experienced may be a different issue. If you think that's the case, please open a separate ticket with the issues you saw.

Hey @xavipanda, I do notice that error message when using Helm. It can happen if the nginx-mesh-metrics Pod comes up before SPIRE is up. Once SPIRE comes up, everything should work fine. Can you test whether metrics work? You can just run nginx-meshctl top after generating some traffic and paste the output here.
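One additional check (my suggestion, not from the maintainers): querying the aggregated API directly shows whether the kube-apiserver can reach the metrics service over TLS, which is exactly the path that fails in this issue:

```shell
# Ask the kube-apiserver to proxy a request to the aggregated metrics API.
# A 503 or an x509 error here reproduces the handshake failure above;
# a JSON APIResourceList response means the TLS path is healthy.
kubectl get --raw /apis/metrics.smi-spec.io/v1alpha1
```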