traefik/mesh

Unable to install Maesh on AWS EKS v1.17 due to a CoreDNS issue

0rax opened this issue · 7 comments

0rax commented

Bug Report

What did you do?

Installed traefik-mesh from Helm on an AWS EKS v1.17 (eks.3) cluster with Calico networking, using:

helm repo add traefik-mesh https://helm.traefik.io/mesh
helm repo update
helm install traefik-mesh traefik-mesh/traefik-mesh

What did you expect to see?

I expected the controller to start and Maesh to work.

What did you see instead?

The traefik-mesh-controller pod went into CrashLoopBackOff because its traefik-mesh-prepare container kept failing. The prepare step reports the CoreDNS version as unsupported, although CoreDNS 1.6.6 should satisfy the documented requirement (CoreDNS 1.3+).

Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  13m                   default-scheduler  Successfully assigned default/traefik-mesh-controller-5f48ff8f69-vrbd9 to xxx.compute.internal
  Normal   Pulled     11m (x5 over 13m)     kubelet            Container image "traefik/mesh:v1.4.0" already present on machine
  Normal   Created    11m (x5 over 13m)     kubelet            Created container traefik-mesh-prepare
  Normal   Started    11m (x5 over 13m)     kubelet            Started container traefik-mesh-prepare
  Warning  BackOff    2m51s (x49 over 13m)  kubelet            Back-off restarting failed container

Output of prepare container log: (traefik/mesh:v1.4.0)

2020/10/28 19:16:35 command prepare error: unable to find suitable DNS provider: unsupported CoreDNS version "1.6.6-eksbuild.1"

What is your environment & configuration (arguments, provider, platform, ...)?

  • Kubernetes version: v1.17.9-eks-a84824
  • EKS version: v1.17-eks.3
  • Calico version: v3.16.4
  • Maesh version: v1.4.0

@0rax Thanks for your interest in Traefik Mesh!

It appears that the issue comes from one of our dependencies: https://github.com/hashicorp/go-version.
Before patching the DNS configuration, we make sure the CoreDNS version is >= 1.3 and < 1.8. However, go-version constraints treat a version with a pre-release as never matching a constraint specified without a pre-release.

An issue is already open on their repository to understand why it behaves like this: hashicorp/go-version#59

Until this gets sorted out, we can replace goversion.NewConstraint(">= 1.3, < 1.8") with calls to version.GreaterThanOrEqual and version.LessThan. With this type of comparison, pre-releases are handled correctly.
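
To make the idea concrete, here is a minimal Go sketch of a direct range check, written with the standard library only. It does not use hashicorp/go-version: the `version` type and the `parse`, `compare`, and `supported` helpers are hypothetical stand-ins made up for this illustration, and the pre-release ordering is simplified (a pre-release sorts below the same version without one, and two pre-releases are compared as plain strings). It only shows why comparing against the 1.3 and 1.8 bounds directly accepts a version such as 1.6.6-eksbuild.1.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// version is a simplified, hypothetical stand-in for a parsed version.
type version struct {
	segments   [3]int // major, minor, patch (missing segments default to 0)
	prerelease string // text after the first '-', e.g. "eksbuild.1"
}

// parse splits "1.6.6-eksbuild.1" into numeric segments and a pre-release.
func parse(s string) (version, error) {
	var v version
	core := s
	if i := strings.IndexByte(s, '-'); i >= 0 {
		core, v.prerelease = s[:i], s[i+1:]
	}
	parts := strings.Split(core, ".")
	if len(parts) > 3 {
		return version{}, fmt.Errorf("malformed version %q", s)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return version{}, fmt.Errorf("malformed version %q", s)
		}
		v.segments[i] = n
	}
	return v, nil
}

// compare returns -1, 0 or 1. Numeric segments are compared first; the
// pre-release only breaks ties between otherwise equal versions.
func compare(a, b version) int {
	for i := range a.segments {
		if a.segments[i] != b.segments[i] {
			if a.segments[i] < b.segments[i] {
				return -1
			}
			return 1
		}
	}
	switch {
	case a.prerelease == b.prerelease:
		return 0
	case a.prerelease == "":
		return 1
	case b.prerelease == "":
		return -1
	default:
		return strings.Compare(a.prerelease, b.prerelease)
	}
}

var (
	minVersion = version{segments: [3]int{1, 3, 0}} // inclusive lower bound
	maxVersion = version{segments: [3]int{1, 8, 0}} // exclusive upper bound
)

// supported reports whether s lies in the range [1.3, 1.8).
func supported(s string) (bool, error) {
	v, err := parse(s)
	if err != nil {
		return false, err
	}
	return compare(v, minVersion) >= 0 && compare(v, maxVersion) < 0, nil
}

func main() {
	for _, s := range []string{"1.6.6-eksbuild.1", "1.2.9", "1.8.0"} {
		ok, err := supported(s)
		fmt.Println(s, ok, err)
	}
}
```

Because the bounds are compared segment by segment, the `-eksbuild.1` suffix never comes into play for 1.6.6: the minor segment alone places it between 1.3 and 1.8.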

0rax commented

Thank you for the quick answer. This seems like an issue that could be easily fixed.

I will try to build a custom version of the Docker image with this fix to properly check Maesh compatibility with my setup.

@0rax Could you base your changes on v1.4? Since it's a bug fix it would be great to have it on this version.
Don't hesitate to ping me if you need help on this.

0rax commented

With this patch applied on top of refs/tags/v1.4.0, I was able to start traefik-mesh successfully:

diff --git a/pkg/dns/dns.go b/pkg/dns/dns.go
index c62d46d..0416b87 100644
--- a/pkg/dns/dns.go
+++ b/pkg/dns/dns.go
@@ -39,7 +39,11 @@ const (
        traefikMeshBlockTrailer = "#### End Traefik Mesh Block"
 )
 
-var versionCoreDNS17 = goversion.Must(goversion.NewVersion("1.7"))
+var (
+       versionCoreDNS17 = goversion.Must(goversion.NewVersion("1.7"))
+       versionCoreDNS13 = goversion.Must(goversion.NewVersion("1.3"))
+       versionCoreDNS18 = goversion.Must(goversion.NewVersion("1.8"))
+)
 
 // Client holds the client for interacting with the k8s DNS system.
 type Client struct {
@@ -103,7 +107,7 @@ func (c *Client) coreDNSMatch(ctx context.Context) (bool, error) {
                return false, err
        }
 
-       if !versionConstraint.Check(version) {
+       if !(version.GreaterThanOrEqual(versionCoreDNS13) && version.LessThan(versionCoreDNS18)) {
                c.logger.Debugf("CoreDNS version is not supported, must satisfy %q, got %q", versionConstraint, version)
 
                return false, fmt.Errorf("unsupported CoreDNS version %q", version)

Quick note: I had to create a namespace myself, as the current Helm chart seems to install into the default namespace by default. This seems inconsistent with the documentation at https://doc.traefik.io/traefik-mesh/install/#verify-your-installation, which says to check the installation using the traefik-mesh namespace.


For anyone interested in how I deployed it after patching the code, I ran the following commands:

make
docker tag traefik/mesh:latest XXXXXXX.dkr.ecr.eu-west-3.amazonaws.com/traefik-mesh:v1.4.0-eks
docker push XXXXXXX.dkr.ecr.eu-west-3.amazonaws.com/traefik-mesh:v1.4.0-eks
echo "---
apiVersion: v1
kind: Namespace
metadata:
    name: traefik-mesh" | kubectl apply -f -
helm install traefik-mesh traefik-mesh/traefik-mesh \
    --set controller.image.pullPolicy=IfNotPresent \
    --set controller.image.name=XXXXXXX.dkr.ecr.eu-west-3.amazonaws.com/traefik-mesh \
    --set controller.image.tag=v1.4.0-eks \
    --namespace=traefik-mesh

@0rax This patch sounds good 👍

Could you please open a Pull Request to contribute the changes upstream? We will make sure to release a patch version on the v1.4.

Thanks again for your time on this.

0rax commented

@jspdown Just pushed it. I took the liberty of renaming the global variables to something that better matches what they do rather than what they are, and I added a test case reflecting this issue.

Closed by #774.