hashicorp/consul-k8s

Upcoming breaking changes to Consul on Kubernetes: Enabling of service mesh by default and disabling of node-level client agents from Consul Service Mesh on Kubernetes and Catalog Sync

david-yu opened this issue · 14 comments

By Q4 2022, we plan to change the default deployment of Consul on Kubernetes to align with our target use case of zero trust networking with service mesh.

The following changes will occur:

  1. Deploy Consul Service Mesh by default - connectInject.enabled and controller.enabled will both be set to true by default, which deploys the Connect Inject deployment and the CRD controller that applies CRDs on Kubernetes. If you do not use service mesh, we recommend explicitly setting those values to false to avoid undesired consequences when upgrading to the version that introduces this change.

  2. Disable client agents by default for Consul Service Mesh and Sync Catalog - Consul Service Mesh will introduce a new architecture that allows Consul to be deployed on Kubernetes without node-local agents running alongside workloads. This removes the need for gossip communication among workload nodes and for a hostPort to be exposed on Kubernetes. Instead of client agents deployed on each workload node, a new component will run alongside each pod’s Envoy sidecar and configure the proxy by retrieving its configuration from the Consul servers via xDS. The Consul servers will still operate in a dual mode, allowing traditional client agents outside of Kubernetes to join the mesh while Envoy proxies are managed by the new component. This new architecture will make it possible to deploy Consul on Kubernetes environments such as EKS Fargate and GKE Autopilot, where hostPorts are not supported. In addition, where Consul servers reside on a different network from the one used by the workloads, removing gossip communication among workload nodes simplifies network configuration because network peering between workload nodes and Consul servers is no longer needed.

    Sync Catalog will also be modified to dial the servers directly instead of using a client agent.
    For use cases outside of Sync Catalog and Service Mesh, i.e., traditional service discovery via client agents and Consul as a Vault storage backend, Consul clients can still be deployed by configuring a separate Helm config stanza.

  3. Deploy Consul on Kubernetes with 1 server replica by default, to provide a frictionless install experience on local developer environments. Setting server.replicas to 3 is considered best practice for production deployments of Consul on Kubernetes; however, it is not required for development deployments.

  4. Enable namespace mirroring by default (Enterprise only) - Both connectInject.consulNamespaces.mirroringK8S and syncCatalog.consulNamespaces.mirroringK8S will be set to true by default so that services are registered in a mirrored Consul namespace of the same name as their Kubernetes namespace, rather than in the `default` namespace.
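
For readers who want to keep their current behavior across the upgrade, the defaults listed above can be pinned explicitly in a Helm values file. The following is a minimal sketch, not an official recommendation. It uses only the value names mentioned in this issue. The file name and the chosen values (mesh disabled, three servers, mirroring off) are assumptions describing one possible non-mesh production setup.

```yaml
# values.yaml (hypothetical file name): pin settings explicitly so that the
# upgrade does not silently change them.
connectInject:
  enabled: false          # item 1: opt out of the Connect Inject deployment
  consulNamespaces:
    mirroringK8S: false   # item 4: Enterprise only; leave out otherwise
controller:
  enabled: false          # item 1: opt out of the CRD controller
server:
  replicas: 3             # item 3: production best practice; new default is 1
syncCatalog:
  consulNamespaces:
    mirroringK8S: false   # item 4: Enterprise only; leave out otherwise
```

Setting a value explicitly, even to what is the default today, means a later change of chart defaults does not alter the deployment.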

We’re excited about these changes and look forward to releasing them by Q4 2022.

Thank you,
The Consul on Kubernetes Team

After reading the second bullet point, I assume this will eventually make its way to Azure HCS? And if so, will you update the docs to indicate there will no longer be a need for network peering? What version of Consul will this probably be? 1.13.x? 1.14.x?

Hi @DaleyKD, one of our PMs on Consul Cloud has reached out to you with an update on our roadmap. We'll update here as well once we have had a chance to connect.

What about a graceful shutdown delay option for the Envoy sidecar?
Without it, the application is unable to finish its network tasks (because Envoy goes down very quickly and the network connection breaks).
Thank you!

Hi @alt-dima, this is on our roadmap for Consul 1.15, the release after the large upcoming 1.14 release happening in Q4 this year.

Does disabling agent nodes not have load/performance implications?

I thought one of the main functions of the local consul agent is to aggregate connections/requests and cache data from the server nodes to reduce load?

Will it be possible to configure the sidecar to continue fetching from a local agent instead?

Regarding point 2, it is great to remove the need for hostPort, as this is often not permitted on clusters.
However, this solution only seems to apply when using the service mesh, which makes some sense as you will already have a sidecar for Envoy.

For deployments that do not use the service mesh, will there be an architecture change to how the client agents are deployed, and to their requirement for host ports and directories?

Hi @ncouse

Removing client agents will also affect catalog sync. Clients will be disabled by default, and even if enabled, none of the consul-k8s components will use them. You can still deploy them (with hostPorts) if you wish to use them.

Disabling agents will also break Consul DNS, which is currently a k8s service pointing at all the agent pods.
Is there a new design for this too, or will that just require enabling agents again?

So if I am using "traditional" Consul, i.e. only for Service Discovery, how does that work in the new architecture?
I am not using catalog-sync or service mesh, or DNS. I am just registering services with local client agents (since that is the current architecture), obtaining host ip via downward API.
However I do want a solution that removes the need for hostPort.

Hey @hamishforbes

Consul DNS points to both server and client agents, so if you're deploying servers, that will continue to work. If you have a cluster without servers and need Consul DNS but you're not using service mesh, then you could still configure your kube DNS to point to Consul servers that you've deployed somewhere else, as described in our docs (https://developer.hashicorp.com/consul/docs/k8s/dns). If neither of those works for you, you can enable clients.

For service mesh, we're planning to have a DNS proxy running as a sidecar together with Envoy.

@ncouse

> So if I am using "traditional" Consul, i.e. only for Service Discovery, how does that work in the new architecture?
> I am not using catalog-sync or service mesh, or DNS. I am just registering services with local client agents (since that is the current architecture), obtaining host ip via downward API.
> However I do want a solution that removes the need for hostPort.

To use the new architecture, you'd need to register services directly with Consul's catalog. It's a bit easier to use catalog sync for that, since it will register services with Consul's catalog for you, unless that doesn't work for you for some reason. One caveat is that catalog sync currently doesn't support syncing k8s health check status into Consul, and even if you register services yourself, you'd still need to sync the health check status because there are no more client agents doing the health checking for you. This is something we'll be looking into fixing in the future.

Along the same lines as the previous question about "traditional" Consul, only for Service discovery.

We currently use local client agents, which our workloads use to register themselves and their health checks via logic built into the application.

Is this change going to break client agents in general, or just change the default values in the helm chart so we'll have to enable the agents?
Will our current architecture continue to work, or do we have to come up with something new?
Catalog sync is not attractive to us since we use none of the service mesh bits other than federation and our application already has logic to handle registrations, lifecycle, discovering other services in multiple datacenters, etc...

> Is this change going to break client agents in general, or just change the default values in the helm chart so we'll have to enable the agents?

We are only changing the default in this release, so you will need to explicitly enable clients for your use case. In the future, though, we may remove client agents entirely.
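
As a rough illustration of opting back in, the sketch below re-enables client agents through the Helm chart's client stanza. It assumes that client.enabled is the relevant switch. The exact stanza is not spelled out in this thread, so check the values reference for the chart version you install.

```yaml
# Assumption: client.enabled controls whether client agents are deployed.
client:
  enabled: true   # off by default after this change; set explicitly to keep agents
```

Workloads that register themselves with a node-local agent would still rely on hostPorts in this setup, as noted earlier in the thread.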

> Catalog sync is not attractive to us since we use none of the service mesh bits other than federation and our application already has logic to handle registrations, lifecycle, discovering other services in multiple datacenters, etc...

Catalog sync is not part of the service mesh automation on Kubernetes. It only helps you sync Kubernetes services into Consul and vice versa so that you don't need to register them twice in both systems. It doesn't help you inject or register the Envoy proxy that is needed for service mesh.

Thank you, everyone, for your feedback. Closing this issue, as we plan to release Consul K8s 1.0 next week, which will include these breaking changes. Please try out the new release when it's available and provide additional feedback if needed.