megaease/easemesh

[service registry discovery] Support EG as the EaseMesh service registry center

benja-wu opened this issue · 4 comments

Background

According to the MegaEase ServiceMesh requirements [1], one major duty of the Control Plane (EG-master) is to handle service registry requests. The complete service registry routine also needs the help of the Data Plane (EG-sidecar).

Proposal

Registry metadata

{
   // provided by the client in the registry request
   "serviceName": "order",
   "instanceID": "c9ecb441-bc73-49b0-9bc1-a558716825e1",
   "IP": "10.168.11.3",
   "port": "63301",

   // found in the meshService spec
   "tenant": "takeaway",

   // depends on the instance heartbeat, can be modified by API
   "status": "UP",
   // has a default value, can be modified by API
   "leases": 1929499200,
   // recorded by the system, read only
   "registryTime": 1614066694
}

The JSON struct above is the registry info of one instance of the order service in the takeaway tenant. The instance ID is a UUID. By default, its lease is valid for ten years. The port value is the listening port of the sidecar's Ingress HTTP server.
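As a rough illustration, the record could map onto a Go struct like the one below. The JSON keys are taken from the example above; the type name, the function names, and the reading of `leases` as an absolute Unix expiry time (which the sample values suggest) are assumptions, not the actual EG implementation.

```go
package main

import (
	"encoding/json"
	"time"
)

// ServiceInstance mirrors the registry record above. The JSON keys come from
// the proposal; the Go type and field names are hypothetical.
type ServiceInstance struct {
	ServiceName  string `json:"serviceName"`
	InstanceID   string `json:"instanceID"`
	IP           string `json:"IP"`
	Port         string `json:"port"`
	Tenant       string `json:"tenant"`
	Status       string `json:"status"`
	Leases       int64  `json:"leases"`
	RegistryTime int64  `json:"registryTime"`
}

// defaultLeaseYears is the ten-year default mentioned in the proposal.
const defaultLeaseYears = 10

// NewServiceInstance fills in the fields that the client does not provide:
// status starts as UP, the lease expires ten years after nowUnix (assumed to
// be an absolute Unix timestamp), and registryTime is set to nowUnix.
func NewServiceInstance(serviceName, instanceID, ip, port, tenant string, nowUnix int64) ServiceInstance {
	now := time.Unix(nowUnix, 0).UTC()
	return ServiceInstance{
		ServiceName:  serviceName,
		InstanceID:   instanceID,
		IP:           ip,
		Port:         port,
		Tenant:       tenant,
		Status:       "UP",
		Leases:       now.AddDate(defaultLeaseYears, 0, 0).Unix(),
		RegistryTime: now.Unix(),
	}
}

// EncodeInstance serializes a record into the JSON value stored in etcd.
func EncodeInstance(inst ServiceInstance) (string, error) {
	raw, err := json.Marshal(inst)
	if err != nil {
		return "", err
	}
	return string(raw), nil
}

// DecodeInstance parses a stored JSON value back into a record.
func DecodeInstance(raw string) (ServiceInstance, error) {
	var inst ServiceInstance
	err := json.Unmarshal([]byte(raw), &inst)
	return inst, err
}
```

Passing the current time in as a parameter keeps the sketch deterministic; a real caller would pass `time.Now().Unix()`.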

ETCD data layout

  • ETCD stores the tree structure of service, tenant, and instance information.
  • One tenant can have one or more service records.
  • One service should have at least one instance record.
  • One instance has one registry record and one heartbeat record.
  • So the tree layout in the etcd store looks like:
	meshServicesPrefix              = "/mesh/services/%s"               // + serviceName (its value is the basic mesh spec)
	meshServicesResiliencePrefix    = "/mesh/services/%s/resilience"    // + serviceName (its value is the mesh resilience spec)
	meshServicesCanaryPrefix        = "/mesh/services/%s/canary"        // + serviceName (its value is the mesh canary spec)
	meshServicesLoadBalancerPrefix  = "/mesh/services/%s/loadBalancer"  // + serviceName (its value is the mesh loadBalancer spec)
	meshServicesSidecarPrefix       = "/mesh/services/%s/sidecar"       // + serviceName (its value is the sidecar spec)
	meshServicesObservabilityPrefix = "/mesh/services/%s/observability" // + serviceName (its value is the observability spec)

	meshServiceInstancesPrefix          = "/mesh/services/%s/instances/%s"           // + serviceName + instanceID (its value is one instance's registry info)
	meshServiceInstancesHeartbeatPrefix = "/mesh/services/%s/instances/%s/heartbeat" // + serviceName + instanceID (its value is one instance's heartbeat info)
	meshTenantServicesListPrefix        = "/mesh/tenants/%s"                         // + tenantName (its value is the list of service names belonging to this tenant)
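
A minimal sketch of how such format strings could be turned into concrete keys. The constant and function names here are hypothetical and chosen only for this example; the key shapes follow the layout above.

```go
package main

import (
	"fmt"
	"strings"
)

// Key formats following the layout above. The constant names are
// hypothetical and local to this sketch.
const (
	serviceSpecKeyFormat       = "/mesh/services/%s"
	serviceInstanceKeyFormat   = "/mesh/services/%s/instances/%s"
	instanceHeartbeatKeyFormat = "/mesh/services/%s/instances/%s/heartbeat"
	tenantKeyFormat            = "/mesh/tenants/%s"
)

// ServiceSpecKey returns the key holding a service's mesh spec.
func ServiceSpecKey(serviceName string) string {
	return fmt.Sprintf(serviceSpecKeyFormat, serviceName)
}

// InstanceKey returns the key holding one instance's registry record.
func InstanceKey(serviceName, instanceID string) string {
	return fmt.Sprintf(serviceInstanceKeyFormat, serviceName, instanceID)
}

// HeartbeatKey returns the key holding one instance's heartbeat record.
func HeartbeatKey(serviceName, instanceID string) string {
	return fmt.Sprintf(instanceHeartbeatKeyFormat, serviceName, instanceID)
}

// TenantKey returns the key holding the service name list of a tenant.
func TenantKey(tenantName string) string {
	return fmt.Sprintf(tenantKeyFormat, tenantName)
}

// InstancesPrefix returns the prefix for listing every instance of a service.
// A prefix scan returns registry and heartbeat keys together, because the
// heartbeat key is nested under the instance key.
func InstancesPrefix(serviceName string) string {
	return ServiceSpecKey(serviceName) + "/instances/"
}

// IsHeartbeatKey tells heartbeat keys apart from registry keys in the result
// of a prefix scan.
func IsHeartbeatKey(key string) bool {
	return strings.HasSuffix(key, "/heartbeat")
}
```

Because heartbeat keys sit under the instance key, a consumer of a prefix scan has to separate the two kinds of values, which is what `IsHeartbeatKey` is for.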

Control Plane

  1. The EG-master mesh controller supports reading and deleting service registry metadata in ETCD.
  2. The EG-master mesh controller supports updating the Status and Leases fields of a registry metadata record.
  3. The EG-master mesh controller provides a statistics API for registered services, grouped by tenant.
  • How many instances does one registered service have in the mesh? Say we have a service called order with two instances, whose IDs are c9ecb441-bc73-49b0-9bc1-a558716825e1 and c9ecb441-bc73-49b0-9bc1-a55871680000:
$ ./etcdctl get "/mesh/services/order/instances" --prefix
/mesh/services/order/instances/c9ecb441-bc73-49b0-9bc1-a558716825e1
{"serviceName":"order","instanceID": "c9ecb441-bc73-49b0-9bc1-a558716825e1","IP":"10.168.11.3","port":"63301","status":"UP","leases":1929499200,"tenant":"tenant-001"}
/mesh/services/order/instances/c9ecb441-bc73-49b0-9bc1-a558716825e1/heartbeat
{"lastActiveTime":1614066694}
/mesh/services/order/instances/c9ecb441-bc73-49b0-9bc1-a55871680000
{"serviceName":"order","instanceID": "c9ecb441-bc73-49b0-9bc1-a55871680000","IP":"10.168.11.4","port":"63301","status":"UP","leases":1929499200,"tenant":"tenant-001"}
/mesh/services/order/instances/c9ecb441-bc73-49b0-9bc1-a55871680000/heartbeat
{"lastActiveTime":1614066694}

  • How many services, and how many instances of each, does one tenant have in the mesh? Say we have a tenant called tenant-001 with two services, order and address:
$ ./etcdctl get "/mesh/tenants" --prefix
tenant-001
{"desc":"this is a demo tenant","createdTime": 1614066694}
$ ./etcdctl get "/mesh/tenants/tenant-001" 
["order","address"]
  4. EG-master watches the heartbeat records of every service instance in the mesh. If no valid heartbeat record is found for an instance, EG-master sets that instance's status field to OUT_OF_SERVICE.
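
The heartbeat check can be sketched as a single sweep over the known instances. The proposal does not define what makes a heartbeat valid, so this sketch assumes that a heartbeat is valid when it exists and is no older than a timeout; the timeout parameter and all Go names are hypothetical.

```go
package main

import "sort"

const (
	StatusUp           = "UP"
	StatusOutOfService = "OUT_OF_SERVICE"
)

// Heartbeat mirrors the heartbeat record stored next to each instance.
type Heartbeat struct {
	LastActiveTime int64 `json:"lastActiveTime"`
}

// heartbeatValid assumes a heartbeat is valid when it exists and is at most
// timeoutSeconds old. The proposal does not spell out this rule.
func heartbeatValid(hb Heartbeat, found bool, nowUnix, timeoutSeconds int64) bool {
	return found && nowUnix-hb.LastActiveTime <= timeoutSeconds
}

// SweepInstances marks every instance without a valid heartbeat as
// OUT_OF_SERVICE. Both maps are keyed by instance ID; statuses is updated in
// place. It returns the sorted IDs whose status changed in this sweep.
func SweepInstances(statuses map[string]string, heartbeats map[string]Heartbeat, nowUnix, timeoutSeconds int64) []string {
	var changed []string
	for id, status := range statuses {
		hb, found := heartbeats[id]
		if status == StatusOutOfService || heartbeatValid(hb, found, nowUnix, timeoutSeconds) {
			continue
		}
		statuses[id] = StatusOutOfService
		changed = append(changed, id)
	}
	sort.Strings(changed)
	return changed
}
```

In a real controller the sweep would be driven by etcd watch events or a timer, and each change would be written back to the instance's registry record.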

Data Plane

  1. The sidecar initializes Ingress/Egress after being injected into the Pod, then keeps trying to register itself until it succeeds.
  2. EG-sidecar accepts the Eureka/Consul [2][3] service registration protocol from the business process. EG-sidecar doesn't depend on the business process's registration request.
  3. Sequence diagram: Service-Registry-Register Sequence
  4. EG-sidecar polls the business process's health API (probably with the help of JavaAgent), then reports the heartbeat to ETCD.
  5. EG-sidecar watches its own service instance registry record and the registry records of the other services it relies on. Once a record has been modified by EG-master, EG-sidecar applies the change to its corresponding EG-HTTPserver or EG-pipeline. For example, if EG-master updates one instance's status to OUT_OF_SERVICE, the sidecar deletes that record from the EG-pipeline's backend filter.
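
The effect of a status change on the sidecar's backend list can be sketched as a pure selection step. This shows only the filtering logic; the actual EG-pipeline and backend filter APIs are not part of this proposal, and the type and function names below are hypothetical.

```go
package main

import "net"

// Instance holds the fields of a registry record that matter for routing.
type Instance struct {
	InstanceID string
	IP         string
	Port       string
	Status     string
}

// ActiveBackends returns the "host:port" addresses of all instances whose
// status is UP, in input order. Instances that EG-master has marked
// OUT_OF_SERVICE are left out, so recomputing this list after every watched
// change removes them from the backend set.
func ActiveBackends(instances []Instance) []string {
	backends := make([]string, 0, len(instances))
	for _, inst := range instances {
		if inst.Status != "UP" {
			continue
		}
		backends = append(backends, net.JoinHostPort(inst.IP, inst.Port))
	}
	return backends
}
```

Recomputing the whole list from the current records, rather than applying individual deltas, keeps the sidecar correct even if it misses a watch event.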

Reference

[1] Mesh requirements: https://docs.google.com/document/d/19EiR-tyNJS75aotvLqYWjsYK7VqyjO7DCKrYjktfg-A/edit
[2] Eureka Golang registry structure: https://github.com/ArthurHlt/go-eureka-client/blob/3b8dfe04ec6ca280d50f96356f765edb845a00e4/eureka/requests.go#L38
[3] Consul catalog registry structure: https://pkg.go.dev/github.com/hashicorp/consul/api@v1.7.0#CatalogRegistration

I have a suggestion:

In the data plane, when EG-sidecar receives a registry request, it shouldn't register to etcd immediately. The sidecar just returns a successful result to the real service, whatever the result of the etcd registration is. The sidecar registration should follow a level-triggered design, i.e., it is an asynchronous registration.
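
A minimal sketch of what such a level-triggered registration could look like, assuming the sidecar keeps the desired record in memory and a background loop keeps retrying until the store matches it. `Store`, `Registrar`, and `FlakyStore` are hypothetical names; `Store` stands in for the etcd write and `FlakyStore` exists only to exercise the sketch.

```go
package main

import (
	"errors"
	"sync"
	"time"
)

// Store is a hypothetical stand-in for the etcd write path.
type Store interface {
	Put(key, value string) error
}

// Registrar keeps the desired registry record and reconciles it with the
// store in the background.
type Registrar struct {
	mu      sync.Mutex
	store   Store
	key     string
	desired string // latest record received from the business process
	written string // last record successfully written to the store
	wanted  bool   // true once a registration has been accepted
	synced  bool   // true once at least one write has succeeded
}

func NewRegistrar(store Store, key string) *Registrar {
	return &Registrar{store: store, key: key}
}

// Accept records the desired state and returns immediately, so the sidecar
// can answer the business process with success without waiting for the store.
func (r *Registrar) Accept(value string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.desired = value
	r.wanted = true
}

// Reconcile performs one level-triggered pass: it compares the desired state
// with what was last written and writes again if they differ. It returns true
// when nothing is left to do. The lock is held during the write for brevity.
func (r *Registrar) Reconcile() bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if !r.wanted || (r.synced && r.written == r.desired) {
		return true
	}
	if err := r.store.Put(r.key, r.desired); err != nil {
		return false // keep the desired state; the next pass retries
	}
	r.written, r.synced = r.desired, true
	return true
}

// Run reconciles once per interval until stop is closed.
func (r *Registrar) Run(stop <-chan struct{}, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		r.Reconcile()
		select {
		case <-stop:
			return
		case <-ticker.C:
		}
	}
}

// FlakyStore is an in-memory Store that fails a fixed number of times before
// accepting writes. It is only here to demonstrate the retry behavior.
type FlakyStore struct {
	FailuresLeft int
	Data         map[string]string
}

func (s *FlakyStore) Put(key, value string) error {
	if s.FailuresLeft > 0 {
		s.FailuresLeft--
		return errors.New("store unavailable")
	}
	if s.Data == nil {
		s.Data = make(map[string]string)
	}
	s.Data[key] = value
	return nil
}
```

The loop acts on the current difference between desired and written state rather than on the arrival of a request, so a failed or missed write is repaired on the next pass.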


ps: s/pooling/polling/g

Got it. EG-sidecar's asynchronous registration will be more resilient when the network inside a Pod becomes unstable or something else happens.

ps: replacing done.

After discussion with @xxx7xxxx and @zhao-kun, the original ETCD storage layout

	meshServicesPrefix              = "/mesh/services/%s"               // + serviceName (its value is the basic mesh spec)
	meshServicesResiliencePrefix    = "/mesh/services/%s/resilience"    // + serviceName (its value is the mesh resilience spec)
	meshServicesCanaryPrefix        = "/mesh/services/%s/canary"        // + serviceName (its value is the mesh canary spec)
	meshServicesLoadBalancerPrefix  = "/mesh/services/%s/loadBalancer"  // + serviceName (its value is the mesh loadBalancer spec)
	meshServicesSidecarPrefix       = "/mesh/services/%s/sidecar"       // + serviceName (its value is the sidecar spec)
	meshServicesObservabilityPrefix = "/mesh/services/%s/observability" // + serviceName (its value is the observability spec)

will be merged into one spec in ETCD:

	meshServicesPrefix              = "/mesh/services/%s"
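
One possible shape for the merged value, shown only as an illustration: a single record per service with the former sub-specs as optional nested fields. The struct name, the `name` key, and the use of raw JSON for the sub-specs are assumptions made for this sketch; the thread does not show the merged spec's real structure.

```go
package main

import "encoding/json"

// MeshServiceSpec is a hypothetical merged spec stored under
// /mesh/services/<serviceName>. The sub-specs are kept as raw JSON because
// their fields are not described in this thread.
type MeshServiceSpec struct {
	Name          string          `json:"name"`
	Resilience    json.RawMessage `json:"resilience,omitempty"`
	Canary        json.RawMessage `json:"canary,omitempty"`
	LoadBalancer  json.RawMessage `json:"loadBalancer,omitempty"`
	Sidecar       json.RawMessage `json:"sidecar,omitempty"`
	Observability json.RawMessage `json:"observability,omitempty"`
}

// DecodeSpec parses the single stored value into the merged spec.
func DecodeSpec(raw string) (MeshServiceSpec, error) {
	var spec MeshServiceSpec
	err := json.Unmarshal([]byte(raw), &spec)
	return spec, err
}

// EncodeSpec serializes the merged spec; sub-specs that are not set are
// omitted from the output.
func EncodeSpec(spec MeshServiceSpec) (string, error) {
	raw, err := json.Marshal(spec)
	if err != nil {
		return "", err
	}
	return string(raw), nil
}
```

With one value per service, a spec update becomes a single etcd write instead of several writes to sibling keys.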

Finished and already merged into the EG mesh branch.