AliyunContainerService/terway

tune RPS automatically to avoid imbalanced softIRQ on newly-created network interfaces

gaorong opened this issue · 3 comments

We have deployed terway in our overseas Kubernetes cluster and are migrating our legacy jobs into it, but we have hit some issues in the process. When we moved in a deployment that handles an extremely high number of simultaneously active connections and a high QPS, we found that network latency sometimes became high. After digging into the problem, we found that softIRQ load was high and concentrated almost entirely on a single CPU core.

[screenshot: softIRQ load concentrated on a single CPU core]

Since we always tune the host network parameters before a host is added to the Kubernetes cluster, this issue seemed really weird. We then found that the newly-created network interface's RPS parameter had not been touched and still had the default value 00000000, and we suspected this was the reason. So we ran a test with the interface's receive-queue rps_cpus set to 00000000 and to ffffffff separately, and got the softIRQ distribution metrics below.
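For reference, here is a minimal sketch (not part of terway; the interface name eth1 is an assumption) of how the mask can be flipped for a test by writing to sysfs. It must run as root:

```go
// rps_test_mask.go - sketch: set rps_cpus on every rx queue of one interface.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	iface := "eth1" // assumption: the terway-created interface under test
	queues, err := filepath.Glob(fmt.Sprintf("/sys/class/net/%s/queues/rx-*", iface))
	if err != nil {
		panic(err)
	}
	for _, q := range queues {
		p := filepath.Join(q, "rps_cpus")
		old, _ := os.ReadFile(p)
		// ffffffff steers received packets across the first 32 CPUs;
		// the kernel default 00000000 leaves RPS disabled.
		if err := os.WriteFile(p, []byte("ffffffff"), 0644); err != nil {
			fmt.Fprintf(os.Stderr, "write %s: %v\n", p, err)
			continue
		}
		fmt.Printf("%s: %s -> ffffffff\n", p, strings.TrimSpace(string(old)))
	}
}
```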

[chart: softIRQ distribution per CPU core with rps_cpus set to 00000000 vs ffffffff]
(Note: this metric was generated in a test cluster, so the softIRQ peak is not as high as in the production environment.)

As we can see, the RPS parameter in /sys/class/net/eth*/queues/rx-*/rps_cpus greatly affects both how softIRQ is distributed across CPU cores and overall network performance, so it should be tuned carefully for each network interface.
Since these interfaces are created by terway dynamically, and other applications can hardly detect the creation/deletion events in time, terway itself seems like the natural place for this ability; a rough sketch of the idea follows.
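As a sketch only, something like the following could watch link events via the github.com/vishvananda/netlink package (which terway already uses) and apply an RPS mask to each new interface. The "eth" name prefix filter and the hard-coded ffffffff mask are assumptions for illustration, not a proposed final design:

```go
// rps_watcher.go - sketch: apply an RPS mask whenever a new link appears.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"

	"github.com/vishvananda/netlink"
)

// applyRPS writes mask (e.g. "ffffffff") to every rx queue of the interface.
func applyRPS(ifname, mask string) error {
	queues, err := filepath.Glob(fmt.Sprintf("/sys/class/net/%s/queues/rx-*/rps_cpus", ifname))
	if err != nil {
		return err
	}
	for _, q := range queues {
		if err := os.WriteFile(q, []byte(mask), 0644); err != nil {
			return fmt.Errorf("write %s: %w", q, err)
		}
	}
	return nil
}

func main() {
	updates := make(chan netlink.LinkUpdate)
	done := make(chan struct{})
	defer close(done)
	if err := netlink.LinkSubscribe(updates, done); err != nil {
		panic(err)
	}
	for u := range updates {
		name := u.Link.Attrs().Name
		// Only touch interfaces terway creates; the "eth" prefix is an assumption.
		if !strings.HasPrefix(name, "eth") {
			continue
		}
		if err := applyRPS(name, "ffffffff"); err != nil {
			fmt.Fprintf(os.Stderr, "tune %s: %v\n", name, err)
		}
	}
}
```

In practice the mask should probably be derived from the host's actual CPU topology rather than hard-coded.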

What do you think about adding this feature to terway?

@gaorong see the details in https://help.aliyun.com/document_detail/52559.html on how to distribute the softIRQ load.

This can be set up in the ECS OS by installing the ecs_mq service: https://help.aliyun.com/document_detail/52559.html.

Closing this for now; will wait for more evidence.