cloudfoundry/silk-release

Rate Limiting causes apps to become unresponsive even after no traffic

andrew-edgar opened this issue · 11 comments

Issue

We have added the rate and burst properties to our deployment in the silk-cni job. This works well for a time, but after handling a burst of traffic some of our apps seem to stop responding.

We set rate to 100000 and burst to 300000 in the deployment properties of the silk-cni job.

This is a very high priority issue for us, as it is affecting customer applications in IBM Cloud since rate limiting was turned on in the silk-release layer. To mitigate the problem, app users must "restart" their app so that it comes up in a new container.

Context

This is cf-deployment 6.8.0 using silk-release 2.20.

This is on SoftLayer using version 250.29 of the xenial stemcell. We would be happy to provide more information if required.

Steps to Reproduce

We have a set of tests that push an app and then curl it every second. After a while the apps become unresponsive and we get 502 errors.
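
A minimal version of that probe looks something like this (a sketch; the app route is a placeholder):

#!/bin/bash
# Curl the app once per second and log the HTTP status code; 502s start
# showing up once inbound traffic stops reaching the container.
while true; do
        code=$(curl -s -o /dev/null -w '%{http_code}' https://my-app.example.com/)
        echo "$(date -u +%FT%TZ) $code"
        sleep 1
done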

Expected result

No 502 errors on apps that are running and using rate limiting.

Current result

We get 502 errors and timeouts to the app. No traffic is able to get into the containers.

Possible Fix

This is an issue with the TC plugin used for rate limiting. We see the dropped packets here:

diego-cell/e6a511ca-dabe-48ea-bae4-2f70b55b15ee:~$ tc -s -d -iec qdisc show dev s-010246217233
qdisc tbf 1: root refcnt 2 rate 100000Kibit burst 300000Kb/1 mpu 0b lat 25.0ms linklayer unspec
 Sent 23871537 bytes 170092 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc ingress ffff: parent ffff:fff1 ----------------
 Sent 29885724 bytes 346732 pkt (dropped 298702, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0

There has been NO traffic to the app for over 12 hours. But when I try to curl it from the cell I get the following ...

1) an increase in the dropped count ...
diego-cell/e6a511ca-dabe-48ea-bae4-2f70b55b15ee:~$ tc -s -d -iec qdisc show dev s-010246217233
qdisc tbf 1: root refcnt 2 rate 100000Kibit burst 300000Kb/1 mpu 0b lat 25.0ms linklayer unspec
 Sent 23871933 bytes 170098 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc ingress ffff: parent ffff:fff1 ----------------
 Sent 29886296 bytes 346743 pkt (dropped 298713, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0

2) the timeout of the curl ...
diego-cell/e6a511ca-dabe-48ea-bae4-2f70b55b15ee:/var/vcap/data/garden/depot# curl 10.246.217.75:8080
curl: (7) Failed to connect to 10.246.217.75 port 8080: Connection timed out

We have used "garden" commands to get into the container (cf ssh also does not work) and do a local curl to the 8080 endpoint, and that does work. So the application is working inside the container, but no packets are getting into it.

We have other apps running on this cell which are working fine. Rate limiting is mostly working, but we wonder whether this is a kernel issue that makes it stop working. Where can we provide more information?
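
One way to keep an eye on the counters while reproducing this (a sketch; the device name is the one from the tc output above):

# Watch the ingress qdisc counters on the container's host-side interface
# while curling the app from the cell; the drop count increases even though
# the tbf root qdisc reports nothing dropped.
watch -n1 'tc -s qdisc show dev s-010246217233 | grep -A 2 ingress'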

We had a couple of chats in the #network Slack channel, listed here with some more details ...

https://cloudfoundry.slack.com/conversation/CFX13JK7B/p1558021807216700
https://cloudfoundry.slack.com/conversation/CFX13JK7B/p1558029702229900

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/166087446

The labels on this github issue will be updated when the story is started.

h0nIg commented

We have seen similar behavior. The reason there are dropped packets is that the sibling interface, whose name is a 4-digit/letter hash, vanishes. This interface is part of the overall architecture for throttling inbound/outbound traffic; see this picture: https://github.com/cloudfoundry/silk-release/blob/master/docs/bandwidth-limiting.md . However, we have not found the real root cause of why the interface vanishes. One thought was a hash collision...

diego-cell/0a5ba503-311b-4433-ab7a-5b93713fdd79:~# ls /sys/devices/virtual/net/ | grep -v -E "(silk-vtep|lo|ifb0|ifb1)" | grep -v "^s-" | wc -l
64
diego-cell/0a5ba503-311b-4433-ab7a-5b93713fdd79:~# ls /sys/devices/virtual/net/ | grep -v -E "(silk-vtep|lo|ifb0|ifb1)" | grep "^s-" | wc -l
67
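
The counts above show 67 s- interfaces but only 64 hash-named sibling devices, i.e. three containers whose sibling has vanished. One way to check a specific container is to look at the ingress redirect on its host-side interface and see whether the target device still exists (a sketch, assuming the filter layout described in the bandwidth-limiting doc; the s- device name is an example from earlier in this issue and "4a1b" is a hypothetical sibling name):

# The ingress qdisc on the host-side interface carries a filter that
# redirects packets into the hash-named sibling device.
tc filter show dev s-010246217233 parent ffff:

# If the device named in that redirect is gone, ingress packets are dropped.
# Replace "4a1b" with the device name printed by the filter above.
ip link show 4a1b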

Hey @andrew-edgar and @h0nIg
We released a new version which includes a bump of the module we're using to configure the TC Plugin. Would you mind consuming this new version and checking if your issue still persists?

Thanks, we were just about to pick this up to see if it makes any difference.

We tried to reproduce that bug (on silk-release 2.23) and indeed found interface name collisions.
The interface names are derived from only the first 4 hex characters of a SHA-1 hash of the given input parameters, which is what causes this:
https://github.com/containernetworking/plugins/blob/0eddc554c0747200b7b112ce5322dcfa525298cf/plugins/meta/bandwidth/main.go#L114-L122
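
To illustrate why 4 hex characters are not enough, here is a rough sketch of the birthday problem (not the plugin's actual code; the "cf-network" prefix and the random IDs are stand-ins for the real input parameters):

#!/bin/bash
# Hash 490 fake container IDs, keep only the first 4 hex characters of the
# SHA-1 (as the linked code does), and print any duplicated names.
for i in $(seq 1 490); do
        id=$(openssl rand -hex 16)                # stand-in for a container ID
        printf '%s' "cf-network${id}" | sha1sum | cut -c1-4
done | sort | uniq -d
# With only 16^4 = 65536 possible names, n containers on a cell collide with
# probability roughly 1 - exp(-n*(n-1)/(2*65536)); for n = 490 that is ~84%.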

We tested the creation of 490 instances (via scaling up and down) on 2 Diego cells and could force the issue with the following script:

#!/bin/bash
# prerequisite: login to cf and bosh director
# the app name is hard coded to "my-go-app-1"


while true; do
        echo "scale up to 490 instances"
        cf scale my-go-app-1 -i 490

        while ! cf a | grep -q '490/490'; do
                sleep 2
        done
        echo "490 instances running. start connection test"

        bosh -d cf ssh diego-cell --opts="-o ConnectTimeout=10" \
        "wget -qO /tmp/check_container >/dev/null \
        https://gist.githubusercontent.com/max-soe/e0c139b82113adaeffbaafef55262c69/raw/3b0725b0fa473672c89b81495a75e5bb8618066b/connection_check_v3; \
        sudo su -c 'bash /tmp/check_container'; sudo su -c 'rm -f /tmp/check_container'" \
        2>&1 | tee -a connection_test_results.txt

        echo "scale down to one instance"
        cf scale my-go-app-1 -i 1
        echo "waiting 5 min"
        sleep 300

        echo "Finish tests. Restarting in 20 seconds"
        sleep 20
done

We repeatedly saw roughly one container unable to reach the internet. At the same time we saw the "adding link" error in the garden stdout log:

grep "adding link" /var/vcap/sys/log/garden/garden.stdout.log

To sum up:

  • The issue still exists
  • The implementation of unique interface names is not "unique" enough

Thanks, @max-soe, for providing a detailed example of how to reproduce this. We will work on it and will use the provided script as the acceptance criteria.

Thanks!
@mike1808 and @rodolfo2488, CF Networking Program Members

Upstream fix (containernetworking/plugins#353) has been merged. We're unblocked to integrate the fix into Silk.

Integrated into silk-release. It is currently running through our pipelines.

Has this been delivered yet?

Hi @andrew-edgar. Sorry for the late response. Yes, this was delivered in v2.24.0.
Please update your silk-release and, if the problem persists, reply here.

Hi all,

We are closing this issue due to inactivity. We fixed this in 2.24.0. Let us know if you are still seeing issues.

Thanks,
@ameowlia and @mike1808 , CF Networking Program Members