Rate Limiting causes apps to become unresponsive even after no traffic
andrew-edgar opened this issue · 11 comments
Issue
We have added rate and burst properties to our deployment in the silk-cni job. This works well for a time, but some of our apps stop being responsive after handling a bunch of traffic. We set rate to 100000 and burst to 300000 in the deployment properties of the silk-cni job.
This is a very high-priority issue for us, as it is affecting customer applications in IBM Cloud since rate limiting was turned on in the silk-release layer. To mitigate the problem, app users must "restart" their app so that it comes up in a new container.
Context
This is Cf-deployment 6.8.0 using silk-release 2.20.
This is on Softlayer using the 250.29 version of the xenial stemcell. We would be happy to provide more information if required.
Steps to Reproduce
We have a set of tests that push an app and then curl it every second. After a while the app becomes unresponsive and we get 502 errors.
Expected result
No 502 errors on apps that are running and using rate limiting.
Current result
We get 502 errors and timeouts to the app. No traffic is able to get into the containers.
Possible Fix
This appears to be an issue with the tc plugin used for rate limits. We see the dropped packets appearing here:
diego-cell/e6a511ca-dabe-48ea-bae4-2f70b55b15ee:~$ tc -s -d -iec qdisc show dev s-010246217233
qdisc tbf 1: root refcnt 2 rate 100000Kibit burst 300000Kb/1 mpu 0b lat 25.0ms linklayer unspec
Sent 23871537 bytes 170092 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
qdisc ingress ffff: parent ffff:fff1 ----------------
Sent 29885724 bytes 346732 pkt (dropped 298702, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
There has been NO traffic to the app for over 12 hours. But when I curl it from the cell, I see the following ...
1) an increase in the dropped count ...
diego-cell/e6a511ca-dabe-48ea-bae4-2f70b55b15ee:~$ tc -s -d -iec qdisc show dev s-010246217233
qdisc tbf 1: root refcnt 2 rate 100000Kibit burst 300000Kb/1 mpu 0b lat 25.0ms linklayer unspec
Sent 23871933 bytes 170098 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
qdisc ingress ffff: parent ffff:fff1 ----------------
Sent 29886296 bytes 346743 pkt (dropped 298713, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
2) the timeout of the curl ...
diego-cell/e6a511ca-dabe-48ea-bae4-2f70b55b15ee:/var/vcap/data/garden/depot# curl 10.246.217.75:8080
curl: (7) Failed to connect to 10.246.217.75 port 8080: Connection timed out
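As a quick way to confirm which qdisc is dropping, the counters can be pulled out of the tc output programmatically. A minimal Go sketch (our own helper, not part of silk-release), run against the sample output quoted above:

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// Sample `tc -s -d qdisc show` output copied from the cell above.
const tcOutput = `qdisc tbf 1: root refcnt 2 rate 100000Kibit burst 300000Kb/1 mpu 0b lat 25.0ms linklayer unspec
Sent 23871537 bytes 170092 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
qdisc ingress ffff: parent ffff:fff1 ----------------
Sent 29885724 bytes 346732 pkt (dropped 298702, overlimits 0 requeues 0)
backlog 0b 0p requeues 0`

// droppedCounts returns the "dropped" counter for each qdisc stanza, in order.
func droppedCounts(out string) []int {
	re := regexp.MustCompile(`dropped (\d+),`)
	var counts []int
	for _, m := range re.FindAllStringSubmatch(out, -1) {
		n, _ := strconv.Atoi(m[1])
		counts = append(counts, n)
	}
	return counts
}

func main() {
	// prints [0 298702]: the tbf (egress) qdisc drops nothing,
	// while the ingress qdisc drops everything inbound.
	fmt.Println(droppedCounts(tcOutput))
}
```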
We have used "garden" commands to get into the container (cf ssh also does not work) and do a local curl to the 8080 endpoint, and that does work. So the application is working inside the container, but no packets are getting into it.
We have other apps running on this cell which are working fine. We see that rate limiting mostly works, but we wonder whether this is a kernel issue where it stops working. Where can we provide more information?
We had a couple of chats in the #network slack listed here with some more details ...
https://cloudfoundry.slack.com/conversation/CFX13JK7B/p1558021807216700
https://cloudfoundry.slack.com/conversation/CFX13JK7B/p1558029702229900
We have created an issue in Pivotal Tracker to manage this:
https://www.pivotaltracker.com/story/show/166087446
The labels on this github issue will be updated when the story is started.
We have seen similar behavior. The reason packets are dropped is that the sibling interface, whose name ends in a 4-hex-character hash, vanishes. This interface is part of the overall architecture for throttling inbound/outbound traffic; see the diagram at https://github.com/cloudfoundry/silk-release/blob/master/docs/bandwidth-limiting.md . However, we have not found the real root cause of why the interface vanishes. One thought was a hash collision...
diego-cell/0a5ba503-311b-4433-ab7a-5b93713fdd79:~# ls /sys/devices/virtual/net/ | grep -v -E "(silk-vtep|lo|ifb0|ifb1)" | grep -v "^s-" | wc -l
64
diego-cell/0a5ba503-311b-4433-ab7a-5b93713fdd79:~# ls /sys/devices/virtual/net/ | grep -v -E "(silk-vtep|lo|ifb0|ifb1)" | grep "^s-" | wc -l
67
Hey @andrew-edgar and @h0nIg
We released a new version which includes a bump of the module we're using to configure the TC Plugin. Would you mind consuming this new version and checking if your issue still persists?
Thanks, we were just going to pick this up to see if it makes any difference.
We tried to reproduce the bug (on silk-release 2.23) and did indeed find interface name collisions.
The first 4 characters of a SHA-1 hash of the given input parameters cause this:
https://github.com/containernetworking/plugins/blob/0eddc554c0747200b7b112ce5322dcfa525298cf/plugins/meta/bandwidth/main.go#L114-L122
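With only 4 hex characters there are just 16^4 = 65536 possible names, so birthday-paradox collisions are expected after a few hundred containers. A hypothetical Go sketch demonstrating this (the exact hash input used by the plugin differs; `prefix4` is our illustration):

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// prefix4 models the naming scheme: the interface suffix is the first
// 4 hex characters of a SHA-1 hash of a per-container identifier.
func prefix4(id string) string {
	sum := sha1.Sum([]byte(id))
	return hex.EncodeToString(sum[:])[:4]
}

// findCollision generates sequential IDs until two of them map to the
// same 4-character suffix. By the pigeonhole principle this terminates
// within 65537 IDs; in practice it takes only a few hundred.
func findCollision() (string, string) {
	seen := map[string]string{}
	for i := 0; ; i++ {
		id := fmt.Sprintf("container-%d", i)
		p := prefix4(id)
		if other, ok := seen[p]; ok {
			return other, id
		}
		seen[p] = id
	}
}

func main() {
	a, b := findCollision()
	fmt.Printf("%q and %q both get interface suffix %q\n", a, b, prefix4(a))
}
```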
We tested the creation of 490 instances (via scaling up/down) on 2 Diego cells and could force the issue with the following script:
#!/bin/bash
# prerequisite: logged in to cf and to the bosh director
# the app name is hard coded to "my-go-app-1"
while true; do
  echo "scale up to 490 instances"
  cf scale my-go-app-1 -i 490
  while ! cf a | grep -q '490/490'; do
    sleep 2
  done
  echo "490 instances running. start connection test"
  bosh -d cf ssh diego-cell --opts="-o ConnectTimeout=10" \
    "wget -qO /tmp/check_container \
    https://gist.githubusercontent.com/max-soe/e0c139b82113adaeffbaafef55262c69/raw/3b0725b0fa473672c89b81495a75e5bb8618066b/connection_check_v3; \
    sudo su -c 'bash /tmp/check_container'; sudo su -c 'rm -f /tmp/check_container'" \
    2>&1 | tee -a connection_test_results.txt
  echo "scale down to one instance"
  cf scale my-go-app-1 -i 1
  echo "waiting 5 min"
  sleep 300
  echo "Finish tests. Restarting in 20 seconds"
  sleep 20
done
We repeatedly saw about one container unable to reach the internet. At the same time we saw the "adding link" error in the garden stdout log:
grep "adding link" /var/vcap/sys/log/garden/garden.stdout.log
To sum up:
- The issue still exists
- The implementation of unique interface names is not "unique" enough
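The birthday approximation puts numbers on "not unique enough": with n interfaces and a k-hex-character suffix, the collision probability is roughly 1 - exp(-n(n-1)/(2·16^k)). A small Go calculation (our own, using the instance counts from this thread):

```go
package main

import (
	"fmt"
	"math"
)

// collisionProb approximates (birthday bound) the probability that at
// least two of n interface names share the same k-hex-character suffix.
func collisionProb(n, k int) float64 {
	buckets := math.Pow(16, float64(k))
	return 1 - math.Exp(-float64(n)*float64(n-1)/(2*buckets))
}

func main() {
	// ~131 sibling interfaces were observed on the cell inspected above (64 + 67).
	fmt.Printf("k=4, n=131: %.1f%%\n", 100*collisionProb(131, 4)) // roughly 12%
	// 490 instances, as in the reproduction script, make a collision likely.
	fmt.Printf("k=4, n=490: %.1f%%\n", 100*collisionProb(490, 4)) // roughly 84%
	// A longer suffix makes collisions negligible at this scale.
	fmt.Printf("k=8, n=490: %.4f%%\n", 100*collisionProb(490, 8))
}
```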
Upstream fix (containernetworking/plugins#353) has been merged. We're unblocked to integrate the fix into Silk.
Integrated into silk. Running through our pipelines currently
Has this been delivered yet?
Hi @andrew-edgar. Sorry for the late response. Yes, this was delivered in v2.24.0.
Please update your silk-release, and if the problem persists, reply here.