GoogleCloudPlatform/distributed-load-testing-using-kubernetes

Locust master + 130 locust workers, all tests suddenly fail


I have a Locust setup running on a 3-minion cluster. While watching the CPU usage of the minions during a run with 1300 simulated users, a hatch rate of 10, and 130 workers, all tests suddenly stopped working and recorded only failures:

Type   Name       # requests   # fails   Median   Average   Min   Max   Content Size   # reqs/sec
POST   /login     0            352       0        0         0     0     0              0
POST   /metrics   0            367376    0        0         0     0     0              0
       Total      0            367728    0        0         0     0     0              0
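(For reference, the master splits the swarm evenly across the slaves: 1300 users / 130 workers = 10 clients per worker, hatched at 10 / 130 ≈ 0.0769 clients/s each, which matches the per-worker rates in the worker logs further down.)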

I expected that something had gone wrong on the machines, but all workers are running and the Locust master is accessible without any issue:


....
locust-4069721582-xfcky     1/1       Running   4          20h       172.20.52.6     razvan-kube-minion1.openstack.local
locust-4069721582-xwwa6     1/1       Running   0          1h        172.20.140.35   razvan-kube-minion2.openstack.local
locust-4069721582-y3ij0     1/1       Running   0          1h        172.20.52.46    razvan-kube-minion1.openstack.local
locust-4069721582-y93zt     1/1       Running   0          20h       172.20.50.7     razvan-kube-minion0.openstack.local
locust-4069721582-yhjce     1/1       Running   0          20h       172.20.140.17   razvan-kube-minion2.openstack.local
locust-4069721582-ynj9r     1/1       Running   0          20h       172.20.52.23    razvan-kube-minion1.openstack.local
locust-4069721582-z3yte     1/1       Running   0          1h        172.20.52.36    razvan-kube-minion1.openstack.local
locust-4069721582-z5s3r     1/1       Running   0          20h       172.20.52.20    razvan-kube-minion1.openstack.local
locust-4069721582-z9k5l     1/1       Running   0          20h       172.20.140.12   razvan-kube-minion2.openstack.local
locust-4069721582-zkn79     1/1       Running   0          20h       172.20.50.19    razvan-kube-minion0.openstack.local
locust-4069721582-zkq1l     1/1       Running   1          20h       172.20.52.12    razvan-kube-minion1.openstack.local
locust-4069721582-zr8ox     1/1       Running   0          20h       172.20.140.15   razvan-kube-minion2.openstack.local
locust-4069721582-zt6e8     1/1       Running   0          1h        172.20.50.40    razvan-kube-minion0.openstack.local
locust-4069721582-zwpu2     1/1       Running   0          20h       172.20.140.14   razvan-kube-minion2.openstack.local
locust-master-wxpwd         1/1       Running   0          21h       172.20.140.86   razvan-kube-minion2.openstack.local

I presumed that the network had some issues, so I pinged google.com and the TARGET_HOST=http://workload-simulation-webapp.appspot.com from a worker pod; both are reachable (an HTTP-level check follows the ping output below):

host-44-11-1-22:~ # kubectl exec locust-4069721582-zt6e8 -- ping -c 3 google.com 
PING google.com (74.125.133.113): 56 data bytes
64 bytes from 74.125.133.113: icmp_seq=0 ttl=42 time=104.528 ms
64 bytes from 74.125.133.113: icmp_seq=1 ttl=42 time=70.861 ms
64 bytes from 74.125.133.113: icmp_seq=2 ttl=42 time=71.639 ms
--- google.com ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max/stddev = 70.861/82.343/104.528/15.691 ms
host-44-11-1-22:~ # 

host-44-11-1-22:~ # kubectl exec locust-4069721582-zt6e8 -- ping -c 3 workload-simulation-webapp.appspot.com
PING appspot.l.google.com (74.125.133.141): 56 data bytes
64 bytes from 74.125.133.141: icmp_seq=0 ttl=42 time=14.486 ms
64 bytes from 74.125.133.141: icmp_seq=1 ttl=42 time=215.952 ms
64 bytes from 74.125.133.141: icmp_seq=2 ttl=42 time=14.178 ms
--- appspot.l.google.com ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max/stddev = 14.178/81.539/215.952/95.045 ms
host-44-11-1-22:~ # 
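Since ping only proves ICMP reachability while the recorded failures are HTTP errors, a real HTTP request from a worker is a stronger check. A minimal sketch, assuming the worker image ships Python with the requests library (a Locust dependency):

host-44-11-1-22:~ # kubectl exec locust-4069721582-zt6e8 -- python -c "import requests; r = requests.get('http://workload-simulation-webapp.appspot.com/'); print(r.status_code)"

If that also prints 503, the target app itself is refusing traffic rather than anything on the cluster side.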

The kubectl logs on a random worker report a normal status:

host-44-11-1-22:~ # kubectl logs locust-4069721582-y93zt
/usr/local/bin/locust -f /locust-tasks/tasks.py --host=http://workload-simulation-webapp.appspot.com --slave --master-host=172.20.140.86
[2016-11-28 14:18:21,053] locust-4069721582-y93zt/INFO/locust.main: Starting Locust 0.7.2
[2016-11-28 15:38:18,295] locust-4069721582-y93zt/INFO/locust.runners: Hatching and swarming 5 clients at the rate 0.1 clients/s...
[2016-11-28 15:39:08,468] locust-4069721582-y93zt/INFO/locust.runners: All locusts hatched: MetricsLocust: 5
[2016-11-28 15:39:08,469] locust-4069721582-y93zt/INFO/locust.runners: Resetting stats

[2016-11-29 08:07:14,152] locust-4069721582-y93zt/INFO/locust.runners: Hatching and swarming 5 clients at the rate 0.1 clients/s...
[2016-11-29 08:08:04,239] locust-4069721582-y93zt/INFO/locust.runners: All locusts hatched: MetricsLocust: 5
[2016-11-29 08:08:04,241] locust-4069721582-y93zt/INFO/locust.runners: Resetting stats

[2016-11-29 09:06:41,863] locust-4069721582-y93zt/INFO/locust.runners: Hatching and swarming 10 clients at the rate 0.0769231 clients/s...
[2016-11-29 09:08:52,472] locust-4069721582-y93zt/INFO/locust.runners: All locusts hatched: MetricsLocust: 10
[2016-11-29 09:08:52,504] locust-4069721582-y93zt/INFO/locust.runners: Resetting stats

[2016-11-29 10:14:32,405] locust-4069721582-y93zt/INFO/locust.runners: Hatching and swarming 10 clients at the rate 0.0769231 clients/s...
[2016-11-29 10:16:43,046] locust-4069721582-y93zt/INFO/locust.runners: All locusts hatched: MetricsLocust: 10
[2016-11-29 10:16:43,145] locust-4069721582-y93zt/INFO/locust.runners: Resetting stats

host-44-11-1-22:~ # 
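(The master's own log, e.g. kubectl logs locust-master-wxpwd, might also be worth checking for dropped slave connections, although the pod listing above shows every worker Running.)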

What could be the issue here? I've run out of ideas :)

The failures page shows the following entry:

# fails   Method   Name       Type
17        POST     /metrics   "HTTPError('503 Server Error: Service Unavailable',)"
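Since every failure is a server-side 503 rather than a connection error, capturing the response body would show why the target rejects the requests. A minimal sketch of a tasks.py using Locust's catch_response for this, with a placeholder payload (the real fields live in this repo's tasks.py):

from locust import HttpLocust, TaskSet, task


class MetricsTaskSet(TaskSet):
    @task
    def post_metrics(self):
        # Placeholder payload; replace with the fields the real tasks.py sends.
        payload = {"deviceid": "debug-device", "timestamp": "0"}
        # catch_response lets us inspect the response before marking pass/fail.
        with self.client.post("/metrics", payload, catch_response=True) as resp:
            if resp.status_code == 503:
                # Attach the start of the error page to the recorded failure.
                resp.failure("503 from target: %s" % resp.text[:200])
            else:
                resp.success()


class MetricsLocust(HttpLocust):
    task_set = MetricsTaskSet

The recorded failure message would then include the beginning of the 503 page, which usually distinguishes an overloaded or throttled App Engine app from a cluster-side problem.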