pollSpotTermination leaks sockets
itsdalmo opened this issue · 2 comments
Version: 2.0.1
(and 2.0.2
)
In https://github.com/buildkite/lifecycled/blob/master/spot.go#L38 we are doing HTTP requests in a loop and deferring calls to res.Body.Close()
. Deferred calls are not run until the function returns, which this function does not do until it is interrupted by a signal.
The result is that lsof -i
outputs the following after lifecycled
has run for a while:
lifecycle 29348 root 776u IPv4 878016 0t0 TCP ip-10-11-32-57.eu-west-1.compute.internal:36026->instance-data.eu-west-1.compute.internal:http (CLOSE_WAIT)
lifecycle 29348 root 778u IPv4 878049 0t0 TCP ip-10-11-32-57.eu-west-1.compute.internal:36038->instance-data.eu-west-1.compute.internal:http (CLOSE_WAIT)
lifecycle 29348 root 780u IPv4 878079 0t0 TCP ip-10-11-32-57.eu-west-1.compute.internal:36068->instance-data.eu-west-1.compute.internal:http (CLOSE_WAIT)
lifecycle 29348 root 781u IPv4 878650 0t0 TCP ip-10-11-32-57.eu-west-1.compute.internal:36074->instance-data.eu-west-1.compute.internal:http (CLOSE_WAIT)
lifecycle 29348 root 782u IPv4 878295 0t0 TCP ip-10-11-32-57.eu-west-1.compute.internal:36084->instance-data.eu-west-1.compute.internal:http (CLOSE_WAIT)
lifecycle 29348 root 783u IPv4 878297 0t0 TCP ip-10-11-32-57.eu-west-1.compute.internal:36088->instance-data.eu-west-1.compute.internal:http (CLOSE_WAIT)
lifecycle 29348 root 784u IPv4 878660 0t0 TCP ip-10-11-32-57.eu-west-1.compute.internal:36092->instance-data.eu-west-1.compute.internal:http (CLOSE_WAIT)
lifecycle 29348 root 785u IPv4 878670 0t0 TCP ip-10-11-32-57.eu-west-1.compute.internal:36114->instance-data.eu-west-1.compute.internal:http (CLOSE_WAIT)
lifecycle 29348 root 786u IPv4 878677 0t0 TCP ip-10-11-32-57.eu-west-1.compute.internal:36122->instance-data.eu-west-1.compute.internal:http (CLOSE_WAIT)
lifecycle 29348 root 787u IPv4 878730 0t0 TCP ip-10-11-32-57.eu-west-1.compute.internal:36136->instance-data.eu-west-1.compute.internal:http (CLOSE_WAIT)
lifecycle 29348 root 788u IPv4 878390 0t0 TCP ip-10-11-32-57.eu-west-1.compute.internal:36140->instance-data.eu-west-1.compute.internal:http (CLOSE_WAIT)
lifecycle 29348 root 789u IPv4 878415 0t0 TCP ip-10-11-32-57.eu-west-1.compute.internal:36146->instance-data.eu-west-1.compute.internal:http (CLOSE_WAIT)
lifecycle 29348 root 790u IPv4 878789 0t0 TCP ip-10-11-32-57.eu-west-1.compute.internal:36154->instance-data.eu-west-1.compute.internal:http (CLOSE_WAIT)
lifecycle 29348 root 791u IPv4 878435 0t0 TCP ip-10-11-32-57.eu-west-1.compute.internal:36176->instance-data.eu-west-1.compute.internal:http (CLOSE_WAIT)
lifecycle 29348 root 792u IPv4 878443 0t0 TCP ip-10-11-32-57.eu-west-1.compute.internal:36182->instance-data.eu-west-1.compute.internal:http (CLOSE_WAIT)
And in following causes:
Sep 04 04:46:18 ip-10-11-32-57.eu-west-1.compute.internal lifecycled[9081]: time="2018-09-04T04:46:18Z" level=info msg="Failed to query metadata service" error="Get http://169.254.169.254/latest/meta-data/spot/termination-time: dial tcp 169.254.169.254:80: socket: too many open files"
Sep 04 04:46:21 ip-10-11-32-57.eu-west-1.compute.internal lifecycled[9081]: time="2018-09-04T04:46:21Z" level=fatal msg="Failed to delete queue: RequestError: send request failed\ncaused by: Post https://sqs.eu-west-1.amazonaws.com/: dial tcp: lookup sqs.eu-west-1.amazonaws.com on 10.11.0.2:53: dial udp 10.11.0.2:53: socket: too many open files"
Solution
- Don't use
defer
inside a loop, and instead close the body explicitly after reading it. - Use the built in
ec2metadata
service in AWS SDK GO.
I can make a PR for the first one, and then a new PR if we decide to switch to number 2 - let me know what you want.
Interesting, when I look at the logs for an instance that has failed to cleanup it's queue, I see this:
Sep 4 20:55:23 ip-172-31-30-99 lifecycled: time="2018-09-04T20:55:23Z" level=info msg="Failed to query metadata service" error="Get http://169.254.169.254/latest/meta-data/spot/termination-time: dial tcp 169.254.169.254:80: socket: too many open files"
Sep 4 20:57:14 ip-172-31-30-99 lifecycled: time="2018-09-04T20:57:14Z" level=error msg="Failed to delete queue" error="RequestError: send request failed\ncaused by: Post https://sqs.us-east-1.amazonaws.com/: dial tcp: lookup sqs.us-east-1.amazonaws.com on 10.0.0.2:53: dial udp 10.0.0.2:53: socket: too many open files"
Sep 4 20:57:14 ip-172-31-30-99 lifecycled: lifecycled: error: RequestError: send request failed
Sep 4 20:57:14 ip-172-31-30-99 lifecycled: caused by: Post https://sqs.us-east-1.amazonaws.com/: dial tcp: lookup sqs.us-east-1.amazonaws.com on 10.0.0.2:53: no such host, try --help
Sep 4 20:57:14 ip-172-31-30-99 systemd: lifecycled.service: main process exited, code=exited, status=1/FAILURE
Sep 4 20:57:14 ip-172-31-30-99 systemd: Unit lifecycled.service entered failed state.
Sep 4 20:57:14 ip-172-31-30-99 systemd: lifecycled.service failed.
Perhaps this open file issue is the missing key?
Perhaps this open file issue is the missing key?
I believe so 👍