matschaffer/knife-solo

Connection reset by peer during long knife solo cook


I have a long-running knife solo cook (20-30 minutes) that performs the following installation on an AWS instance.

ERROR: Errno::ECONNRESET: Connection reset by peer - recvfrom(2)

  1. command:
    knife solo bootstrap ubuntu@<ip_address> nodes/.json
  2. recipe:
    The underlying recipe installs necessary python infrastructure from scratch

a. numpy
b. pandas
c. scipy
d. matplotlib

  3. target machine:
    On the target machine, cc1/cc1plus runs at >90% CPU for roughly 20+ minutes:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
27230 ubuntu 20 0 633540 584520 9440 R 98.5 57.5 0:45.89 cc1plus

It always results in a connection reset; after re-running, the cook completes successfully.

I see that in one thread the suggestion was to "use proxy settings". I am not sure whether that applies here, as I am running these scripts from the northeastern US and my AWS instance is in US East as well.

I would appreciate any recommendations on alternative approaches/settings to avoid this issue.

Have you tried ssh-level keepalive settings?

For example, I keep this in my ~/.ssh/config:

Host *
  ServerAliveInterval 30
  ServerAliveCountMax 5

I suspect it's not knife-solo in particular but just any long-lived idle ssh connection would get cut in your setup.
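For reference, those two settings together determine how long an unresponsive connection survives before the client gives up. A quick sanity check of that window, using the values from the config above:

```shell
# ServerAliveInterval=30: the client sends a keepalive probe every 30 seconds.
# ServerAliveCountMax=5: the client gives up after 5 consecutive unanswered probes.
interval=30
count=5
# An unresponsive connection is therefore dropped after roughly:
echo "$((interval * count)) seconds"
```

If an intermediate firewall or NAT silently drops idle connections sooner than that, the reset can still happen; the probes mainly keep the connection from looking idle in the first place.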

Thanks for the quick reply and suggestion.

Even after trying the above options, it produces exactly the same behavior.
a. Is it safe to increase ServerAliveInterval?
b. Is there a better way to install the Python scientific stack (numpy, scipy, etc.)?

Should be fine to increase, though it may not solve the problem.

The python scientific stack should have fairly recent versions available in package repos for major OSes.
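For example, on Ubuntu the whole stack can usually be pulled pre-built from the distro archives, skipping the long compile entirely. A sketch (package names are the Ubuntu ones of this era and may differ on other releases):

```shell
# Install pre-built scientific Python packages from the distro archives
# instead of compiling from source; no cc1plus run, so no long idle SSH session.
sudo apt-get update
sudo apt-get install -y python-numpy python-scipy python-pandas python-matplotlib
```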

Or if you need your own you can build them once and package using a tool like https://github.com/jordansissel/fpm
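A sketch of the fpm route, assuming fpm and a build toolchain are available on a dedicated build box (the output filenames depend on the versions you build):

```shell
# Build each library once on a build machine...
gem install fpm
fpm -s python -t deb numpy     # emits something like python-numpy_<version>_amd64.deb
fpm -s python -t deb pandas

# ...then install the resulting .debs on target machines without compiling:
sudo dpkg -i python-numpy_*.deb python-pandas_*.deb
```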

If you can install via a pre-compiled package the provisioning should be a lot faster which may avoid the need to keep SSH open & idle.

Finally, if that's not an option, you may want to investigate the server's sshd settings (e.g., that TCPKeepAlive is turned on and the ClientAlive* values aren't set too short), and then any firewall settings between you and the server (since one commenter talks about it here).
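On the server side, a quick way to see what sshd is configured to do (the path and suggested values are the usual defaults, not guaranteed for every setup):

```shell
# Check the server-side keepalive settings in sshd's config.
grep -Ei 'TCPKeepAlive|ClientAliveInterval|ClientAliveCountMax' /etc/ssh/sshd_config
# You generally want TCPKeepAlive yes, and either ClientAliveInterval 0
# (server-side probing disabled) or ClientAliveInterval * ClientAliveCountMax
# well above your longest idle stretch.
```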

I don't have experience with either, but in your opinion, which option would work better from a scalability/flexibility perspective?

a. packer image: perform the pip installs on an instance, save the AMI, and use it for further deployments via chef-solo
b. fpm package: download all packages locally and build the packages upfront

I tend to prefer a combination of both. The process goes something like this:

  1. use distro-provided packages when possible
  2. use fpm (or other tool) to create packages that aren't available via the distro
  3. start from your distro and use a CM tool like chef to get all the packages installed and minimally configured
  4. package the result as an AMI for deployment
  5. use a runtime CM tool (zookeeper, etcd, consul, etc) to handle environment specific configuration during app start/run
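For step 4, a minimal packer template might look like this. This is only a sketch: the AMI ID, instance type, and `python_stack` recipe name are placeholders you would replace with your own.

```json
{
  "builders": [{
    "type": "amazon-ebs",
    "region": "us-east-1",
    "source_ami": "ami-xxxxxxxx",
    "instance_type": "m3.medium",
    "ssh_username": "ubuntu",
    "ami_name": "python-stack-{{timestamp}}"
  }],
  "provisioners": [{
    "type": "chef-solo",
    "cookbook_paths": ["cookbooks"],
    "run_list": ["recipe[python_stack]"]
  }]
}
```

Running `packer build` on a template like this performs steps 3 and 4 in one pass: chef converges the instance, then packer snapshots it as an AMI.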

Though I see many people skip steps 4 & 5, since many of the rewards of AMI-based deployments only come once you're using autoscaling groups, and step 5 requires additional infrastructure.