matschaffer/knife-solo

Connection reset by peer during long knife solo cook


I have a long-running knife solo cook (20-30 minutes) that performs the following installation on an AWS instance.

ERROR: Errno::ECONNRESET: Connection reset by peer - recvfrom(2)

  1. command:
    knife solo bootstrap ubuntu@<ip_address> nodes/.json
  2. recipe:
    The underlying recipe installs necessary python infrastructure from scratch

a. numpy
b. pandas
c. scipy
d. matplotlib

  3. target machine:
    On the target machine, cc1/cc1plus runs at >90% CPU for roughly 20+ minutes:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
27230 ubuntu 20 0 633540 584520 9440 R 98.5 57.5 0:45.89 cc1plus

It always results in a connection reset; after re-running, the cook completes successfully.

I see that in one thread the suggestion was to "use proxy settings". I am not sure whether that applies here, as I am running these scripts from the northeastern US and my AWS instance is in US East as well.

I would appreciate any recommendations on alternative approaches/settings to avoid this issue.

Have you tried ssh-level keepalive settings?

For example, I keep this in my ~/.ssh/config:

Host *
  ServerAliveInterval 30
  ServerAliveCountMax 5

I suspect it's not knife-solo in particular but just any long-lived idle ssh connection would get cut in your setup.
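For reference, those two settings together determine how long an unresponsive connection survives before the client gives up. A quick sanity check of that window, using the values from the config above:

```shell
# ServerAliveInterval=30: the client sends a keepalive probe every 30 seconds.
# ServerAliveCountMax=5: the client gives up after 5 consecutive unanswered probes.
interval=30
count=5
# An unresponsive connection is therefore dropped after roughly:
echo "$((interval * count)) seconds"
```

If an intermediate firewall or NAT silently drops idle connections sooner than that, the reset can still happen; the probes mainly keep the connection from looking idle in the first place.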

Thanks for the quick reply and suggestion.

Even after trying the above options, it produces exactly the same behavior.
a. Is it safe to increase ServerAliveInterval?
b. Is there a better way to install the Python scientific stack (numpy, scipy, etc.)?

Should be fine to increase, though it may not solve the problem.

The python scientific stack should have fairly recent versions available in package repos for major OSes.
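For example, on Ubuntu the whole stack can usually be pulled pre-built from the distro archives, skipping the long compile entirely. A sketch (package names are the Ubuntu ones of this era and may differ on other releases):

```shell
# Install pre-built scientific Python packages from the distro archives
# instead of compiling from source; no cc1plus run, so no long idle SSH session.
sudo apt-get update
sudo apt-get install -y python-numpy python-scipy python-pandas python-matplotlib
```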

Or if you need your own you can build them once and package using a tool like https://github.com/jordansissel/fpm
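A sketch of the fpm route, assuming fpm and a build toolchain are available on a dedicated build box (the output filenames depend on the versions you build):

```shell
# Build each library once on a build machine...
gem install fpm
fpm -s python -t deb numpy     # emits something like python-numpy_<version>_amd64.deb
fpm -s python -t deb pandas

# ...then install the resulting .debs on target machines without compiling:
sudo dpkg -i python-numpy_*.deb python-pandas_*.deb
```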

If you can install via a pre-compiled package the provisioning should be a lot faster which may avoid the need to keep SSH open & idle.

Finally, if that's not an option, you may want to investigate the server's sshd settings (e.g., that TCPKeepAlive is turned on and the ClientAlive* values aren't set too short), and then any firewall settings between you and the server (since one commenter talks about it here).
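On the server side, a quick way to see what sshd is configured to do (the path and suggested values are the usual defaults, not guaranteed for every setup):

```shell
# Check the server-side keepalive settings in sshd's config.
grep -Ei 'TCPKeepAlive|ClientAliveInterval|ClientAliveCountMax' /etc/ssh/sshd_config
# You generally want TCPKeepAlive yes, and either ClientAliveInterval 0
# (server-side probing disabled) or ClientAliveInterval * ClientAliveCountMax
# well above your longest idle stretch.
```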

I don't have experience with either, but in your opinion, which option would work better from a scalability/flexibility perspective?

a. packer image: perform the pip installs on an instance, save the AMI, and use it for further deployments via chef-solo
b. fpm package: download all packages locally and build the packages upfront

I tend to prefer a combination of both. The process goes something like this:

  1. use distro-provided packages when possible
  2. use fpm (or other tool) to create packages that aren't available via the distro
  3. start from your distro and use a CM tool like chef to get all the packages installed and minimally configured
  4. package the result as an AMI for deployment
  5. use a runtime CM tool (zookeeper, etcd, consul, etc) to handle environment specific configuration during app start/run
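For step 4, a minimal packer template might look like this. This is only a sketch: the AMI ID, instance type, and `python_stack` recipe name are placeholders you would replace with your own.

```json
{
  "builders": [{
    "type": "amazon-ebs",
    "region": "us-east-1",
    "source_ami": "ami-xxxxxxxx",
    "instance_type": "m3.medium",
    "ssh_username": "ubuntu",
    "ami_name": "python-stack-{{timestamp}}"
  }],
  "provisioners": [{
    "type": "chef-solo",
    "cookbook_paths": ["cookbooks"],
    "run_list": ["recipe[python_stack]"]
  }]
}
```

Running `packer build` on a template like this performs steps 3 and 4 in one pass: chef converges the instance, then packer snapshots it as an AMI.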

Though I see many people skip steps 4 & 5, since many of the rewards of AMI-based deployments only come once you're using autoscaling groups, and step 5 requires additional infrastructure.