azukiapp/azk

Azk agent not running

Closed this issue · 19 comments

I have a few running services inside azk vm and a few azk shells started, which I can access.

The HTTP load balancer stopped working (connection reset by peer), and azk status says that the agent is not running. However, the containers are still up and running. Any ideas? In the last few days I have had a lot of problems with azk (the last 2 versions, 0.14.4 and 0.15.0); the VM stopped completely on system sleep twice.

Before, on azk 0.12.1 everything was fine.

Are there any logs I can paste here ?

$ azk status
? The agent is not running, would you like to start it? No
azk: azk agent is required but is not running (try `azk agent status`)

Hi @teodor-pripoae ! Thanks for your feedback!

Sorry about this. The root cause is most likely a routine inside azk that checks whether the Docker daemon is up and, if it is not, stops the azk agent.

This was included in #479 and #493.

The goal of those PRs was exactly the opposite: once the agent's components weren't working properly, the agent should shut itself down and give the user a chance to bring it up again, instead of leaving them with weird error messages.

For now, we recommend running azk agent with the --no-daemon option in a separate terminal tab. This way, if the agent stops, you will notice and can start it again.

We'll prioritize fixing this.

Hi,

Thanks for your fix! What is the recommended way to do it?

# Before I was doing:
$ azk agent start
$ azk start ....

# Now like this?
$ azk agent --no-daemon # will this keep the agent in the foreground?
$ azk start ...

Is this related to the bug that caused high CPU usage in the VM when monitoring Docker? That one seems to be fixed, since after upgrading my VM uses under 20% CPU with 31 services running, but it keeps stopping.

Can this issue be related to running a lot of services? Does it hit some timeout when checking each service from Docker?

@fearenales in fact the current approach isn't the best. There are situations in which VirtualBox can suspend the VM for a while or the Docker service can go down.

The flow should be changed to:

  • Check the current VM state: whether it is running or VirtualBox has suspended it;
  • Periodically check the Docker service's integrity and restart it if it's down;
  • If the Docker service cannot be restarted, restart the VM;
  • If none of the actions above succeeds, the agent stops with a failure.

Important:

  • Currently, azk agent is a monolithic block. We would need to be able to restart individual components of the agent, and to make the agent itself a subsystem monitor (our idea for the future);
  • Restarting Docker is especially complex: some components of azk agent run inside Docker, and those must also be restarted after Docker is restarted;
  • After Docker is restarted, all applications started with azk will be down. This is not a real problem, given that Docker has just failed, but it's important for the user to know they are down for this reason and not because of an internal error.
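The escalation order in the list above could be sketched as a small pure function. This is only an illustration under assumptions: `nextAction`, the state fields, and the action names are hypothetical and do not come from the azk codebase, and the real agent would still need the probes (VM state, Docker health) and the restart logic around it.

```javascript
// Hypothetical sketch of the proposed escalation order; not azk code.
// The caller is assumed to gather `state` from its own VM/Docker probes
// and to record whether earlier recovery attempts have already failed.
function nextAction(state) {
  const { vmRunning, dockerUp, dockerRestartFailed, vmRestartFailed } = state;

  // Everything is healthy: nothing to do until the next periodic check.
  if (vmRunning && dockerUp) {
    return 'none';
  }

  // The VM is up but Docker is down: try the cheapest fix first.
  if (vmRunning && !dockerRestartFailed) {
    return 'restart-docker';
  }

  // Docker could not be restarted, or the VM itself is not running.
  if (!vmRestartFailed) {
    return 'restart-vm';
  }

  // Nothing worked: the agent stops with a failure.
  return 'stop-agent';
}

// Example: Docker is down and one restart attempt has already failed.
console.log(nextAction({ vmRunning: true, dockerUp: false, dockerRestartFailed: true }));
```

Keeping the decision separate from the side effects makes the order easy to test, and it leaves room for the agent to tell the user why their applications went down after a Docker or VM restart.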

@nuxlli Is there a place where I can patch out these checks until a fix is released?

I never restart the Docker service inside the VM, and I can wait a little after I wake the system from sleep.

I don't know if these checks already existed in 0.12.1, but I never had a problem with Docker restarting suddenly. I guess these checks are needed for the Linux version of azk, but where can I remove them temporarily until the next release?

@teodor-pripoae Yes, --no-daemon will keep the agent in the foreground, so you'll need to use another terminal to run azk start.

Yes, we've fixed the high CPU usage in azk v0.14.5, but the main issue is the poor check strategy pointed out by @nuxlli.

@teodor-pripoae you can apply the following patch:

  • Add a new config (call it monitor) in https://github.com/azukiapp/azk/blob/master/src/config.js#L86-L99, tied to an env var (call it AZK_DOCKER_MONITOR) whose default value should be true if the current OS is Linux and false otherwise. To check the OS, import os (like this) in the config.js file and use os.platform() === 'linux' as the default value for that config;
  • Wrap this code with an if statement checking whether config('docker:monitor') is true.
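The two steps above could look roughly like this. It is a standalone sketch, not the actual patch: `dockerMonitorEnabled`, `settings`, the stand-in `config` function, and `maybeStartMonitor` are hypothetical, and the way the env var string is parsed here is an assumption (azk's config.js has its own helpers for reading environment variables).

```javascript
const os = require('os');

// Hypothetical helper: resolve the monitor setting from AZK_DOCKER_MONITOR,
// defaulting to true on Linux and false elsewhere when the variable is unset.
function dockerMonitorEnabled(env = process.env, platform = os.platform()) {
  const raw = env.AZK_DOCKER_MONITOR;
  if (raw === undefined || raw === '') {
    return platform === 'linux';
  }
  // Assumed parsing rule: a few common "off" spellings disable the monitor.
  return !['0', 'false', 'no', 'off'].includes(String(raw).toLowerCase());
}

// Stand-in for azk's config lookup, only so that the sketch runs on its own.
const settings = { 'docker:monitor': dockerMonitorEnabled() };
function config(key) {
  return settings[key];
}

// The existing Docker check would be wrapped like this.
function maybeStartMonitor(startMonitor) {
  if (config('docker:monitor')) {
    startMonitor();
    return true;
  }
  return false;
}

maybeStartMonitor(() => console.log('docker monitor started'));
```

With this shape, exporting AZK_DOCKER_MONITOR=false before starting the agent would skip the check on any platform, while Linux keeps it by default.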

After doing this, you should be able to run make and use the azk executable placed in the bin directory of the azk project dir (it's a good idea to create a new alias for it and use that instead of the normal azk in the meantime).

This should work, and it would be awesome if you could send your solution as a Pull Request!

If you have any issues or concerns, please let me know!

Cool ! Thank you, I will try this in a few hours and submit a PR if it works :)

Hey @teodor-pripoae , just checked out your PR. Great job! Did that solve your problem?

Yes, azk hasn't stopped the VM yet, so it seems to work. :)

I've just run my batch of tests and everything seems ok.

@fearenales

Btw, do you know why the test suite gives me this error? Do I need Linux?

$ azk nvm npm test

> azk@0.15.0 test /Users/toni/code/gh/azk
> make test

task: test
/Users/toni/code/gh/azk/bin/azk nvm gulp test  ""
[16:48:30] Using gulpfile ~/code/gh/azk/gulpfile.js
make: *** [test] Error 1
npm ERR! Test failed.  See above for more details.

@teodor-pripoae Use this to run the test suite:

$ azk nvm gulp test --slow

I'm sorry for not telling you before.

Thanks!

Everything is green, except port binding (I already had the agent service bound to that port) and file syncing. But I guess the problems are elsewhere. I will investigate later; the main bug hasn't happened for 12 hours, so I guess it's OK.

  375 passing (2m)
  1 pending
  4 failing

  1) Azk docker module, run method @slow should support bind ports:
     Error: HTTP code is 500 which indicates error: server error - Cannot start container 9d1ebc5102895ccfdc2dbef55883b986b1e7ffdd19d67449baa3fdb4bf46de40: Error starting userland proxy: listen tcp 0.0.0.0:32777: bind: address already in use

      at _stream_readable.js:944:16

  2) Azk sync, Worker module should not include content patterns files from except_from option:
     AssertionError: expected '/private/var/folders/f5/bdtz6z3n4ns8mpkp1npyhj6c0000gn/T/azk-test-54492228zn1s/bar/Fred.txt' not to match /bar\/Fred.txt/
      at Test.callee$1$1$ (/azk:0.15.0/spec/sync/worker_spec.js:216:27)

  3) Azk sync, Worker module should exclude the .gitignore content for default:
     AssertionError: expected '/private/var/folders/f5/bdtz6z3n4ns8mpkp1npyhj6c0000gn/T/azk-test-54492n1w3zrk/ignored/Fred.txt' not to match /ignored\/Fred.txt/
      at Test.callee$1$2$ (/azk:0.15.0/spec/sync/worker_spec.js:241:27)

  4) Azk sync, Worker module should exclude the .syncignore content for default in preference to .gitignore:
     AssertionError: expected '/private/var/folders/f5/bdtz6z3n4ns8mpkp1npyhj6c0000gn/T/azk-test-54492cqjv6qo/foo/Moe.txt' to match /ignored\/Fred.txt/
      at Test.callee$1$3$ (/azk:0.15.0/spec/sync/worker_spec.js:267:23)

Hmm... which version of rsync are you using? Those errors are odd.

Running on OSX. I didn't stop my services or VM when running the tests; is this required?

$ rsync --version
rsync  version 3.1.1  protocol version 31
Copyright (C) 1996-2014 by Andrew Tridgell, Wayne Davison, and others.
Web site: http://rsync.samba.org/
Capabilities:
    64-bit files, 64-bit inums, 64-bit timestamps, 64-bit long ints,
    socketpairs, hardlinks, symlinks, IPv6, batchfiles, inplace,
    append, ACLs, xattrs, iconv, symtimes, no prealloc, file-flags

The same thing happened on our CI box. I'm going to dig deeper and ping you back ASAP.

I ran it again and now there is only one error. It seems to be intermittent.

  1) Azk sync, Worker module should not include content patterns files from except_from option:
     AssertionError: expected '/private/var/folders/f5/bdtz6z3n4ns8mpkp1npyhj6c0000gn/T/azk-test-562383t7d56o/ignored/Fred.txt' not to match /ignored\/Fred.txt/
      at Test.callee$1$1$ (/azk:0.15.0/spec/sync/worker_spec.js:215:27)

Yes, I've run it again on the CI box and everything passed... It's an intermittent error :/
Well, don't worry, I don't think your changes introduced that, but I'll take a closer look at this when I have a chance.

Thank you very much for your PR, we really appreciate it.

Thank you for your help, too :)