coreos/fleet

Sporadic fleet launch errors: Unit failed to load: No such file or directory

tleyden opened this issue · 1 comment

I'm running CoreOS alpha 612.1.0 and launching / destroying units via the fleet REST API. I'm seeing sporadic failures where units fail to start.

Here's a full walkthrough of what I'm doing to reproduce the issue:

Start units

  • Launch a 3 node cluster on ec2 using this cloudformation template
  • Launch fleet units by running sudo docker run --net=host tleyden5iwx/couchbase-cluster-go update-wrapper couchbase-fleet launch-cbs --version 3.0.1 --num-nodes 3 --userpass "user:passw0rd". This dynamically generates fleet units from templates, then submits them via the fleet API.

At this point, my journalctl -b -u fleet.service --no-pager logs are:

Stop + destroy units

This stops and destroys all units; it's essentially the equivalent of fleetctl stop * && fleetctl destroy *:

sudo docker run --net=host tleyden5iwx/couchbase-cluster-go update-wrapper couchbase-fleet stop --all-units && sudo docker run --net=host tleyden5iwx/couchbase-cluster-go update-wrapper couchbase-fleet destroy --all-units
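For anyone who wants to try the same teardown without the Docker wrapper, here is a rough sketch using plain fleetctl. The helper name all_unit_names is made up for illustration, and it assumes the listing has a single header line followed by one unit per row, as in the outputs in this report. This is not what couchbase-fleet does internally (it talks to the fleet API); it only approximates the effect.

```shell
# Print the unit name (first column) of every row in a fleetctl listing
# read from stdin. Assumption: line 1 is the column header.
all_unit_names() {
  awk 'NR > 1 && NF > 0 { print $1 }'
}

# Hypothetical usage on a cluster node (not run here):
#   units=$(fleetctl list-unit-files | all_unit_names)
#   [ -n "$units" ] && fleetctl stop $units && fleetctl destroy $units
```

The names are passed unquoted on purpose so that each unit becomes its own argument.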

Verify everything is clean:

$ fleetctl list-units
UNIT    MACHINE ACTIVE  SUB
$ fleetctl list-unit-files
UNIT    HASH    DSTATE  STATE   TARGET
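The "header only" check above could be scripted roughly like this when repeating the cycle many times. The function name listing_is_empty is hypothetical, and it assumes that an empty cluster always prints exactly the header line, as it does here.

```shell
# Succeed (exit 0) only if the fleetctl listing on stdin has no rows
# after the header line. Assumption: line 1 is always the column header.
listing_is_empty() {
  awk 'NR > 1 && NF > 0 { found = 1 } END { exit found ? 1 : 0 }'
}

# Hypothetical usage on a cluster node (not run here):
#   fleetctl list-units | listing_is_empty &&
#     fleetctl list-unit-files | listing_is_empty &&
#     echo "cluster is clean"
```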

At this point, my journalctl -b -u fleet.service --no-pager logs are:

Restart units

Run the same command as earlier to kick things off: sudo docker run --net=host tleyden5iwx/couchbase-cluster-go update-wrapper couchbase-fleet launch-cbs --version 3.0.1 --num-nodes 3 --userpass "user:passw0rd"

The bug didn't reproduce this time, but after repeating the Stop + destroy units and Restart units steps three times (third time's a charm!) I was able to reproduce it.

Fleet units:

$ fleetctl list-units
UNIT                            MACHINE                     ACTIVE  SUB
couchbase_node@1.service        8995d6d7.../10.156.7.12     active  running
couchbase_node@2.service        ad8cb97d.../10.239.174.35   active  running
couchbase_node@3.service        cc2b61a5.../10.141.247.11   active  running
couchbase_sidekick@1.service    8995d6d7.../10.156.7.12     failed  failed
couchbase_sidekick@2.service    ad8cb97d.../10.239.174.35   active  running
couchbase_sidekick@3.service    cc2b61a5.../10.141.247.11   failed  failed
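To spot the failed units without reading the table by eye, a small filter along these lines may help. The function name failed_units is hypothetical, and it assumes the column order shown above (UNIT, MACHINE, ACTIVE, SUB) with no whitespace inside any field.

```shell
# Print the names of units whose ACTIVE column (third field) is "failed".
# Assumption: line 1 is the header and columns are UNIT MACHINE ACTIVE SUB.
failed_units() {
  awk 'NR > 1 && $3 == "failed" { print $1 }'
}

# Hypothetical usage on a cluster node (not run here):
#   fleetctl list-units | failed_units
```

Fed the listing above, this would print couchbase_sidekick@1.service and couchbase_sidekick@3.service.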

Unit files:

$ fleetctl list-unit-files
UNIT                            HASH     DSTATE    STATE     TARGET
couchbase_node@1.service        97550db  launched  launched  8995d6d7.../10.156.7.12
couchbase_node@2.service        97550db  launched  launched  ad8cb97d.../10.239.174.35
couchbase_node@3.service        97550db  launched  launched  cc2b61a5.../10.141.247.11
couchbase_sidekick@1.service    3c37fd3  launched  launched  8995d6d7.../10.156.7.12
couchbase_sidekick@2.service    f438171  launched  launched  ad8cb97d.../10.239.174.35
couchbase_sidekick@3.service    5c1369d  launched  launched  cc2b61a5.../10.141.247.11

Journalctl logs:

Analyzing the logs

On the machine whose IP ends in .11 (10.141.247.11), which hosts the failed couchbase_sidekick@3.service, there is an error:

ERROR manager.go:136: Failed to trigger systemd unit couchbase_sidekick@3.service start: Unit couchbase_sidekick@3.service failed to load: No such file or directory.

Likewise, on the machine whose IP ends in .12 (10.156.7.12), which hosts the failed couchbase_sidekick@1.service, there is a matching error:

ERROR manager.go:136: Failed to trigger systemd unit couchbase_sidekick@1.service start: Unit couchbase_sidekick@1.service failed to load: No such file or directory.
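To pull the affected unit names out of a long fleet log on each machine, something like the following could be used. The function name units_failed_to_load is hypothetical, and the pattern assumes the error text keeps exactly the wording of the two lines quoted above.

```shell
# Read fleet log lines on stdin and print, de-duplicated, the name of each
# unit that systemd reported as "failed to load: No such file or directory".
# Assumption: the message wording matches the log lines quoted in this report.
units_failed_to_load() {
  sed -n 's/.*Unit \([^ ]*\) failed to load: No such file or directory.*/\1/p' | sort -u
}

# Hypothetical usage on a cluster node (not run here):
#   journalctl -b -u fleet.service --no-pager | units_failed_to_load
```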