Sporadic fleet launch errors: Unit failed to load: No such file or directory
tleyden opened this issue · 1 comments
I'm running CoreOS alpha 612.1.0 and launching / destroying units via the fleet REST API. I'm seeing sporadic issues where units fail to start.
Here's a full walkthrough of what I'm doing to reproduce the issue:
Start units
- Launch a 3-node cluster on EC2 using this CloudFormation template
- Launch fleet units by running
sudo docker run --net=host tleyden5iwx/couchbase-cluster-go update-wrapper couchbase-fleet launch-cbs --version 3.0.1 --num-nodes 3 --userpass "user:passw0rd"
This dynamically generates fleet units based on templates, then submits them via the fleet API.
At this point, my journalctl -b -u fleet.service --no-pager logs are:
- http://filebin.ca/1uTCRimrs9g9/fleet_logs_post_start_machine_11.text
- http://filebin.ca/1uTCVmR1nmrx/fleet_logs_post_start_machine_35.text
- http://filebin.ca/1uTCYHIMDfir/fleet_logs_post_start_machine_12.text
Stop + destroy units
This stops and destroys all units. It's essentially the equivalent of fleetctl stop * && fleetctl destroy *:
sudo docker run --net=host tleyden5iwx/couchbase-cluster-go update-wrapper couchbase-fleet stop --all-units && sudo docker run --net=host tleyden5iwx/couchbase-cluster-go update-wrapper couchbase-fleet destroy --all-units
Verify everything is clean:
$ fleetctl list-units
UNIT MACHINE ACTIVE SUB
$ fleetctl list-unit-files
UNIT HASH DSTATE STATE TARGET
At this point, my journalctl -b -u fleet.service --no-pager logs are:
- http://filebin.ca/1uTDrlbj6hIn/fleet_logs_post_stop_machine_11.text
- http://filebin.ca/1uTDsOK3i2sE/fleet_logs_post_stop_machine_35.text
- http://filebin.ca/1uTDtx6tDegc/fleet_logs_post_stop_machine_12.text
Restart units
Run the same command as earlier to kick things off: sudo docker run --net=host tleyden5iwx/couchbase-cluster-go update-wrapper couchbase-fleet launch-cbs --version 3.0.1 --num-nodes 3 --userpass "user:passw0rd"
This didn't reproduce the bug the first time, but after repeating the "Stop + destroy units" and "Restart units" steps three times (third time's a charm!), I was able to reproduce it.
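Since the failure is sporadic, the stop/destroy/relaunch cycle can be scripted. The following is only a rough sketch: the function names (launch_units, teardown_units, count_failed, repro_cycle) and the SETTLE_SECONDS wait are hypothetical and not part of my tooling, and it assumes the ACTIVE state is the third column of the fleetctl list-units output.

```shell
#!/bin/sh
# Hypothetical helpers wrapping the commands from this report.
launch_units() {
  sudo docker run --net=host tleyden5iwx/couchbase-cluster-go update-wrapper \
    couchbase-fleet launch-cbs --version 3.0.1 --num-nodes 3 --userpass "user:passw0rd"
}

teardown_units() {
  sudo docker run --net=host tleyden5iwx/couchbase-cluster-go update-wrapper \
    couchbase-fleet stop --all-units &&
  sudo docker run --net=host tleyden5iwx/couchbase-cluster-go update-wrapper \
    couchbase-fleet destroy --all-units
}

# Count rows whose third column (ACTIVE) reads "failed"; prints 0 if none.
count_failed() {
  fleetctl list-units | awk '$3 == "failed" { n++ } END { print n + 0 }'
}

# Repeat teardown + launch until a failed unit shows up or max_cycles is hit.
repro_cycle() {
  max_cycles="$1"
  i=1
  while [ "$i" -le "$max_cycles" ]; do
    teardown_units
    launch_units
    # Assumed settle time so the units can reach a final state.
    sleep "${SETTLE_SECONDS:-60}"
    if [ "$(count_failed)" -gt 0 ]; then
      echo "reproduced on cycle $i"
      return 0
    fi
    i=$((i + 1))
  done
  echo "not reproduced after $max_cycles cycles"
  return 1
}
```

For example, repro_cycle 5 would run up to five cycles and stop at the first one that leaves a failed unit behind.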
Fleet units:
$ fleetctl list-units
UNIT                          MACHINE                    ACTIVE  SUB
couchbase_node@1.service      8995d6d7.../10.156.7.12    active  running
couchbase_node@2.service      ad8cb97d.../10.239.174.35  active  running
couchbase_node@3.service      cc2b61a5.../10.141.247.11  active  running
couchbase_sidekick@1.service  8995d6d7.../10.156.7.12    failed  failed
couchbase_sidekick@2.service  ad8cb97d.../10.239.174.35  active  running
couchbase_sidekick@3.service  cc2b61a5.../10.141.247.11  failed  failed
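To pick out just the failed units from a listing like the one above, a small filter along these lines should work. It is a sketch: failed_units is a hypothetical helper name, and it assumes whitespace-separated columns in the order UNIT, MACHINE, ACTIVE, SUB as shown in the output.

```shell
#!/bin/sh
# Print the names of units whose ACTIVE column (third field) is "failed".
# Reads `fleetctl list-units` style output on stdin and skips the header row.
failed_units() {
  awk '$1 != "UNIT" && $3 == "failed" { print $1 }'
}
```

Usage would be something like fleetctl list-units | failed_units, which for the listing above should print the two failed sidekick units.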
Unit files:
$ fleetctl list-unit-files
UNIT                          HASH     DSTATE    STATE     TARGET
couchbase_node@1.service      97550db  launched  launched  8995d6d7.../10.156.7.12
couchbase_node@2.service      97550db  launched  launched  ad8cb97d.../10.239.174.35
couchbase_node@3.service      97550db  launched  launched  cc2b61a5.../10.141.247.11
couchbase_sidekick@1.service  3c37fd3  launched  launched  8995d6d7.../10.156.7.12
couchbase_sidekick@2.service  f438171  launched  launched  ad8cb97d.../10.239.174.35
couchbase_sidekick@3.service  5c1369d  launched  launched  cc2b61a5.../10.141.247.11
Journalctl logs:
- fleet_logs_post_restart_machine_11.text
- fleet_logs_post_restart_machine_35.text
- fleet_logs_post_restart_machine_12.text
Analyzing the logs
On machine 11, which has one of the failed units, there is an error:
ERROR manager.go:136: Failed to trigger systemd unit couchbase_sidekick@3.service start: Unit couchbase_sidekick@3.service failed to load: No such file or directory.
Likewise, on machine 12, which also has a failed unit, the same error appears for couchbase_sidekick@1.service:
ERROR manager.go:136: Failed to trigger systemd unit couchbase_sidekick@1.service start: Unit couchbase_sidekick@1.service failed to load: No such file or directory.
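To find the affected units on each machine without reading the whole log, the error lines can be filtered mechanically. This is a sketch that assumes the message format is exactly as in the two lines quoted above; missing_unit_files is a hypothetical helper name.

```shell
#!/bin/sh
# Extract the unit names from fleet log lines of the form
#   "Failed to trigger systemd unit <name> start: ... failed to load: No such file or directory"
# Reads log text on stdin and prints each affected unit once.
missing_unit_files() {
  sed -n 's/.*Failed to trigger systemd unit \([^ ]*\) start: .*failed to load: No such file or directory.*/\1/p' |
    sort -u
}
```

On a given machine this could be run as journalctl -b -u fleet.service --no-pager | missing_unit_files.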