starting an existing collector service causes a non-0 exit with systemctl

Question

starting an existing collector service causes a non-0 exit with systemctl

Opened this issue 7 years ago · 24 comments

dannenberg commented 7 years ago

Environment Information

Please include the following:

version of cookbook: 1.2.21
version of chef: 12.21.3
operating system: centos 7 (7.2.1511)
resources that are affected by issue: chef-client

Expected Behavior

What should have happened?
chef-client should run to completion regardless of the state of the collector service

Actual Behavior

What actually happened?
chef-client exits with an error due to this line https://github.com/SumoLogic/sumologic-collector-chef-cookbook/blob/master/recipes/default.rb#L44 which is attempting to start an existing, running collector. The error output is included below:

Recipe: sumologic-collector::default
  * service[collector] action start

    ================================================================================
    Error executing action `start` on resource 'service[collector]'
    ================================================================================

    Mixlib::ShellOut::ShellCommandFailed
    ------------------------------------
    Expected process to exit with [0], but received '1'
    ---- Begin output of /bin/systemctl --system start collector ----
    STDOUT:
    STDERR: Job for collector.service failed because the control process exited with error code. See "systemctl status collector.service" and "journalctl -xe" for details.
    ---- End output of /bin/systemctl --system start collector ----
    Ran /bin/systemctl --system start collector returned 1

    Resource Declaration:
    ---------------------
    # In /var/chef/cache/cookbooks/sumologic-collector/recipes/default.rb

     44:     service 'collector' do
     45:       action :start
     46:     end
     47:   else

    Compiled Resource:
    ------------------
    # Declared in /var/chef/cache/cookbooks/sumologic-collector/recipes/default.rb:44:in `from_file'

    service("collector") do
      action [:start]
      supports {:restart=>nil, :reload=>nil, :status=>nil}
      retries 0
      retry_delay 2
      default_guard_interpreter :default
      service_name "collector"
      enabled nil
      running nil
      masked nil
      pattern "collector"
      declared_type :service
      cookbook_name "sumologic-collector"
      recipe_name "default"
    end

    System Info:
    ------------
    chef_version=12.21.3
    platform=centos
    platform_version=7.2.1511
#
# Copyright 2015-2016, Noah Kantrowitz
#
    ruby=ruby 2.3.1p112 (2016-04-26 revision 54768) [x86_64-linux]
    program_name=chef-client worker: ppid=10420;start=16:22:07;
    executable=/opt/chef/bin/chef-client


Running handlers:
  - Chef::Handler::BugsnagReporter
Running handlers complete
Chef Client failed. 1 resources updated in 05 seconds

Steps to Reproduce

Please list the steps required to reproduce the issue.
We are using terraform to spin up our instances
The initial chef-client run complete successfully, but the next one throws that error.
NB: If we stop the collector service and run chef-client again it completes successfully from then on.

Important Factoids

Are there anything atypical about the setup that we should know? For example everything goes through an HTTP/HTTPS proxy or I do not use rabbitmq.
NB: If we stop the collector service and run chef-client again it completes successfully from then on.

References

Are there any other GitHub issues (open or closed) or Pull Requests that should be linked here? For example:

GH-1234

Answer 1 · 2017-07-31T21:03:17.000Z

If you could explain to my how this block works normally, I may be able to debug this further on my own. I'm not clear on why this block wouldn't throw an error during a normal run where the collector service is already running?

In both cases (when chef-client runs with the collector service running succeeds and when it fails) service collector start appears to return 1. Is there some other attribute that changes to avoid the block entirely?

Answer 2 · 2017-08-03T23:19:46.000Z

This is not a chef problem, if a service start returns an error then that is a problem with the init script regardless of which subsystem is involved.

We could try working around this temporarily by switching from systemd to sysvinit defaults but that is not really a real solution but a work around. I would opt for allowing the init subsystem to be specified but for very different reasons. Giving people the choice to choose their init subsystem is important because not everyone is hopping on the systemd bandwagon for reasons I will not go into here as my intention is not to start a systemd religious war as they are rarely based on facts and are more often driven by emotions.

Answer 3 · 2017-08-18T19:30:11.000Z

Technically this is not our problem but it does not look like sumo is gonna solve it any time soon. I am gonna put together a PR this weekend and will probably want to get @dannenberg to validate that it fixes his scenario.

Answer 4 · 2017-09-12T19:40:57.000Z

I have merged this and am awaiting @duchatran to cut a release. This does not fix the issue reported so I am not sure if I want to leave this open as it's really not a chef issue. The "fix" allows someone to work around the problem but it is not really solved because we can't solve it since we do not manage the init scripts in this cookbook.

Answer 5 · 2017-09-12T22:54:20.000Z

If you could explain to my how this block works normally, I may be able to debug this further on my own. I'm not clear on why this block wouldn't throw an error during a normal run where the collector service is already running?

Well basically the first time it installs (assuming you have not pre-created this dir) this will not run: https://github.com/SumoLogic/sumologic-collector-chef-cookbook/blob/v1.2.22/recipes/default.rb#L31-L52 I think it is to avoid reinstalling (and modifying config possibly which does seem wrong, I would have to look deeper but am not terribly motivated at the moment and would likely just wait until I have time to refactor the whole cookbook) every time since they do not version the deb/rpm files so there is no way to know if we need to upgrade. This cookbook does not attempt to upgrade collectors (something I find sorta annoying).

Like I said though this has nothing to do with the issue at hand, they have a bad systemd unit template and until they fix it all I can say is use sysvinit.

Answer 6 · 2017-10-24T16:12:51.000Z

Wont this section of code just make that last PR not apply?
https://github.com/SumoLogic/sumologic-collector-chef-cookbook/blob/master/providers/default.rb#L169-L170

If systemd is available, it's going to use it, even if you try to override it with the init_style attribute.

Answer 7 · 2017-10-24T16:19:13.000Z

Hmm the short answer is sorta? When using the recipes it will indeed use the requested subsystem as the call is to service rather than sumo_service but I bet you are right if you are just using the LWRPs directly.

I confirmed with the original user it allows them to work around their issue.

Answer 8 · 2017-10-24T16:34:35.000Z

I'm just using the sumologic_collector LWRP. This cookbook has sporadically broken my provisioning pipeline for weeks now. Have you got any tips for getting things working again? I understand we can point fingers at sumo, but at this point, I just need to find some hack to get things functional again.

Answer 9 · 2017-10-24T16:36:12.000Z

We can do a similar change to the lwrp to allow you to be unblocked. I can hopefully get to it tonight.

Answer 10 · 2017-10-24T17:20:21.000Z

@acranox can you try testing out #148 and let me know if this works for you?

Answer 11 · 2017-10-27T13:48:45.000Z

this whole thing may be related to this chef-client change.
chef/chef#6157

Answer 12 · 2017-10-27T14:42:22.000Z

I don't think that is the root cause it may cause it to more often than not favor systemd. That does not account for why sending a start to an already running unit should do anything other than detect that it is already started, print a message to the user indicating such, and exit 0.

Answer 13 · 2017-10-27T15:02:51.000Z

Could someone give me the output of: systemctl show -p MainPID collector.service I am wondering if there are multiple processes being launched and systemd is confused as to which one is the main one.

Answer 14 · 2017-10-27T15:10:55.000Z

$ sudo systemctl show -p MainPID collector.service
MainPID=0

in my case, I was installing the collector, and then enabling/starting the service (which would fail because the collector was already running. And after that happened, checking the status of the service with systemctl returned this:

# systemctl status collector
● collector.service - LSB: @app.long.name@
   Loaded: loaded (/etc/rc.d/init.d/collector; bad; vendor preset: disabled)
   Active: active (exited) since Tue 2017-10-24 15:15:57 UTC; 33s ago
     Docs: man:systemd-sysv-generator(8)
  Process: 4177 ExecStart=/etc/rc.d/init.d/collector start (code=exited, status=0/SUCCESS)

Oct 24 15:15:53 server01 systemd[1]: Starting LSB: @app.long.name@...
Oct 24 15:15:53 server01 collector[4177]: Starting SumoLogic Collector...
Oct 24 15:15:57 server01 collector[4177]: Waiting for SumoLogic Collector......
Oct 24 15:15:57 server01 collector[4177]: running: PID:4240
Oct 24 15:15:57 server01 systemd[1]: Started LSB: @app.long.name@.

Answer 15 · 2017-10-27T17:06:51.000Z

Hmm if you can hit me up in the sumo community slack maybe we can do some more interactive troubleshooting. I will be available after noon pacific time today.

Answer 16 · 2018-10-10T14:08:06.000Z

Hi guys, just wondering if this issue is fixed as I am still running the same errors. Any ideas? Any help is welcome or ideas on how to work around this problem. Thank you.

Answer 17 · 2018-10-11T06:13:00.000Z

I am not using sumologic at my current job but you can work around it by specifying the init style via an attribute: #145

Answer 18 · 2019-01-18T08:37:23.000Z

I was running into this issue also. The systemd unit file given as an example doesn't work. Also the one installed with the deb file didn't work for me either.

It fails with some opaque messages

root@host2:/etc/systemd/system# journalctl -u collector
-- Logs begin at Fri 2019-01-18 07:51:48 UTC, end at Fri 2019-01-18 08:32:01 UTC. --
Jan 18 08:28:25 host2 systemd[1]: Stopped SumoLogic Collector.
Jan 18 08:31:05 host2 systemd[1]: Starting SumoLogic Collector...
Jan 18 08:31:05 host2 collector[3237]: Starting SumoLogic Collector...
Jan 18 08:31:09 host2 collector[3237]: Waiting for SumoLogic Collector......
Jan 18 08:31:09 host2 collector[3237]: running: PID:3296
Jan 18 08:31:09 host2 systemd[1]: Started SumoLogic Collector.
Jan 18 08:31:11 host2 systemd[1]: collector.service: Main process exited, code=exited, status=1/FAILURE
Jan 18 08:31:11 host2 collector[3353]: Stopping SumoLogic Collector...
Jan 18 08:31:11 host2 collector[3353]: SumoLogic Collector was not running.
Jan 18 08:31:11 host2 systemd[1]: collector.service: Unit entered failed state.
Jan 18 08:31:11 host2 systemd[1]: collector.service: Failed with result 'exit-code'.

The above comment gave me a hint.

By adding a PIDFile= directive I'm able to start and stop the collector using systemd.

[Unit]
Description=SumoLogic Collector
After=syslog.target

[Service]
Type=forking
ExecStart=/opt/SumoCollector/./collector start sysd
ExecStop=/opt/SumoCollector/./collector stop sysd
PIDFile=/opt/SumoCollector/collector.pid

[Install]
WantedBy=multi-user.target

Answer 19 · 2019-01-18T21:57:00.000Z

@mbigras interesting I will ping some folks at sumologic to see about them patching their systemd unit file in their packages and documentation.

Answer 20 · 2019-02-07T17:14:52.000Z

I'm having this issue on Amazon Linux 2. The fix by @mbigras doesn't seem to fix the issue like I hoped it would.

What's also odd is that in my kitchen.yml I have RHEL7, CentOS7, and Oracle Linux 7 and all of those work fine; and I confirmed they are using systemd. It just seems to be an issue with Amazon linux 2 when I use a systemd file like @mbigras suggested.

Answer 21 · 2019-02-08T00:38:00.000Z

That is strange, is it the same error as reported? Also can you run:

sudo systemctl show -p MainPID collector.service

If you get a:

MainPID=0

Then it means that the process is not able to launch, this is technically a different problem than the original report as the main report was that sending a start to an existing running service errors out because systemd is confused about where to find the main running process. If this is the case (which I saw at least once above) perhaps running:

sudo systemctl show -p ExecMainPID collector.service

might shed some additional light if the service shows as running as that means that means that the systemd unit file is broken and does not know how to track the PID for the process.

Answer 22 · 2019-02-08T20:39:41.000Z

Yeah I suppose this error might be slightly different then. Here is more info:

           Error executing action `restart` on resource 'service[collector]'
           ================================================================================

           Mixlib::ShellOut::ShellCommandFailed
           ------------------------------------
           Expected process to exit with [0], but received '5'
           ---- Begin output of /bin/systemctl --system restart collector ----
           STDOUT:
           STDERR: Failed to restart collector.service: Unit not found.
           ---- End output of /bin/systemctl --system restart collector ----
           Ran /bin/systemctl --system restart collector returned 5

           Resource Declaration:
           ---------------------
           # In /tmp/kitchen/cache/cookbooks/cn_logging/definitions/default.rb

            73:       service 'collector' do
            74:           action :start
            75:       end
            76:     end

           Compiled Resource:
           ------------------
           # Declared in /tmp/kitchen/cache/cookbooks/cn_logging/definitions/default.rb:73:in `block in from_file'

           service("collector") do
             params {:enable=>true, :disable_sumo=>false, :disable_logrotate=>false, :multiline=>true, :source_name=>"cwp_sdcss-caflog_logs", :source_category=>"development/cwp_sdcss-caflog_logs", :log_names=>"/va
r/log/sdcss-caflog/*.log", :sumo_json_path=>"/opt/sumo-configs", :action=>:enable, :name=>"cwp_sdcss-caflog_logs"}
             action [:start]
             default_guard_interpreter :default
             service_name "collector"
             enabled nil
             running nil
             masked nil
             pattern "collector"
             declared_type :service
             cookbook_name "cwp_agent"
             recipe_name "default"
           end

           System Info:
           ------------
           chef_version=13.4.19
           platform=amazon
           platform_version=2
           ruby=ruby 2.4.1p111 (2017-03-22 revision 58053) [x86_64-linux]
           program_name=chef-client worker: ppid=3384;start=17:38:40;
           executable=/opt/chef/bin/chef-client


       Running handlers:
       [2019-02-08T17:39:11+00:00] ERROR: Running exception handlers
       [2019-02-08T17:39:11+00:00] ERROR: Running exception handlers
       Running handlers complete
       [2019-02-08T17:39:11+00:00] ERROR: Exception handlers complete
       [2019-02-08T17:39:11+00:00] ERROR: Exception handlers complete
       Chef Client failed. 12 resources updated in 31 seconds
       [2019-02-08T17:39:11+00:00] FATAL: Stacktrace dumped to /tmp/kitchen/cache/chef-stacktrace.out
       [2019-02-08T17:39:11+00:00] FATAL: Stacktrace dumped to /tmp/kitchen/cache/chef-stacktrace.out
       [2019-02-08T17:39:11+00:00] FATAL: Please provide the contents of the stacktrace.out file if you file a bug report
       [2019-02-08T17:39:11+00:00] FATAL: Please provide the contents of the stacktrace.out file if you file a bug report
       [2019-02-08T17:39:11+00:00] ERROR: Chef::Exceptions::MultipleFailures
       [2019-02-08T17:39:11+00:00] ERROR: Chef::Exceptions::MultipleFailures
       [2019-02-08T17:39:11+00:00] FATAL: Chef::Exceptions::ChildConvergeError: Chef run process exited unsuccessfully (exit code 1)
       [2019-02-08T17:39:11+00:00] FATAL: Chef::Exceptions::ChildConvergeError: Chef run process exited unsuccessfully (exit code 1)

$ sudo systemctl show -p MainPID collector.service
MainPID=0

$ sudo systemctl show -p ExecMainPID collector.service
ExecMainPID=0

This issues seems to be that the systemd file for collector is missing on amazon linux 2. What's also weird is that all other systemd OSes don't have /etc/init.d/collector but Amazon Linux 2 does.

Regardless I suppose my issue is different than the original issue so you might want to disregard it.

Answer 23 · 2019-08-14T18:35:48.000Z

any updates with Amazon linux 2 issue? i have faced with the same case ?

Answer 24 · 2019-08-14T20:23:41.000Z

any updates with Amazon linux 2 issue? i have faced with the same case ?

I do not, I do not run amazon 2 linux. I suspect that in the case of amazon 2 linux if you are seeing a /etc/init.d/collector that the collector installer itself is detecting that it is not a systemd setup therefore the service is setup as a sys-v init. The chef resource is then attempting to use a systemd unit which does not exist hence the error message. The installer needs to be fixed on sumologic end and not something we can realistically solve as we don't manage that. I suppose we could add an option somewhere and manage the template ourselves in the cookbook but I dont think thats the right approach for the community cookbook even if it might make sense to do this in your wrapper cookbook.