manala/ansible-roles

[manala.telegraf] service unable to start during initial provisioning

lisuml opened this issue · 5 comments

manala.roles version: 3.2.0

During the initial provisioning of a node with the manala.telegraf role attached, the service fails to start:

TASK [manala.roles.telegraf : Configs > Templates present] ****************************************************************************************************************************************************************************************************
changed: [d-test.euc1.XXX.lan] => (item={'state': 'present', 'template': 'configs/_default.j2', 'file': '/etc/telegraf/telegraf.d/os.conf', 'config': '[[inputs.cpu]]\n  totalcpu = true\n[[inputs.disk]]\n  ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs"]\n[[inputs.diskio]]\n[[inputs.kernel]]\n[[inputs.mem]]\n[[inputs.net]]\n[[inputs.netstat]]\n[[inputs.processes]]\n[[inputs.system]]\n'})
changed: [d-test.euc1.XXX.lan] => (item={'state': 'present', 'template': 'configs/_default.j2', 'file': '/etc/telegraf/telegraf.d/output.conf', 'config': '[[outputs.influxdb]]\n  urls = [ "udp://metrix.euc1.XXX.lan:8089" ]\n  udp_payload = "1024B"\n'})

TASK [manala.roles.telegraf : Configs > Files absent] *********************************************************************************************************************************************************************************************************

TASK [manala.roles.telegraf : Services > Services] ************************************************************************************************************************************************************************************************************
failed: [d-test.euc1.XXX.lan] (item=telegraf) => {"ansible_loop_var": "item", "changed": false, "item": "telegraf", "msg": "Unable to start service telegraf: Job for telegraf.service failed because the control process exited with error code.\nSee \"systemctl status telegraf.service\" and \"journalctl -xe\" for details.\n"}

As you can see, the configs are defined properly, but it seems they are not ready on service start.
The error I see in systemd:

Jan 17 13:40:15 d-test.euc1.XXX.lan telegraf[8968]: 2023-01-17T13:40:15Z E! [telegraf] Error running agent: no outputs found, did you provide a valid config file?
Jan 17 13:40:15 d-test.euc1.XXX.lan systemd[1]: telegraf.service: Main process exited, code=exited, status=1/FAILURE

During the 2nd provisioning attempt, the error is gone and the service starts normally.
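For anyone trying to reproduce this, journal entries like the ones above can be pulled from the node with a small ad hoc playbook. This is only a sketch: the `all` host pattern and the `telegraf_journal` variable name are placeholders, not something the role provides.

```yaml
# Hypothetical ad hoc playbook; adjust the host pattern to your inventory.
- hosts: all
  become: true
  tasks:
    # Read-only command, so it never reports a change.
    - name: Read the most recent telegraf journal entries
      ansible.builtin.command: journalctl --unit telegraf.service --no-pager --lines 20
      register: telegraf_journal
      changed_when: false

    - name: Show the journal entries
      ansible.builtin.debug:
        var: telegraf_journal.stdout_lines
```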

After more investigation, it seems the issue is only present with telegraf 1.25.0 (the most recent version at the moment).

The issue is caused by the official Debian packages provided by influxdata: they automatically try to start the telegraf systemd service at installation time. A working outputs configuration is expected to be in the config file at that point, but it is not there yet.

This looks like a bug in telegraf itself and/or in the official telegraf Debian packages. I'm going to file a GitHub issue on the official telegraf repository.

For me, the workaround was simply to pin a lower version of telegraf in the Ansible playbook:

    manala_telegraf_install_packages_default:
      - telegraf=1.24.4-1

nervo commented

@lisuml we ran into the same issue on v1.25.0 and fixed our tests this way: #642

Could you provide all the values you pass to the role?

btw, use manala_telegraf_install_packages instead of manala_telegraf_install_packages_default :)

@nervo: thanks for the follow-up!

> Could you provide all the values you pass to the role?

These are my Ansible variables:

    manala_apt_preferences:
      - influxdb@influxdata
    manala_telegraf_install_packages:
      - telegraf=1.24.4-1
    manala_telegraf_config_template: config/telegraf/base/telegraf.conf.j2
    manala_telegraf_config:
      global_tags:
        environment: "{{ env }}"
    manala_telegraf_configs:
      - file: os.conf
        config: |
          [[inputs.cpu]]
            totalcpu = true
          [[inputs.disk]]
            ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs"]
          [[inputs.diskio]]
          [[inputs.kernel]]
          [[inputs.mem]]
          [[inputs.net]]
          [[inputs.netstat]]
          [[inputs.processes]]
          [[inputs.system]]
      - file: output.conf
        config: |
          [[outputs.influxdb]]
            urls = [ "udp://metrix.euc1.XXX.lan:8089" ]
            udp_payload = "1024B"

> use manala_telegraf_install_packages instead of manala_telegraf_install_packages_default

Roger that.

FYI: I created an issue in the telegraf GitHub repo: influxdata/telegraf#12514

nervo commented

Ok, so let's wait for the next telegraf version :)

(btw, you should also use an explicit telegraf apt preference)

        manala_apt_preferences:
          - telegraf@influxdata

> (btw, you should also use an explicit telegraf apt preference)

My bad. Thanks for pointing this out!
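
Taken together, the two suggestions in this thread lead to variables along these lines. This is only a sketch: it keeps the 1.24.4-1 pin from the workaround until a fixed telegraf release is available, and it assumes the remaining variables (manala_telegraf_config, manala_telegraf_configs) stay as posted earlier.

```yaml
# Explicit apt preference for telegraf, as suggested above.
manala_apt_preferences:
  - telegraf@influxdata

# Use the non-default variable and pin the last version
# that did not show the problem in this thread.
manala_telegraf_install_packages:
  - telegraf=1.24.4-1
```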