Tendrl/tendrl-ansible

Sometimes monitoring-integration is not up

GowthamShanmugam opened this issue · 7 comments

The problem is that monitoring-integration depends on the grafana-server service. tendrl-ansible restarts monitoring-integration first and then restarts grafana-server, so monitoring-integration sometimes stops because Grafana is still restarting.
We need to change the restart order: restart grafana-server first and monitoring-integration second.
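
To illustrate the ordering proposed above, here is a minimal Python 3 sketch, not the tendrl-ansible implementation. It restarts each service and waits until it reports ready before touching the next one. The function name `restart_in_order` and the `restart` / `is_ready` hooks are hypothetical; in practice they could wrap `systemctl restart` and some health probe.

```python
import time


def restart_in_order(services, restart, is_ready, timeout=30.0, interval=1.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Restart each service and wait until it is ready before the next one.

    `restart` and `is_ready` are caller-supplied callables (hypothetical
    hooks, not part of tendrl-ansible). `sleep` and `clock` are injectable
    so the loop can be exercised without real delays.
    """
    for name in services:
        restart(name)
        deadline = clock() + timeout
        # Poll until the service is ready or the deadline has passed.
        while not is_ready(name):
            if clock() >= deadline:
                raise RuntimeError("%s not ready after %.0fs" % (name, timeout))
            sleep(interval)
```

Called with `["grafana-server", "tendrl-monitoring-integration"]`, this would only restart monitoring-integration once grafana-server is reported ready.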

I encountered an issue where tendrl-monitoring-integration does not start, which causes problems later; specifically, import fails:

[Screenshot taken 2018-07-02 at 2:52:49 PM]

Note: Unmanage cluster will also fail if tendrl-monitoring-integration is not running.

Shouldn't tendrl-ansible verify that all the tendrl-related services are correctly up and running?
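
As a rough sketch of what such a verification could look like (an illustration only, not something tendrl-ansible is known to do), the Python 3 snippet below asks `systemctl is-active` for each unit and reports the ones that are not `active`. The helper names `unit_state` and `find_inactive` are hypothetical, and the list of units to check is left to the caller.

```python
import subprocess


def unit_state(name, run=subprocess.run):
    """Return the state string printed by `systemctl is-active NAME`."""
    result = run(["systemctl", "is-active", name],
                 stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                 universal_newlines=True)
    # systemctl prints e.g. "active", "inactive" or "failed" on stdout.
    return result.stdout.strip()


def find_inactive(units, state_of=unit_state):
    """Map each unit that is not 'active' to the state it reported."""
    states = {name: state_of(name) for name in units}
    return {name: state for name, state in states.items() if state != "active"}
```

For example, `find_inactive(["grafana-server", "tendrl-monitoring-integration"])` would return an empty dict when both units are running, and a playbook run could fail if the dict is non-empty.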

@GowthamShanmugam

I have tried rebooting, and I have tried stopping grafana-server and starting it again manually (i.e. restarting grafana-server first, before starting tendrl-monitoring-agent), per the https://github.com/Tendrl/documentation/wiki/Tendrl-release-v1.6.3-(install-guide) instructions. I still cannot get tendrl-monitoring-integration to start and stay running.

$ rpm -qa | grep tendrl | sort

tendrl-api-1.6.3-20180626T110501.5a1c79e.noarch
tendrl-api-httpd-1.6.3-20180626T110501.5a1c79e.noarch
tendrl-commons-1.6.3-20180628T114340.d094568.noarch
tendrl-grafana-plugins-1.6.3-20180622T070617.1f84bc8.noarch
tendrl-grafana-selinux-1.5.4-20180227T085901.984600c.noarch
tendrl-monitoring-integration-1.6.3-20180622T070617.1f84bc8.noarch
tendrl-node-agent-1.6.3-20180618T083110.ba580e6.noarch
tendrl-notifier-1.6.3-20180618T083117.fd7bddb.noarch
tendrl-selinux-1.5.4-20180227T085901.984600c.noarch
tendrl-ui-1.6.3-20180625T085228.23f862a.noarch

$ systemctl status tendrl-monitoring-integration -l

● tendrl-monitoring-integration.service - Monitoring Integration
   Loaded: loaded (/usr/lib/systemd/system/tendrl-monitoring-integration.service; enabled; vendor preset: disabled)
   Active: failed (Result: start-limit) since Mon 2018-07-02 19:23:49 UTC; 5min ago
     Docs: https://github.com/Tendrl/monitoring-integration/tree/master/doc/source
  Process: 5697 ExecStart=/usr/bin/tendrl-monitoring-integration (code=exited, status=1/FAILURE)
 Main PID: 5697 (code=exited, status=1/FAILURE)

Jul 02 19:23:48 tendrl-server systemd[1]: tendrl-monitoring-integration.service: main process exited, code=exited, status=1/FAILURE
Jul 02 19:23:48 tendrl-server systemd[1]: Unit tendrl-monitoring-integration.service entered failed state.
Jul 02 19:23:48 tendrl-server systemd[1]: tendrl-monitoring-integration.service failed.
Jul 02 19:23:49 tendrl-server systemd[1]: tendrl-monitoring-integration.service holdoff time over, scheduling restart.
Jul 02 19:23:49 tendrl-server systemd[1]: start request repeated too quickly for tendrl-monitoring-integration.service
Jul 02 19:23:49 tendrl-server systemd[1]: Failed to start Monitoring Integration.
Jul 02 19:23:49 tendrl-server systemd[1]: Unit tendrl-monitoring-integration.service entered failed state.
Jul 02 19:23:49 tendrl-server systemd[1]: tendrl-monitoring-integration.service failed.

tail from "journalctl -u tendrl-monitoring-integration" output:

Jun 30 18:06:10 tendrl-server tendrl-monitoring-integration[10346]: Load definitions (.yml) for namespace.tendrl.objects.TendrlContext
Jun 30 18:06:10 tendrl-server tendrl-monitoring-integration[10346]: Traceback (most recent call last):
Jun 30 18:06:10 tendrl-server tendrl-monitoring-integration[10346]: File "/usr/bin/tendrl-monitoring-integration", line 9, in <module>
Jun 30 18:06:10 tendrl-server tendrl-monitoring-integration[10346]: load_entry_point('tendrl-monitoring-integration==1.6.3', 'console_scripts', 'tendrl-monitoring-integration')()
Jun 30 18:06:10 tendrl-server tendrl-monitoring-integration[10346]: File "/usr/lib/python2.7/site-packages/tendrl/monitoring_integration/manager/__init__.py", line 71, in main
Jun 30 18:06:10 tendrl-server tendrl-monitoring-integration[10346]: monitoring_integration_manager.start()
Jun 30 18:06:10 tendrl-server tendrl-monitoring-integration[10346]: File "/usr/lib/python2.7/site-packages/tendrl/monitoring_integration/manager/__init__.py", line 31, in start
Jun 30 18:06:10 tendrl-server tendrl-monitoring-integration[10346]: dashboard.upload_default_dashboards()
Jun 30 18:06:10 tendrl-server tendrl-monitoring-integration[10346]: File "/usr/lib/python2.7/site-packages/tendrl/monitoring_integration/grafana/dashboard.py", line 27, in upload_default_dashboards
Jun 30 18:06:10 tendrl-server tendrl-monitoring-integration[10346]: raise ex
Jun 30 18:06:10 tendrl-server tendrl-monitoring-integration[10346]: KeyError: 'id'
Jun 30 18:06:10 tendrl-server systemd[1]: tendrl-monitoring-integration.service: main process exited, code=exited, status=1/FAILURE
Jun 30 18:06:10 tendrl-server systemd[1]: Unit tendrl-monitoring-integration.service entered failed state.
Jun 30 18:06:10 tendrl-server systemd[1]: tendrl-monitoring-integration.service failed.
Jun 30 18:06:10 tendrl-server systemd[1]: tendrl-monitoring-integration.service holdoff time over, scheduling restart.
Jun 30 18:06:10 tendrl-server systemd[1]: start request repeated too quickly for tendrl-monitoring-integration.service
Jun 30 18:06:10 tendrl-server systemd[1]: Failed to start Monitoring Integration.
Jun 30 18:06:10 tendrl-server systemd[1]: Unit tendrl-monitoring-integration.service entered failed state.
Jun 30 18:06:10 tendrl-server systemd[1]: tendrl-monitoring-integration.service failed.

I've also posted similar info to Tendrl/ui#995.

I should also mention that I've tried to start tendrl-monitoring-integration multiple times; it runs for a short while and then dies. systemd keeps trying to restart it, but each time it runs briefly and dies again.

Per @mbukatov, from an email thread: "I'm not sure if this itself would work good enough, as the restart itself would not fix the problem in most cases.

If the monitoring integration can get into state that it crashes, is restarted and it's not up again, it's either:

  • bug in systemd service file
  • bug in monitoring integration, which node agent wouldn't be able to resolve"

@nthomas-redhat @Tendrl/qe @GowthamShanmugam

Mentioning @anmolsachan, as he mentioned seeing this same issue a while back.

My setup uses tendrl-vagrant with 1 Tendrl server and 3 Tendrl nodes (Gluster nodes). The Grafana password in /etc/tendrl/monitoring-integration/monitoring-integration.conf.yaml is properly set to the long string that the system specifies.

Thoughts?

Excerpt from /var/log/messages:

Jun 29 21:17:02 ibm-p8-kvm-03-guest-02 tendrl-monitoring-integration: Setup TendrlContext for namespace.tendrl
Jun 29 21:17:02 ibm-p8-kvm-03-guest-02 tendrl-monitoring-integration: Load definitions (.yml) for namespace.tendrl.objects.TendrlContext
Jun 29 21:17:02 ibm-p8-kvm-03-guest-02 tendrl-monitoring-integration: Load definitions (.yml) for namespace.tendrl.objects.TendrlContext
Jun 29 21:17:02 ibm-p8-kvm-03-guest-02 tendrl-monitoring-integration: Traceback (most recent call last):
Jun 29 21:17:02 ibm-p8-kvm-03-guest-02 tendrl-monitoring-integration: File "/usr/bin/tendrl-monitoring-integration", line 9, in <module>
Jun 29 21:17:02 ibm-p8-kvm-03-guest-02 tendrl-monitoring-integration: load_entry_point('tendrl-monitoring-integration==1.6.3', 'console_scripts', 'tendrl-monitoring-integration')()
Jun 29 21:17:02 ibm-p8-kvm-03-guest-02 tendrl-monitoring-integration: File "/usr/lib/python2.7/site-packages/tendrl/monitoring_integration/manager/__init__.py", line 71, in main
Jun 29 21:17:02 ibm-p8-kvm-03-guest-02 tendrl-monitoring-integration: monitoring_integration_manager.start()
Jun 29 21:17:02 ibm-p8-kvm-03-guest-02 tendrl-monitoring-integration: File "/usr/lib/python2.7/site-packages/tendrl/monitoring_integration/manager/__init__.py", line 31, in start
Jun 29 21:17:02 ibm-p8-kvm-03-guest-02 tendrl-monitoring-integration: dashboard.upload_default_dashboards()
Jun 29 21:17:02 ibm-p8-kvm-03-guest-02 tendrl-monitoring-integration: File "/usr/lib/python2.7/site-packages/tendrl/monitoring_integration/grafana/dashboard.py", line 27, in upload_default_dashboards
Jun 29 21:17:02 ibm-p8-kvm-03-guest-02 tendrl-monitoring-integration: raise ex
Jun 29 21:17:02 ibm-p8-kvm-03-guest-02 tendrl-monitoring-integration: KeyError: 'id'
Jun 29 21:17:02 ibm-p8-kvm-03-guest-02 systemd: tendrl-monitoring-integration.service: main process exited, code=exited, status=1/FAILURE
Jun 29 21:17:02 ibm-p8-kvm-03-guest-02 systemd: Unit tendrl-monitoring-integration.service entered failed state.
Jun 29 21:17:02 ibm-p8-kvm-03-guest-02 systemd: tendrl-monitoring-integration.service failed.
Jun 29 21:17:02 ibm-p8-kvm-03-guest-02 journal: 2018-06-29 21:17:02.171715+00:00 - monitoring_integration - /usr/lib/python2.7/site-packages/tendrl/monitoring_integration/grafana/dashboard.py:26 - upload_default_dashboards - ERROR - Invalid username or password
Jun 29 21:17:02 ibm-p8-kvm-03-guest-02 tendrl-node-agent: 2018-06-29 21:17:02.171715+00:00 - monitoring_integration - /usr/lib/python2.7/site-packages/tendrl/monitoring_integration/grafana/dashboard.py:26 - upload_default_dashboards - ERROR - Invalid username or password
Jun 29 21:17:02 ibm-p8-kvm-03-guest-02 systemd: tendrl-monitoring-integration.service holdoff time over, scheduling restart.
Jun 29 21:17:02 ibm-p8-kvm-03-guest-02 systemd: Started Monitoring Integration.
Jun 29 21:17:02 ibm-p8-kvm-03-guest-02 systemd: Starting Monitoring Integration...
Jun 29 21:17:02 ibm-p8-kvm-03-guest-02 tendrl-monitoring-integration: Creating namespace.monitoring from source tendrl.monitoring_integration
Jun 29 21:17:02 ibm-p8-kvm-03-guest-02 tendrl-monitoring-integration: namespace.monitoring created!
Jun 29 21:17:02 ibm-p8-kvm-03-guest-02 tendrl-monitoring-integration: Finding objects in namespace.monitoring.objects
Jun 29 21:17:02 ibm-p8-kvm-03-guest-02 tendrl-monitoring-integration: Registering object namespace.monitoring.objects.AlertOrganization
Jun 29 21:17:02 ibm-p8-kvm-03-guest-02 tendrl-monitoring-integration: Finding atoms in namespace.monitoring.objects.AlertOrganization.atoms
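
In the excerpt above, the "Invalid username or password" error is logged from dashboard.py:26 at the same moment that `KeyError: 'id'` is re-raised from line 27. One plausible reading (an assumption, since the actual dashboard.py code is not shown in this thread) is that Grafana returned an error body with a `message` field and no `id`, and the code indexed `['id']` anyway. A minimal Python sketch of a more explicit check, using the hypothetical names `dashboard_id` and `GrafanaResponseError`:

```python
class GrafanaResponseError(Exception):
    """Raised when a response body lacks the expected 'id' field."""


def dashboard_id(body):
    """Return body['id'], or raise an error that carries the server's message.

    `body` is assumed to be the already-decoded JSON dict of a response.
    """
    if "id" not in body:
        # Surface the server-side message instead of a bare KeyError.
        raise GrafanaResponseError(body.get("message", "response has no 'id' field"))
    return body["id"]
```

With this shape, a rejected login would fail with the credentials message rather than `KeyError: 'id'`, which might make the root cause easier to spot in `systemctl status` output.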

@julienlim Actually, I haven't faced the issue myself. I suspected that the issue might be in tendrl-monitoring-integration.