netdata/helmchart

"alarms" persistence flag affects netdata cloud

Closed this issue · 35 comments

This Github issue is synchronized with Zendesk:

Ticket ID: #20
Group: Support
Requester: rahimian.pegah@gmail.com
Assignee: Zack
Issue escalated by: Christopher Akritidis

Original ticket description:

Hi,
I have trouble signing in to my netdata cloud.
I am using the master-slave functionality of netdata. But I cannot see the nodes properly and slave nodes are getting disconnected.

Would you please kindly help me with how I can solve this issue?

Regards,
Pegah

Comment made from Zendesk by Zack on 2020-04-28 at 14:03:

Hi,
You probably have a setting in your browser that prevents the required cookie from being stored. If you copy the magic link URL, and paste it manually in a new tab in your browser with the developer console showing, you will probably see an error. Please give us a screenshot of any such errors and the browser version, so we can help further.

Thanks,
Zack

Comment made from Zendesk by Pegah Rahimian on 2020-04-28 at 14:19:

Thanks for your reply,
Now I can sign in with GitHub. But there are also too many nodes there, and worker 1, 2, 3 cannot be recognized. That is really strange, as I cannot access the slaves in netdata. It happened after I turned on TLS.
Would you kindly help me resolve this issue?

Thanks,
Pegah

Comment made from Zendesk by Zack on 2020-04-28 at 14:32:

Please report a bug on GitHub here: https://github.com/netdata/netdata/issues/new/choose and provide as much detail as you can about your systems and setup.

Comment made from Zendesk by Zack on 2020-04-28 at 14:32:

Christopher Akritidis, is this related to that one bug about masters/slaves and not seeing slave nodes?

Comment made from Zendesk by Christopher Akritidis on 2020-04-29 at 10:12:

Having it happen when TLS is turned on is key. I expect that the URL was http and now it should be https. So he could perhaps fix it by deleting some URLs from his visited nodes.

Comment made from Zendesk by Christopher Akritidis on 2020-04-29 at 10:13:

I can show you exactly what I mean, if you want.

Comment made from Zendesk by Costa Tsaousis on 2020-04-30 at 07:45:

Has anyone answered this?


Comment made from Zendesk by Zack on 2020-04-30 at 13:17:

Yes, and I have not heard back yet.

Comment made from Zendesk by Zack on 2020-04-30 at 17:30:

Hi Pegah,

When you turn on TLS you may have to check your visited nodes and make sure you are connecting to them using https:// instead of http:// since the old URLs will no longer be reachable.

Please let us know if you are still having this issue!

Comment made from Zendesk by Pegah Rahimian on 2020-04-30 at 17:41:

Hi Zack,
Thanks, but I have a deeper issue. I can see a lot of unreachable nodes which are replicating without any reason, as you can see in the screenshot below:
[screenshot]
and then, a wrong machine GUID for worker 1 and worker 3:
[screenshot]
and unreachable for worker 2.
On the other hand, I only see netdata-master-0, worker 1, and worker 3 on the frontend.
I also receive this log from worker 2:
"netdata ERROR : STREAM_SENDER[worker-2] : Cannot resolve host 'netdata', port '19999': Try again"

This behavior is not really normal. I also posted my issue, but they couldn't help me.
I would really appreciate it if you could help me or give me a hint for this problem.

Thanks a lot,
Pegah

Comment made from Zendesk by Pegah Rahimian on 2020-04-30 at 17:42:

[screenshot]



Comment made from Zendesk by Pegah Rahimian on 2020-04-30 at 17:43:

There is no TLS; I have this issue even without the TLS setting.


Comment made from Zendesk by Zack on 2020-04-30 at 17:47:

Comment made from Zendesk by Zack on 2020-04-30 at 17:53:

Ignore that, I think the previously linked issue is not the right one (and I can't edit that comment).

Comment made from Zendesk by Zack on 2020-04-30 at 17:55:

If you made a GitHub issue, can you link me to that issue?

Also, what is your Netdata version? You can get it with "netdata -v".

Comment made from Zendesk by Pegah Rahimian on 2020-04-30 at 17:57:

netdata v1.20.0-278-nightly

and here is the issue:



Comment made from Zendesk by Pegah Rahimian on 2020-04-30 at 18:01:

Perhaps you can access the UI?


Although we are reinstalling k8s in a few hours to check if it will solve the issue.


Comment made from Zendesk by Zack on 2020-04-30 at 19:19:

I cannot access the UI.

My current version is netdata v1.21.1 - do you have an update available?

I went through the GitHub issue. When you reinstall k8s, are you going to redeploy/update your helm charts?

Comment made from Zendesk by Pegah Rahimian on 2020-04-30 at 20:03:

I also tried v1.21, but same issue.
For reinstalling, I will use "helm delete netdata", then "helm install netdata ./path_to_helm_chart".

I assume it should be delete and purge, as I'm using helm3. But as you see in the screenshot, I can see a lot of previous unreachable nodes. I don't understand how those nodes got replicated.

Thanks,
Pegah

Comment made from Zendesk by Zack on 2020-04-30 at 20:17:

Ok. Let me know when you reinstall, then we can add some details to the github issue if that doesn't fix it.
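
In case it helps: with Helm 3 there is no `--purge` flag any more, so a clean reinstall is roughly the following (release name and chart path are taken from your message; the PVC step is only needed if you want to wipe old data, since uninstalling a release does not necessarily delete its PersistentVolumeClaims):

```sh
# Helm 3: "uninstall" replaces Helm 2's "delete --purge"
helm uninstall netdata

# Old volumes (and the machine guids stored on them) can survive an uninstall;
# inspect them and delete only if you really want a clean slate
kubectl get pvc | grep netdata

# Reinstall from the local chart path
helm install netdata ./path_to_helm_chart
```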

Comment made from Zendesk by Pegah Rahimian on 2020-04-30 at 20:19:

Thanks a lot for your help. I will contact you tomorrow,

Pegah

Comment made from Zendesk by Christopher Akritidis on 2020-05-01 at 11:19:

I know where the issue is coming from, will write on GitHub.

Comment made from Zendesk by Christopher Akritidis on 2020-05-01 at 12:05:

Ok, I saw the discussion in 8847 and it has to do with streaming, which is unrelated to this report. I will unlink it, so it doesn't get confusing.

The issue here is caused by a confusing link between the "alarms persistence" flag and the netdata instance unique identifier (MACHINE_GUID), which is not apparent in the documentation for values.yaml. `/var/lib/netdata` is used to store both the alarms log and the `registry` information (specifically netdata.registry.unique.id). So we'll need to update our helm chart to make this much clearer. We'll possibly need to force /var/lib/netdata to always use persistent storage, but that will be discussed in an issue.
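
A quick way to see this in practice (a rough sketch; the pod name follows the chart's default statefulset naming, so adjust it to your release, and add `-c <container>` if the pod runs more than one container) is to check whether the master's unique id survives a pod restart:

```sh
# Print the master's machine guid (usual on-disk location), restart the pod, print it again
kubectl exec netdata-master-0 -- cat /var/lib/netdata/registry/netdata.public.unique.id
kubectl delete pod netdata-master-0    # the statefulset recreates the pod
kubectl exec netdata-master-0 -- cat /var/lib/netdata/registry/netdata.public.unique.id

# If the two ids differ, /var/lib/netdata is not persisted and every restart
# shows up in netdata cloud as a brand new node
```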

What you can do from your side as a workaround, to prevent new netdata masters from appearing, is to set the master alarms persistence flag to true and provide a storage class that will work in your environment.
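
Concretely, something along these lines should do it (a sketch only; double-check the key names against the chart's values.yaml for your version, and pick a storage class that actually exists in your cluster, e.g. from `kubectl get storageclass`):

```sh
helm upgrade netdata ./path_to_helm_chart \
  --set master.alarms.persistence=true \
  --set master.alarms.storageclass=<your-storage-class> \
  --set master.alarms.volumesize=1Gi
```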

We basically have three options here:

  • Provide more options for persistence, that distinguish between the various files stored under /var/lib/netdata. A claim.d directory is getting added there as well. I'm not fond of this option, because it means we'll need to always add new options and require new volumes every time we add a new file or directory there.
  • Change the docs and the config files so that it's clear it's not just alarms that supposedly use the persistent storage for /var/lib/netdata but also key things like the machine guid, the claiming information and the health management API key. We strongly recommend that users persist this volume and only disable it if they don't want to use netdata cloud.
  • Just demand at least one persistent volume, to be used for /var/lib/netdata, removing completely the option to disable it.

Would like to hear your views.

Comment made from Zendesk by Pegah Rahimian on 2020-05-03 at 19:26:

Hi,
Thanks for checking. We have re-installed k8s and installed netdata with the helm chart (v1.21) again. I have also set the persistence flags for alarms and database to true, but we still cannot access the slaves immediately. There are only 2 nodes: "netdata-master-0 & master", so we still have a streaming problem for the slaves. This is the log I receive from all workers:

netdata ERROR : STREAM_SENDER[worker-2] : Cannot resolve host 'netdata', port '19999': Try again (errno 22, Invalid argument)

Would you please kindly help me with how to solve the streaming issue? Does it have anything to do with local storage?

Comment made from Zendesk by Zack on 2020-05-03 at 20:57:

I cannot access the address you provided. Is it internal?

The helm chart was updated recently (1.2.0 for 1.21.0).

For now I made a new helm chart issue here: #94 in case it's a problem with how the services are set up.
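
Also, since the worker error is a DNS failure on the host name `netdata`, a quick way to narrow it down is to check what the parent service is actually called and whether that name resolves from inside a worker pod (a sketch; pod names are placeholders, and use nslookup or ping instead of getent if that is what the image ships):

```sh
# Which services did the chart create, and in which namespace?
kubectl get svc --all-namespaces | grep netdata

# Does the name the workers stream to resolve from inside a worker pod?
kubectl exec <worker-pod-name> -- getent hosts netdata

# If the service lives in another namespace, the short name "netdata" will not
# resolve; the fully qualified form is netdata.<namespace>.svc.cluster.local
```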

Comment made from Zendesk by Zack on 2020-05-04 at 15:34:

Hi Pegah,

Just to keep you updated: We've decided to continue the discussion in issue 8847. There are a few things for us to test out - thank you for reporting this, you are helping make Netdata better!
Have a look at the latest comments and let me know if some of those suggestions work for you.

Comment made from Zendesk by Pegah Rahimian on 2020-05-04 at 21:05:

Hi,
Thanks, here is what we see:

1) This issue is not concerning us, as we are not using netdata cloud, so that's not a big deal. I just asked you because it seemed abnormal to me.

2) We want to encrypt intra-cluster communication, but we could not set up master-slave TLS successfully yet.

It would be great if you could let me know the result of your testing, as we are still having trouble with the streaming configuration, and I don't have any idea how to reach all the slaves after enabling TLS. Are you able to try that during your testing phase?

Thanks,
Pegah

Comment made from Zendesk by Zack on 2020-05-04 at 22:21:

I will let you know once our testing is complete. If we run into unexpected problems, the discussion will be in the issue I linked previously.
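
In case it helps in the meantime: outside of the chart, plain Netdata master/slave streaming over TLS is normally enabled by appending `:SSL` to the destination on each slave and giving the master a certificate. The sketch below uses placeholder host names, api key and cert paths, and in the helm chart these files would be supplied through the chart's configuration values rather than written by hand, so please check the streaming documentation for your version:

```sh
# On each slave: stream.conf (sketch)
cat > /etc/netdata/stream.conf <<'EOF'
[stream]
    enabled = yes
    destination = netdata-master:19999:SSL
    api key = 11111111-2222-3333-4444-555555555555    # placeholder - use your own
    # only if the master uses a self-signed certificate:
    #ssl skip certificate verification = yes
EOF

# On the master: point the web server at a key/certificate pair (netdata.conf sketch)
cat >> /etc/netdata/netdata.conf <<'EOF'
[web]
    ssl key = /etc/netdata/ssl/key.pem
    ssl certificate = /etc/netdata/ssl/cert.pem
EOF
```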

Comment made from Zendesk by Pegah Rahimian on 2020-05-19 at 22:54:

Dear Zack,

Is there someone I can ask about how to enable the weblog module in the netdata helm chart?
I want to see the upstream_response_time chart in netdata (probably for nginx).
I have enabled the go.d plugin in values.yaml, but there is still no upstream_response_time chart in the UI.
Do I need any other configuration?

Thanks again,
Pegah

Comment made from Zendesk by Zack on 2020-05-20 at 07:29:

You need to configure both nginx and Netdata and make sure to restart both. Also see https://learn.netdata.cloud/docs/agent/collectors/python.d.plugin/nginx/

I will look into the specific value and the helm chart and let you know.
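
In the meantime, roughly what is needed (a sketch, not verified against your chart version): the go.d web_log collector can only chart upstream_response_time if nginx actually writes that field to its access log, so nginx needs a custom log_format and the collector needs a job pointed at that log. In the helm chart the custom go.d config is usually supplied through the chart's configuration values, and the Netdata pod must also be able to read the nginx log file; check the chart README for the exact keys.

```sh
# 1) nginx side: add timing fields to the access log, then reload nginx
#    log_format netdata '$remote_addr "$request" $status $body_bytes_sent '
#                       '$request_time $upstream_response_time';
#    access_log /var/log/nginx/access.log netdata;

# 2) Netdata side: a minimal go.d/web_log.conf job reading that log
cat > web_log.conf <<'EOF'
jobs:
  - name: nginx
    path: /var/log/nginx/access.log
EOF
# Restart the Netdata agent (or pod) after the config change.
```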

Comment made from Zendesk by Zack on 2020-05-20 at 16:50:

Closing in favor of #114