netdata/helmchart

"alarms" persistence flag affects netdata cloud

Closed this issue · 35 comments

This Github issue is synchronized with Zendesk:

Ticket ID: #20
Group: Support
Requester: rahimian.pegah@gmail.com
Assignee: Zack
Issue escalated by: Christopher Akritidis

Original ticket description:

Hi,
I have trouble signing in to my netdata cloud.
I am using the master-slave functionality of netdata. But I cannot see the nodes properly and slave nodes are getting disconnected.

Would you please kindly help me with how I can solve this issue?

Regards,
Pegah

Comment made from Zendesk by Zack on 2020-04-28 at 14:03:

Hi,
You probably have a setting in your browser that prevents the required cookie from being stored. If you copy the magic link URL, and paste it manually in a new tab in your browser with the developer console showing, you will probably see an error. Please give us a screenshot of any such errors and the browser version, so we can help further.

Thanks,
Zack

Comment made from Zendesk by Pegah Rahimian on 2020-04-28 at 14:19:

Thanks for your reply,
Now I can sign in with GitHub. But there are also too many nodes there, and worker 1, 2, 3 cannot be recognized. That is really strange, as I cannot access the slaves in netdata. It happened after I turned on TLS.
Would you kindly help me resolve this issue?

Thanks,
Pegah

Comment made from Zendesk by Zack on 2020-04-28 at 14:32:

Please report a bug on GitHub here: https://github.com/netdata/netdata/issues/new/choose and provide as much detail as you can about your systems and setup.

Comment made from Zendesk by Zack on 2020-04-28 at 14:32:

Christopher Akritidis, is this related to that one bug about masters/slaves and not seeing slave nodes?

Comment made from Zendesk by Christopher Akritidis on 2020-04-29 at 10:12:

Having it happen when TLS is turned on is key. I expect that the URL was http and now it should be https. So he could perhaps fix it by deleting some URLs from his visited nodes.

Comment made from Zendesk by Christopher Akritidis on 2020-04-29 at 10:13:

I can show you exactly what I mean, if you want.

Comment made from Zendesk by Costa Tsaousis on 2020-04-30 at 07:45:

Has anyone answered this?


Comment made from Zendesk by Zack on 2020-04-30 at 13:17:

Yes, and I have not heard back yet.

Comment made from Zendesk by Zack on 2020-04-30 at 17:30:

Hi Pegah,

When you turn on TLS you may have to check your visited nodes and make sure you are connecting to them using https:// instead of http:// since the old URLs will no longer be reachable.

Please let us know if you are still having this issue!

Comment made from Zendesk by Pegah Rahimian on 2020-04-30 at 17:41:

Hi Zack,
Thanks, but I have a deeper issue. I can see a lot of unreachable nodes which are replicating without any reason, as you can see in the screenshot below:
[screenshot]
and then, a wrong machine GUID for worker 1 and worker 3:
[screenshot]
and unreachable for worker 2.
On the other hand, I only see netdata-master-0, worker 1, and worker 3 on the frontend.
I also receive this log from worker 2:
"netdata ERROR : STREAM_SENDER[worker-2] : Cannot resolve host 'netdata', port '19999': Try again"

This behavior is not really normal. I also posted my issue, but they couldn't help me.
I would really appreciate it if you could help me or give me a hint for this problem.

Thanks a lot,
Pegah

Comment made from Zendesk by Pegah Rahimian on 2020-04-30 at 17:42:

[screenshot]



Comment made from Zendesk by Pegah Rahimian on 2020-04-30 at 17:43:

There is no TLS; I have this issue even without the TLS setting.


Comment made from Zendesk by Zack on 2020-04-30 at 17:47:

Comment made from Zendesk by Zack on 2020-04-30 at 17:53:

Ignore that, I think the previously linked issue is not the right one (and I can't edit that comment).

Comment made from Zendesk by Zack on 2020-04-30 at 17:55:

If you made a GitHub issue, can you link me to that issue?

Also, what is your Netdata version? You can get it with "netdata -v".

Comment made from Zendesk by Pegah Rahimian on 2020-04-30 at 17:57:

netdata v1.20.0-278-nightly

and here is the issue:



Comment made from Zendesk by Pegah Rahimian on 2020-04-30 at 18:01:

Perhaps you can access the UI?


Although we are reinstalling k8s in a few hours to check if it will solve the issue.


Comment made from Zendesk by Zack on 2020-04-30 at 19:19:

I cannot access the UI.

My current version is netdata v1.21.1 - do you have an update available?

I went through the GitHub issue. When you reinstall k8s, are you going to redeploy/update your helm charts?

Comment made from Zendesk by Pegah Rahimian on 2020-04-30 at 20:03:

I also tried v1.21, but same issue.
For reinstalling, I will use "helm delete netdata", then "helm install netdata ./path_to_helm_chart".

I assume it should be delete and purge, as I'm using helm3. But as you see in the screenshot, I can see a lot of previous unreachable nodes. I don't understand how those nodes got replicated.

Thanks,
Pegah

Comment made from Zendesk by Zack on 2020-04-30 at 20:17:

Ok. Let me know when you reinstall, then we can add some details to the github issue if that doesn't fix it.
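
In case it helps: with Helm 3 there is no `--purge` flag any more, so a clean reinstall is roughly the following (release name and chart path are taken from your message; the PVC step is only needed if you want to wipe old data, since uninstalling a release does not necessarily delete its PersistentVolumeClaims):

```sh
# Helm 3: "uninstall" replaces Helm 2's "delete --purge"
helm uninstall netdata

# Old volumes (and the machine guids stored on them) can survive an uninstall;
# inspect them and delete only if you really want a clean slate
kubectl get pvc | grep netdata

# Reinstall from the local chart path
helm install netdata ./path_to_helm_chart
```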

Comment made from Zendesk by Pegah Rahimian on 2020-04-30 at 20:19:

Thanks a lot for your help. I will contact you tomorrow,

Pegah

Comment made from Zendesk by Christopher Akritidis on 2020-05-01 at 11:19:

I know where the issue is coming from, will write on GitHub.

Comment made from Zendesk by Christopher Akritidis on 2020-05-01 at 12:05:

Ok, I saw the discussion in 8847 and it has to do with streaming, which is unrelated to this report. I will unlink it, so it doesn't get confusing.

The issue here is caused by a confusing link between the "alarms persistence" flag and the netdata instance unique identifier (MACHINE_GUID), which is not apparent in the documentation for values.yaml. `/var/lib/netdata` is used to store both the alarms log and the `registry` information (specifically netdata.registry.unique.id). So we'll need to update our helm chart to make this much clearer. We'll possibly need to force /var/lib/netdata to always use persistent storage, but that will be discussed in an issue.
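
A quick way to see this in practice (a rough sketch; the pod name follows the chart's default statefulset naming, so adjust it to your release, and add `-c <container>` if the pod runs more than one container) is to check whether the master's unique id survives a pod restart:

```sh
# Print the master's machine guid (usual on-disk location), restart the pod, print it again
kubectl exec netdata-master-0 -- cat /var/lib/netdata/registry/netdata.public.unique.id
kubectl delete pod netdata-master-0    # the statefulset recreates the pod
kubectl exec netdata-master-0 -- cat /var/lib/netdata/registry/netdata.public.unique.id

# If the two ids differ, /var/lib/netdata is not persisted and every restart
# shows up in netdata cloud as a brand new node
```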

What you can do from your side as a workaround, to prevent new netdata masters from appearing, is to set the master alarms persistence flag to true and provide a storage class that will work in your environment.
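
Concretely, something along these lines should do it (a sketch only; double-check the key names against the chart's values.yaml for your version, and pick a storage class that actually exists in your cluster, e.g. from `kubectl get storageclass`):

```sh
helm upgrade netdata ./path_to_helm_chart \
  --set master.alarms.persistence=true \
  --set master.alarms.storageclass=<your-storage-class> \
  --set master.alarms.volumesize=1Gi
```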

We basically have three options here:

  • Provide more options for persistence, that distinguish between the various files stored under /var/lib/netdata. A claim.d directory is getting added there as well. I'm not fond of this option, because it means we'll need to always add new options and require new volumes every time we add a new file or directory there.
  • Change the docs and the config files so that it's clear it's not just alarms that supposedly use the persistent storage for /var/lib/netdata but also key things like the machine guid, the claiming information and the health management API key. We strongly recommend that users persist this volume and only disable it if they don't want to use netdata cloud.
  • Just demand at least one persistent volume, to be used for /var/lib/netdata, removing completely the option to disable it.

Would like to hear your views.

Comment made from Zendesk by Pegah Rahimian on 2020-05-03 at 19:26:

Hi,
Thanks for checking. We have re-installed k8s and installed netdata with the helm chart (v1.21) again. I have also set the persistence flags for alarms and database to true, but we still cannot access the slaves immediately. There are only 2 nodes: "netdata-master-0 & master", so we still have a streaming problem for the slaves. This is the log I receive from all workers:

netdata ERROR : STREAM_SENDER[worker-2] : Cannot resolve host 'netdata', port '19999': Try again (errno 22, Invalid argument)

Would you please kindly help me with how to solve the streaming issue? Does it have anything to do with local storage?

Comment made from Zendesk by Zack on 2020-05-03 at 20:57:

I cannot access the address you provided. Is it internal?

The helm chart was updated recently (1.2.0 for 1.21.0).

For now I made a new helm chart issue here: #94 in case it's a problem with how the services are set up.
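
Also, since the worker error is a DNS failure on the host name `netdata`, a quick way to narrow it down is to check what the parent service is actually called and whether that name resolves from inside a worker pod (a sketch; pod names are placeholders, and use nslookup or ping instead of getent if that is what the image ships):

```sh
# Which services did the chart create, and in which namespace?
kubectl get svc --all-namespaces | grep netdata

# Does the name the workers stream to resolve from inside a worker pod?
kubectl exec <worker-pod-name> -- getent hosts netdata

# If the service lives in another namespace, the short name "netdata" will not
# resolve; the fully qualified form is netdata.<namespace>.svc.cluster.local
```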

Comment made from Zendesk by Zack on 2020-05-04 at 15:34:

Hi Pegah,

Just to keep you updated: We've decided to continue the discussion in issue 8847. There are a few things for us to test out - thank you for reporting this, you are helping make Netdata better!
Have a look at the latest comments and let me know if some of those suggestions work for you.

Comment made from Zendesk by Pegah Rahimian on 2020-05-04 at 21:05:

Hi,
Thanks, here is what we see:

1) This issue is not concerning us, as we are not using netdata cloud, so that's not a big deal. I just asked you because it seemed abnormal to me.

2) We want to encrypt intra-cluster communication, but we could not set up master-slave TLS successfully yet.

It would be great if you could let me know the result of your testing, as we are still having trouble with the streaming configuration, and I don't have any idea how to reach all the slaves after enabling TLS. Are you able to try that during your testing phase?

Thanks,
Pegah

Comment made from Zendesk by Zack on 2020-05-04 at 22:21:

I will let you know once our testing is complete. If we run into unexpected problems, the discussion will be in the issue I linked previously.
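
In case it helps in the meantime: outside of the chart, plain Netdata master/slave streaming over TLS is normally enabled by appending `:SSL` to the destination on each slave and giving the master a certificate. The sketch below uses placeholder host names, api key and cert paths, and in the helm chart these files would be supplied through the chart's configuration values rather than written by hand, so please check the streaming documentation for your version:

```sh
# On each slave: stream.conf (sketch)
cat > /etc/netdata/stream.conf <<'EOF'
[stream]
    enabled = yes
    destination = netdata-master:19999:SSL
    api key = 11111111-2222-3333-4444-555555555555    # placeholder - use your own
    # only if the master uses a self-signed certificate:
    #ssl skip certificate verification = yes
EOF

# On the master: point the web server at a key/certificate pair (netdata.conf sketch)
cat >> /etc/netdata/netdata.conf <<'EOF'
[web]
    ssl key = /etc/netdata/ssl/key.pem
    ssl certificate = /etc/netdata/ssl/cert.pem
EOF
```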

Comment made from Zendesk by Pegah Rahimian on 2020-05-19 at 22:54:

Dear Zack,

Is there someone I can ask about how to enable the weblog module in the netdata helm chart?
I want to see the upstream_response_time chart in netdata (probably for nginx).
I have enabled the go.d plugin in values.yaml, but there is still no upstream_response_time chart in the UI.
Do I need any other configuration?

Thanks again,
Pegah

Comment made from Zendesk by Zack on 2020-05-20 at 07:29:

You need to configure both nginx and Netdata and make sure to restart both. Also see https://learn.netdata.cloud/docs/agent/collectors/python.d.plugin/nginx/

I will look into the specific value and the helm chart and let you know.
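
In the meantime, roughly what is needed (a sketch, not verified against your chart version): the go.d web_log collector can only chart upstream_response_time if nginx actually writes that field to its access log, so nginx needs a custom log_format and the collector needs a job pointed at that log. In the helm chart the custom go.d config is usually supplied through the chart's configuration values, and the Netdata pod must also be able to read the nginx log file; check the chart README for the exact keys.

```sh
# 1) nginx side: add timing fields to the access log, then reload nginx
#    log_format netdata '$remote_addr "$request" $status $body_bytes_sent '
#                       '$request_time $upstream_response_time';
#    access_log /var/log/nginx/access.log netdata;

# 2) Netdata side: a minimal go.d/web_log.conf job reading that log
cat > web_log.conf <<'EOF'
jobs:
  - name: nginx
    path: /var/log/nginx/access.log
EOF
# Restart the Netdata agent (or pod) after the config change.
```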

Comment made from Zendesk by Zack on 2020-05-20 at 16:50:

Closing in favor of #114