BiznetGIO/RESTKnot

Knot DNS socket issue

Closed this issue · 2 comments

We always experience this socket issue.
There will always be a time when libknot refuses to work.
It occurs once or twice every month.
We cannot reproduce the issue on demand, but it keeps happening.
Usually it happens when there are many requests to libknot, but sometimes it also happens when the request rate is low.

When it happens, most of the time knotd refuses to start.
We need to export the zones, remove some of the data in /var/lib/knot, import them again, then start knotd.
This has been our "hack" for the past 16 months.
It happens so often that we wrote a "known problem" entry for this case.

We have tried upgrading to v3.0.4, but the problem persists.

KnotCtlError: connection reset (data: None)
ValueError: Can't connect to knot socket: operation not permitted (data: None)
ValueError: Can't connect to knot socket: OS lacked necessary resources (data: None)

The workaround for the three error messages above is (a rough sketch follows after this list):

  1. Restart the Python app. If that fails ->
  2. Stop knotd, then restart it. If it still refuses ->
  3. Export all zones -> remove all data in /var/lib/knot/confdb, timers, and journal -> then start knotd again.
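
For illustration, the wipe-and-restart part of that hack can be scripted roughly as below. This is only a sketch under assumptions: knotd runs as the systemd service knot, its state lives under /var/lib/knot, and the zone export/import steps are left out because they depend on how your zones are managed.

import shutil
import subprocess
from pathlib import Path

# The state we end up removing in step 3 of the workaround.
STATE_DIRS = ["confdb", "journal", "timers"]

def wipe_and_restart(knot_var="/var/lib/knot"):
    # Assumption: knotd is managed by systemd under the service name "knot".
    subprocess.run(["systemctl", "stop", "knot"], check=True)
    for name in STATE_DIRS:
        path = Path(knot_var) / name
        if path.exists():
            shutil.rmtree(path)  # drop confdb/journal/timers
    subprocess.run(["systemctl", "start", "knot"], check=True)

if __name__ == "__main__":
    wipe_and_restart()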

Fortunately, the third step always works.
But I hope there will be a better solution (one that stops it from happening in the first place).

Thanks.

Yes, indeed this issue has been our top priority for the past 16 months.
It is a long-standing issue for us.

Our backlog is empty now, so this week we will revisit this issue in more depth.


This week we discussed the issue with Daniel Salzman (the current Knot DNS maintainer) in the Knot DNS Gitter channel. He replied in just a second, as we expected.
We always ask him how he replies so fast, and he always responds, "Because Knot is fast :)".
Fortunately, he is very helpful and kind.

After some discussion, he pointed us to a similar issue. He suggested increasing the socket timeout and enabling the blocking flag (-b).

We had never used either of them, so we hope one of them will be the solution.

From the beginning, we had never used the flags and filter parameters in libknot's send_block because of the lack of documentation, which the Knot DNS maintainer admits.

Daniel showed us the types and available options directly from the Knot DNS source code.
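
For context, this is roughly how a zone-set command goes through libknot's control interface. A minimal sketch assuming the libknot Python bindings; the socket path and zone values are placeholders, and depending on the bindings version the shared libknot library may need to be loaded explicitly before this will run.

import libknot.control

ctl = libknot.control.KnotCtl()
ctl.connect("/var/run/knot/knot.sock")  # placeholder socket path
ctl.set_timeout(60)                     # seconds

try:
    # "flags" and "filter" are the sparsely documented parameters mentioned
    # above; "B" is the control flag behind knotc's -b (block until the
    # operation finishes).
    ctl.send_block(cmd="zone-set", zone="example.com.", owner="@",
                   rtype="NS", ttl="3600", data="ns1.example.com.",
                   flags="B")
    response = ctl.receive_block()
    print(response)
finally:
    ctl.send(libknot.control.KnotCtlType.END)
    ctl.close()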

With enough information, we rushed into the experiments immediately.

A Journey to solve our Knot DNS long-standing issue

These issues happen only once or twice a month, so they are hard to reproduce.

KnotCtlError: connection reset (data: None)
ValueError: Can't connect to knot socket: operation not permitted (data: None)
ValueError: Can't connect to knot socket: OS lacked necessary resources (data: None)

Fortunately, we had added knot_exporter to each machine running Knot. And every time we add a zone, either the second or the third error above comes up.

We told Salzman that we can now reproduce the issue above at any given moment, so we are ready to provide any log he needs to investigate it further.

Meanwhile, we also tried playing with -b (the block flag) and -t (the socket timeout) on our side.

--- 🌵 🌵 🌵 ---

Each time we add a new zone, this is what happens:

img

First experiment

First, we tried adding B to the flags parameter. Now the JSON data becomes:

{
    "cmd": "zone-set",
    "zone": NULL,
    "owner": "@",
    "rtype": "NS",
    "ttl": "3600",
    "data": "dua.dns.id.",
+    "flags": ["B"]
}

But the problem persists.

ValueError: Can't connect to knot socket: operation not permitted (data: None)

Second experiment

The first experiment failed. Our second experiment was to add B in the knot_exporter itself.
Surprisingly, ghedo/knot_exporter uses F as its flag by default.
We removed the flag, but the problem persists.
We also changed the F to B; that didn't work either.

The error is always the same:

ValueError: Can't connect to knot socket: operation not permitted (data: None)

Third experiment

The flags didn't help, so we tried out the -t timeout.
We increased the default timeout in knot_exporter from 2000 ms to 5000 ms.
We got one improvement: the knot_exporter keeps running, but the Python libknot app exited.

Now with a different error message:

ValueError: Can't connect to knot socket: OS lacked necessary resources (data: None)

🔥 Now our Python libknot app refuses to start. It always exits after processing a couple of zones, with the same error as above.
We realized that our Python libknot app (from now on let's call it restknot-agent, or agent for short) doesn't set a timeout.
So we added it:

knot_ctl.connect(knot_socket_path)
+knot_ctl.set_timeout(knot_socket_timeout)

The default is 60 seconds, so we tried increasing it to 1000. That didn't work. We also tried setting it to 0 (infinite), but that didn't work either.
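
For reference, the whole step we were tuning looks roughly like this. A sketch only: knot_socket_path and knot_socket_timeout mirror the placeholder settings in the diff above, and the retry loop is an illustration of avoiding a full app restart, not what the agent actually does.

import time
import libknot.control

def connect_ctl(knot_socket_path, knot_socket_timeout=60, retries=3):
    # Try a few times before giving up; libknot raises KnotCtlError when the
    # control socket cannot be reached.
    last_err = None
    for attempt in range(retries):
        ctl = libknot.control.KnotCtl()
        try:
            ctl.connect(knot_socket_path)
            ctl.set_timeout(knot_socket_timeout)  # 0 was our "infinite" attempt
            return ctl
        except libknot.control.KnotCtlError as exc:
            last_err = exc
            time.sleep(2 ** attempt)  # simple backoff before retrying
    raise last_err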

Fourth experiment

We thought it might be because we only have 2 GB of RAM. Okay, we resized the machine to 4 GB.
It didn't help.

Fifth experiment

Because there was no way for the agent to start, we decided to remove all the zones in our Kafka broker,
since the agent loads the previous zones when it starts.
We suspected it was because of the B flag.

With the fresh environment, the agent starts smoothly. But indeed, after adding a zone, it exited with the same error message.

ValueError: Can't connect to knot socket: operation not permitted (data: None)
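
As an aside, this replay behaviour is why wiping the broker matters: on startup the agent re-consumes the stored zone messages, so a stale message carrying the B flag reaches knotd again. Below is a minimal sketch of that startup path, assuming the kafka-python client; the broker address and topic name are placeholders, not RESTKnot's actual configuration.

import json
from kafka import KafkaConsumer  # assumption: kafka-python client
import libknot.control

def replay_zone_messages(ctl, brokers=("localhost:9092",), topic="zone-commands"):
    # Re-read the topic from the oldest message and push each one to knotd.
    consumer = KafkaConsumer(topic,
                             bootstrap_servers=list(brokers),
                             auto_offset_reset="earliest",
                             consumer_timeout_ms=5000)
    for record in consumer:
        msg = json.loads(record.value)  # a JSON message like the one shown earlier
        ctl.send_block(cmd=msg["cmd"],
                       zone=msg.get("zone"),
                       owner=msg.get("owner"),
                       rtype=msg.get("rtype"),
                       ttl=msg.get("ttl"),
                       data=msg.get("data"),
                       flags="".join(msg.get("flags", [])) or None)
        ctl.receive_block()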

We stopped for a while, took a deep breath, and noticed that operation not permitted might be related to OS permissions.
We know that the agent uses libknot, which talks to the knotd socket, and we can't do that as our current user.
That's why we have had to run it with sudo agent.py since we first developed this.
When it came to Docker, we had the same problem, so we added :Z to the volume, and it worked right away.
But this only happens on CentOS; on Debian the problem doesn't occur even without the :Z suffix.

    volumes:
-      - /var/run/knot/:/var/run/knot/
+      - /var/run/knot/:/var/run/knot/:Z

This is what we have used on our machines for 16 months, and it keeps producing errors.
Now we think it's time to revisit these OS permissions.

🟢 First, we tried using user: knot in our docker-compose configuration.
It doesn't work:

Cannot start service agent: linux spec user: unable to find user knot: no matching entries in passwd file

🟢 Then we tried the UID alone, user: "996", and the UID:GID pair, user: "996:993".
Neither worked.

A dirty solution would be to change the ownership of every file owned by Knot DNS to root or our current user.
But it's not recommended; Salzman condemned this hack several months ago.
So we avoided going down this dirty path.

Setting it to root (user: root) also didn't work. It always gives the same
error message as before.

🟢 So, we revisited the holy :Z.

It turns out the capitalized :Z gives the volume content a private, unshared label,
while the lowercase :z labels the content so that it can be shared between containers.

We rushed to change every /var/run/knot/:/var/run/knot/:Z to /var/run/knot/:/var/run/knot/:z.
We hit the Send button to add a new zone, and it turns out every container (both the agent and knot_exporter) is still standing strong, even after several hits.

Now everything is solved 🎉.

img

The mystery

Why did it take such a long time to notice?

The strange thing is: even if we run the exact same container with the exact same configs,
only one of our 12 nodes happens to have the issue. So we never thought the root cause was the config.

The same thing happened with our experiment above: only 3 of 4 nodes died, even though the fourth has the exact same config as the rest.

Closing

It has been just an hour since the experiment. Let's see what happens after days with many more hits.
If the same problem appears, I might try playing with -b and -t again.

Our previous assumption was right: the root cause of this problem is the socket. But we blamed Knot, and it turns out it's our config that is to blame. We ignored the config because the issue only happened on one of the containers, as we explained in The mystery section.

In the second experiment, we also replaced ghedo/knot_exporter with salzmdan/knot_exporter (a better maintained and more featureful fork from the Knot DNS maintainer), thinking that the problem lay in ghedo/knot_exporter. But the problem persisted, so it doesn't have anything to do with the knot_exporter. After realizing that the holy :Z was the problem, we deployed a mix of ghedo/knot_exporter and salzmdan/knot_exporter to the nodes along with the agent. It turns out all of them are still standing strong. This was to make sure that the root cause is exactly the holy :Z.

Bonus

Out of curiosity, after everything settled and was running smoothly with the lowercase :z fix,
we wanted to re-apply the B flag and see how it goes.

🔥 Three of our four nodes died, and the bad news is they refuse to start again.
So our earlier suspicion that the B flag is also a cause was basically right.
Now we need to remove all the messages with the B flag from our broker and start the agent again.

Because the Prometheus targets dashboard only checks whether /metrics is reachable,
it will always say up even if our agent dies.

So we opened multiple tmux panes, ran watch docker ps, and then hit the nodes by creating several new zones.
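
The tmux-and-watch routine can also be approximated with a small polling script; a sketch, assuming the container names contain "agent" and "knot_exporter":

import subprocess
import time

WATCHED = ["agent", "knot_exporter"]  # name fragments of the containers we care about

def running_containers():
    # Equivalent of `docker ps`, returning just the container names.
    out = subprocess.run(["docker", "ps", "--format", "{{.Names}}"],
                         capture_output=True, text=True, check=True)
    return set(out.stdout.split())

while True:
    up = running_containers()
    for name in WATCHED:
        if not any(name in container for container in up):
            print(f"container matching '{name}' is down")
    time.sleep(10)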

After 10 minutes, they are still standing strong 🎉.
So I hope this fix is the end game.

img