oakestra/oakestra-net

Strange netcat flakiness (maybe?)


Short

When using a netcat client and server setup in Oakestra, flakiness occurs: if you send 20 messages, several of them get dropped with no clear pattern or explanation. (This was my observation a week ago; when I replicated the setup now, the problem no longer occurred, see the updates below.)

Deeper description of the bug

Concrete Example

Let's take service C (client) and S (server).
(Note: There are multiple different netcat implementations. I used "netcat-traditional".)
We run this script on S

#!/bin/bash
while true
do
    # Make sure to use -p, otherwise no message gets propagated.
    # -w terminates the server session after 5 s; this avoids getting stuck on a broken/stale connection.
    nc -l -p 99 -w 5
done

And this cmd on C:

# -q is used to terminate the client after sending the message to avoid getting the client stuck.
for i in {1..20}; do echo $i | nc 10.30.27.3 99 -q 1; done
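
To make dropped messages easier to count, here is a variant I would use for debugging (my own sketch, not part of the original setup; it assumes the server container can write to a file called server.log):

#!/bin/bash
# Sketch only: same server loop as above, but every received line is appended
# to server.log so gaps can be checked after the client has finished.
while true
do
    nc -l -p 99 -w 5 >> server.log
done

# Afterwards (on S), report which of the 20 numbers never arrived:
for i in {1..20}; do grep -qx "$i" server.log || echo "missing: $i"; done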

This screenshot shows outputs of the flaky behavior; on my local setup (non-Oakestra) there is a smooth flow from 1-20.

Update

I have pushed a custom image (ghcr.io/malyuk-a/netcat:testing) that has exactly this netcat version installed and also contains both scripts, for easier testing.
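
To double-check which netcat build actually ends up being used inside such a container, this is roughly what I would run (my own suggestion, not from the original report; it assumes a Debian-based image where netcat-traditional provides nc):

# Resolve which binary "nc" actually points to and print the installed package version.
readlink -f "$(command -v nc)"
dpkg -s netcat-traditional | grep '^Version'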

Here is the SLA:

{
  "sla_version": "v2.0",
  "customerID": "Admin",
  "applications": [
    {
      "applicationID": "",
      "application_name": "app",
      "application_namespace": "test",
      "application_desc": "",
      "microservices": [
        {
          "microserviceID": "",
          "microservice_name": "server",
          "microservice_namespace": "test",
          "virtualization": "container",
          "cmd": ["bash", "server.sh"],
          "memory": 100,
          "vcpus": 1,
          "storage": 0,
          "code": "ghcr.io/malyuk-a/netcat:testing",
          "addresses": {
            "rr_ip": "10.30.27.3"
          }
        },
        {
          "microserviceID": "",
          "microservice_name": "client",
          "microservice_namespace": "test",
          "virtualization": "container",
          "cmd": ["bash", "client.sh"],
          "memory": 100,
          "vcpus": 1,
          "storage": 0,
          "code": "ghcr.io/malyuk-a/netcat:testing",
          "one_shot": true
        }
      ]
    }
  ]
}
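
As a quick sanity check before deploying (my own addition; it assumes the SLA above is saved as sla.json and that jq is available), the defined microservices can be listed like this:

# Confirm the SLA parses as JSON and that both services (server and client) are present.
jq -r '.applications[].microservices[].microservice_name' sla.json

If the file contains a JSON error, jq fails loudly here instead of the deployment misbehaving later.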

Interestingly enough, I can no longer replicate that flakiness ...
Example output:

...
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
...

I have no idea why, or what is going on differently now all of a sudden. When I first experienced the bug mentioned above, I was able to replicate it multiple times over a span of ~2-3 days.

I do see rather strange behavior when redeploying these services. I am not sure whether this is related, or whether it is caused by the CLI tool I use, which handles these deployments/creations very quickly.

Solution

We (@giobart, @smnzlnsk, @Malyuk-A) had a look and could not spot any errors in the NetManager logs, so this needs deeper analysis.

Update

Right now I simply want to know what others can observe. When you run the same SLA, do you see a smooth flow from 1-20, or do you see gaps? If multiple people do not see any gaps, I guess we can close this issue.
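
For anyone reproducing this, a quick way to report results objectively (my own suggestion, assuming the server output was captured to server.log as in the sketch further above) is to count how often each number arrived:

# Each number should appear once per client run; gaps or duplicates stand out immediately.
sort -n server.log | uniq -c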

Status

Replicated and discussed. Needs deeper analysis to figure out what is going wrong.
Why this might be critical: if a classic tool like netcat is not working properly, who knows how other tools behave? This could very much interfere with practical/scientific work.

Update

Let's see, maybe this was a very strange anomaly on my local system's side.

Checklist

  • Discussed
  • Replicated
  • Solved
  • Tested

Update:

I have tested these things one more time and, at least with netcat version [v1.10-41.1], the strange errors did not reappear.

I will now close this issue; it will be kept in the repository as an archive.
If a similar issue comes up in the future, this one can be re-evaluated, or at least its SLA can be reused.