Ylianst/MeshAgent

MeshAgent on Linux crashing when in use

EDIflyer opened this issue ยท 7 comments

I'm finding MeshAgent in general runs fine on my Pi 400 however if I'm connected I find it keeps disconnecting every so often (CPU/memory usage looks OK).

The journalctl -u meshagent command shows:

May 06 19:59:26 pi400 systemd[1]: Started meshagent background service.
May 06 20:01:40 pi400 systemd[1]: meshagent.service: Main process exited, code=exited, status=254/n/a
May 06 20:01:40 pi400 systemd[1]: meshagent.service: Failed with result 'exit-code'.
May 06 20:01:43 pi400 systemd[1]: meshagent.service: Service RestartSec=3s expired, scheduling restart.
May 06 20:01:43 pi400 systemd[1]: meshagent.service: Scheduled restart job, restart counter is at 10.
May 06 20:01:43 pi400 systemd[1]: Stopped meshagent background service.
May 06 20:01:43 pi400 systemd[1]: Started meshagent background service.
May 06 20:02:48 pi400 systemd[1]: meshagent.service: Main process exited, code=exited, status=254/n/a
May 06 20:02:48 pi400 systemd[1]: meshagent.service: Failed with result 'exit-code'.
May 06 20:02:51 pi400 systemd[1]: meshagent.service: Service RestartSec=3s expired, scheduling restart.
May 06 20:02:51 pi400 systemd[1]: meshagent.service: Scheduled restart job, restart counter is at 11.
May 06 20:02:51 pi400 systemd[1]: Stopped meshagent background service.
May 06 20:02:51 pi400 systemd[1]: Started meshagent background service.
May 06 20:04:15 pi400 systemd[1]: meshagent.service: Main process exited, code=exited, status=254/n/a
May 06 20:04:15 pi400 systemd[1]: meshagent.service: Failed with result 'exit-code'.
May 06 20:04:18 pi400 systemd[1]: meshagent.service: Service RestartSec=3s expired, scheduling restart.
May 06 20:04:18 pi400 systemd[1]: meshagent.service: Scheduled restart job, restart counter is at 12.
May 06 20:04:18 pi400 systemd[1]: Stopped meshagent background service.
May 06 20:04:18 pi400 systemd[1]: Started meshagent background service.
May 06 20:05:40 pi400 systemd[1]: meshagent.service: Main process exited, code=exited, status=254/n/a
May 06 20:05:41 pi400 systemd[1]: meshagent.service: Failed with result 'exit-code'.
May 06 20:05:44 pi400 systemd[1]: meshagent.service: Service RestartSec=3s expired, scheduling restart.
May 06 20:05:44 pi400 systemd[1]: meshagent.service: Scheduled restart job, restart counter is at 13.
May 06 20:05:44 pi400 systemd[1]: Stopped meshagent background service.
May 06 20:05:44 pi400 systemd[1]: Started meshagent background service.
May 06 20:09:25 pi400 systemd[1]: meshagent.service: Main process exited, code=exited, status=254/n/a
May 06 20:09:25 pi400 systemd[1]: meshagent.service: Failed with result 'exit-code'.
May 06 20:09:28 pi400 systemd[1]: meshagent.service: Service RestartSec=3s expired, scheduling restart.
May 06 20:09:28 pi400 systemd[1]: meshagent.service: Scheduled restart job, restart counter is at 14.
May 06 20:09:28 pi400 systemd[1]: Stopped meshagent background service.
May 06 20:09:28 pi400 systemd[1]: Started meshagent background service.
May 06 20:11:25 pi400 systemd[1]: meshagent.service: Main process exited, code=exited, status=254/n/a
May 06 20:11:25 pi400 systemd[1]: meshagent.service: Failed with result 'exit-code'.
May 06 20:11:28 pi400 systemd[1]: meshagent.service: Service RestartSec=3s expired, scheduling restart.
May 06 20:11:28 pi400 systemd[1]: meshagent.service: Scheduled restart job, restart counter is at 15.
May 06 20:11:28 pi400 systemd[1]: Stopped meshagent background service.
May 06 20:11:28 pi400 systemd[1]: Started meshagent background service.
May 06 20:32:08 pi400 systemd[1]: meshagent.service: Main process exited, code=exited, status=254/n/a
May 06 20:32:08 pi400 systemd[1]: meshagent.service: Failed with result 'exit-code'.
May 06 20:32:11 pi400 systemd[1]: meshagent.service: Service RestartSec=3s expired, scheduling restart.
May 06 20:32:11 pi400 systemd[1]: meshagent.service: Scheduled restart job, restart counter is at 16.
May 06 20:32:11 pi400 systemd[1]: Stopped meshagent background service.
May 06 20:32:11 pi400 systemd[1]: Started meshagent background service.
May 06 20:33:25 pi400 systemd[1]: meshagent.service: Main process exited, code=exited, status=254/n/a
May 06 20:33:25 pi400 systemd[1]: meshagent.service: Failed with result 'exit-code'.
May 06 20:33:28 pi400 systemd[1]: meshagent.service: Service RestartSec=3s expired, scheduling restart.
May 06 20:33:28 pi400 systemd[1]: meshagent.service: Scheduled restart job, restart counter is at 17.
May 06 20:33:28 pi400 systemd[1]: Stopped meshagent background service.
May 06 20:33:28 pi400 systemd[1]: Started meshagent background service.
May 06 20:36:56 pi400 systemd[1]: meshagent.service: Main process exited, code=exited, status=254/n/a
May 06 20:36:56 pi400 systemd[1]: meshagent.service: Failed with result 'exit-code'.
May 06 20:36:59 pi400 systemd[1]: meshagent.service: Service RestartSec=3s expired, scheduling restart.
May 06 20:36:59 pi400 systemd[1]: meshagent.service: Scheduled restart job, restart counter is at 18.
May 06 20:36:59 pi400 systemd[1]: Stopped meshagent background service.
May 06 20:36:59 pi400 systemd[1]: Started meshagent background service.
May 06 20:38:36 pi400 systemd[1]: meshagent.service: Main process exited, code=exited, status=254/n/a
May 06 20:38:36 pi400 systemd[1]: meshagent.service: Failed with result 'exit-code'.
May 06 20:38:39 pi400 systemd[1]: meshagent.service: Service RestartSec=3s expired, scheduling restart.
May 06 20:38:39 pi400 systemd[1]: meshagent.service: Scheduled restart job, restart counter is at 19.
May 06 20:38:39 pi400 systemd[1]: Stopped meshagent background service.
May 06 20:38:39 pi400 systemd[1]: Started meshagent background service.
May 06 22:56:47 pi400 systemd[1]: meshagent.service: Main process exited, code=exited, status=254/n/a
May 06 22:56:47 pi400 systemd[1]: meshagent.service: Failed with result 'exit-code'.
May 06 22:56:50 pi400 systemd[1]: meshagent.service: Service RestartSec=3s expired, scheduling restart.
May 06 22:56:50 pi400 systemd[1]: meshagent.service: Scheduled restart job, restart counter is at 20.
May 06 22:56:50 pi400 systemd[1]: Stopped meshagent background service.
May 06 22:56:50 pi400 systemd[1]: Started meshagent background service.

Have just checked another machine (Pi 4) and same issue, again only seems to be behaviour seen in recent months and only seems to be when sessions are active (I go on this unit a few times a month, only started noticing the problem recently)...

Dec 17 23:19:09 raspberrypi systemd[1]: Started meshagent background service.
Dec 17 23:29:59 raspberrypi systemd[1]: Stopping meshagent background service...
Dec 17 23:29:59 raspberrypi systemd[1]: meshagent.service: Succeeded.
Dec 17 23:29:59 raspberrypi systemd[1]: Stopped meshagent background service.
Dec 17 23:29:59 raspberrypi systemd[1]: meshagent.service: Consumed 3.144s CPU time.
-- Boot c9d9a1f689fd4a1ca2d1ba5060d7e1c2 --
Dec 17 23:30:26 raspberrypi systemd[1]: Started meshagent background service.
Dec 17 23:49:56 raspberrypi systemd[1]: Stopping meshagent background service...
Dec 17 23:49:57 raspberrypi systemd[1]: meshagent.service: Succeeded.
Dec 17 23:49:57 raspberrypi systemd[1]: Stopped meshagent background service.
Dec 17 23:49:57 raspberrypi systemd[1]: meshagent.service: Consumed 3.292s CPU time.
-- Boot 8a24dd12fa604f3aaafa026d6bddeef6 --
Dec 17 23:50:23 raspberrypi systemd[1]: Started meshagent background service.
Dec 18 00:01:27 raspberrypi systemd[1]: Stopping meshagent background service...
Dec 18 00:01:28 raspberrypi systemd[1]: meshagent.service: Succeeded.
Dec 18 00:01:28 raspberrypi systemd[1]: Stopped meshagent background service.
Dec 18 00:01:28 raspberrypi systemd[1]: meshagent.service: Consumed 3.075s CPU time.
-- Boot 19963496513244faae7f452af47b2b9b --
Dec 18 00:01:54 raspberrypi systemd[1]: Started meshagent background service.
Feb 25 15:15:12 raspberrypi systemd[1]: Stopping meshagent background service...
Feb 25 15:15:12 raspberrypi systemd[1]: meshagent.service: Succeeded.
Feb 25 15:15:12 raspberrypi systemd[1]: Stopped meshagent background service.
Feb 25 15:15:12 raspberrypi systemd[1]: meshagent.service: Consumed 2min 55.098s CPU time.
Feb 25 15:15:12 raspberrypi systemd[1]: Started meshagent background service.
May 07 01:15:18 raspberrypi systemd[1]: meshagent.service: Main process exited, code=exited, status=254/n/a
May 07 01:15:18 raspberrypi systemd[1]: meshagent.service: Failed with result 'exit-code'.
May 07 01:15:19 raspberrypi systemd[1]: meshagent.service: Consumed 3min 4.918s CPU time.
May 07 01:15:22 raspberrypi systemd[1]: meshagent.service: Scheduled restart job, restart counter is at 1.
May 07 01:15:22 raspberrypi systemd[1]: Stopped meshagent background service.
May 07 01:15:22 raspberrypi systemd[1]: meshagent.service: Consumed 3min 4.918s CPU time.
May 07 01:15:22 raspberrypi systemd[1]: Started meshagent background service.

The agent console logs show:

 {
  "t": 1683403540,
  "m": "** CRASH **",
  "h": "65F6FE3B530FDBC3",
  "f": "microstack/ILibParsers.c",
  "l": "2883"
 },
 {
  "t": 1683403765,
  "m": "** CRASH **",
  "h": "65F6FE3B530FDBC3",
  "f": "microstack/ILibParsers.c",
  "l": "2883"
 },
 {
  "t": 1683403885,
  "m": "** CRASH **",
  "h": "65F6FE3B530FDBC3",
  "f": "microstack/ILibParsers.c",
  "l": "2883"
 },
 {
  "t": 1683405128,
  "m": "** CRASH **",
  "h": "65F6FE3B530FDBC3",
  "f": "microstack/ILibParsers.c",
  "l": "2883"
 },
 {
  "t": 1683405205,
  "m": "** CRASH **",
  "h": "65F6FE3B530FDBC3",
  "f": "microstack/ILibParsers.c",
  "l": "2883"
 },
 {
  "t": 1683405416,
  "m": "** CRASH **",
  "h": "65F6FE3B530FDBC3",
  "f": "microstack/ILibParsers.c",
  "l": "2883"
 },
 {
  "t": 1683405516,
  "m": "** CRASH **",
  "h": "65F6FE3B530FDBC3",
  "f": "microstack/ILibParsers.c",
  "l": "2883"
 },
 {
  "t": 1683413807,
  "m": "** CRASH **",
  "h": "65F6FE3B530FDBC3",
  "f": "microstack/ILibParsers.c",
  "l": "2883"
 },
 {
  "t": 1683415096,
  "m": "** CRASH **",
  "h": "65F6FE3B530FDBC3",
  "f": "microstack/ILibParsers.c",
  "l": "2883"
 }
]

I'm still getting this - noted it more frequently now to the extent that I can't open a Desktop session (although Files/Terminal/Console all work absolutely fine). @Ylianst any help greatly appreciated!

> errorlog
[
 {
  "t": 1683422118,
  "m": "** CRASH **",
  "h": "65F6FE3B530FDBC3",
  "f": "microstack/ILibParsers.c",
  "l": "2883"
 }
]
May 21 21:17:23 raspberrypi systemd[1]: Started meshagent background service.
May 22 23:02:57 raspberrypi systemd[1]: meshagent.service: Main process exited, code=killed, status=7/BUS
May 22 23:02:57 raspberrypi systemd[1]: meshagent.service: Failed with result 'signal'.
May 22 23:02:57 raspberrypi systemd[1]: meshagent.service: Consumed 4.992s CPU time.
May 22 23:03:00 raspberrypi systemd[1]: meshagent.service: Scheduled restart job, restart counter is at 1.
May 22 23:03:00 raspberrypi systemd[1]: Stopped meshagent background service.
May 22 23:03:00 raspberrypi systemd[1]: meshagent.service: Consumed 4.992s CPU time.
May 22 23:03:00 raspberrypi systemd[1]: Started meshagent background service.
May 22 23:03:03 raspberrypi systemd[1]: meshagent.service: Main process exited, code=killed, status=7/BUS
May 22 23:03:03 raspberrypi systemd[1]: meshagent.service: Failed with result 'signal'.
May 22 23:03:03 raspberrypi systemd[1]: meshagent.service: Consumed 1.989s CPU time.
May 22 23:03:06 raspberrypi systemd[1]: meshagent.service: Scheduled restart job, restart counter is at 2.
May 22 23:03:06 raspberrypi systemd[1]: Stopped meshagent background service.
May 22 23:03:06 raspberrypi systemd[1]: meshagent.service: Consumed 1.989s CPU time.
May 22 23:03:06 raspberrypi systemd[1]: Started meshagent background service.
May 22 23:03:16 raspberrypi systemd[1]: meshagent.service: Main process exited, code=killed, status=7/BUS
May 22 23:03:16 raspberrypi systemd[1]: meshagent.service: Failed with result 'signal'.
May 22 23:03:16 raspberrypi systemd[1]: meshagent.service: Consumed 2.025s CPU time.
May 22 23:03:19 raspberrypi systemd[1]: meshagent.service: Scheduled restart job, restart counter is at 3.
May 22 23:03:19 raspberrypi systemd[1]: Stopped meshagent background service.
May 22 23:03:19 raspberrypi systemd[1]: meshagent.service: Consumed 2.025s CPU time.
May 22 23:03:20 raspberrypi systemd[1]: Started meshagent background service.
May 22 23:03:22 raspberrypi systemd[1]: meshagent.service: Main process exited, code=killed, status=7/BUS
May 22 23:03:22 raspberrypi systemd[1]: meshagent.service: Failed with result 'signal'.
May 22 23:03:22 raspberrypi systemd[1]: meshagent.service: Consumed 1.991s CPU time.
May 22 23:03:25 raspberrypi systemd[1]: meshagent.service: Scheduled restart job, restart counter is at 4.
May 22 23:03:25 raspberrypi systemd[1]: Stopped meshagent background service.
May 22 23:03:25 raspberrypi systemd[1]: meshagent.service: Consumed 1.991s CPU time.
May 22 23:03:25 raspberrypi systemd[1]: Started meshagent background service.
May 23 00:42:44 raspberrypi systemd[1]: meshagent.service: Main process exited, code=killed, status=7/BUS
May 23 00:42:44 raspberrypi systemd[1]: meshagent.service: Failed with result 'signal'.
May 23 00:42:44 raspberrypi systemd[1]: meshagent.service: Consumed 3.254s CPU time.
May 23 00:42:47 raspberrypi systemd[1]: meshagent.service: Scheduled restart job, restart counter is at 5.
May 23 00:42:47 raspberrypi systemd[1]: Stopped meshagent background service.
May 23 00:42:47 raspberrypi systemd[1]: meshagent.service: Consumed 3.254s CPU time.
May 23 00:42:47 raspberrypi systemd[1]: Started meshagent background service.
May 23 00:43:45 raspberrypi systemd[1]: meshagent.service: Main process exited, code=killed, status=7/BUS
May 23 00:43:45 raspberrypi systemd[1]: meshagent.service: Failed with result 'signal'.
May 23 00:43:45 raspberrypi systemd[1]: meshagent.service: Consumed 1.908s CPU time.
May 23 00:43:48 raspberrypi systemd[1]: meshagent.service: Scheduled restart job, restart counter is at 6.
May 23 00:43:48 raspberrypi systemd[1]: Stopped meshagent background service.
May 23 00:43:48 raspberrypi systemd[1]: meshagent.service: Consumed 1.908s CPU time.

I've tried switching coredump on but it just seems to switch itself off again...

> coredump on
coredump is now on
> coredump status
coredump is: on
-- TRY TO CONNECT TO DESKTOP HERE --
> coredump status
coredump is: off
> coreinfo
{
  "action": "coreinfo",
  "caps": 15,
  "osdesc": "Raspbian GNU/Linux 11 (bullseye)",
  "root": true,
  "users": [
    "pi"
  ],
  "value": "Nov 21 2022, 1184527535"
}

Tried an uninstall/reinstall of the agent and it made things worse - now it just keeps restarting and I can't even access the terminal!

image

Confirm workaround with thanks to @diehummel - add arm_64bit=0 flag to /boot/config.txt - will close this issue for now as covered better in his bug above.

I've traced the problem to a Bus Error caused by an unaligned access. The bad access is in ILibWebClient_ProcessWebSocketData in microstack/ILibWebClient.c, and it happens on line 1363:

			unsigned long long v = ILibNTOHLL(((unsigned long long*)(buffer + offset + 2))[0]);

buffer is aligned to (at least) a 16-byte boundary, and offset is 0, which leaves the CPU trying to read an 8-byte value from an address which is only 2-byte aligned.

The 32-bit kernel includes an exception handler that fixes up misaligned accesses. It is controlled by the config option ALIGNMENT_TRAP, which is enabled by default. The 64-bit kernel has a similar facility controlled by COMPAT_ALIGNMENT_FIXUPS, but it is disabled by default and none of the defconfigs enable it.

In other words you've been getting away with it so far, but the code is not very portable and arm64 kernels are less forgiving.

Here's a minimal but dumb patch to fix it:

diff --git a/microstack/ILibWebClient.c b/microstack/ILibWebClient.c
index f0f9b6d..3e21e10 100644
--- a/microstack/ILibWebClient.c
+++ b/microstack/ILibWebClient.c
@@ -1360,7 +1360,9 @@ int ILibWebClient_ProcessWebSocketData(char* buffer, int offset, int length, ILi
		} 
		else
		{
-			unsigned long long v = ILibNTOHLL(((unsigned long long*)(buffer + offset + 2))[0]);
+			unsigned long long v;
+			memcpy(&v, buffer + offset + 2, sizeof(v));
+			v = ILibNTOHLL(v);
			if(v > 0x7FFFFFFFUL)
			{
				// this value is too big to store in a 32 bit signed variable, so disconnect the websocket.

Alternatively you could replace the ILibNTOHLL by four calls to ntohs() which then have their results glued back together in the appropriate order.

Amazing detective work, @pelwell - thanks! @Ylianst not sure if you're able to look at including this fix?

@diehummel for interest!