ohwgiles/laminar

Infinite disk deadlock

sbrl opened this issue · 4 comments

sbrl commented

Hello there again!

I've upgraded to a Raspberry Pi 4 (4GB RAM) after my last 3B+ died. Unfortunately, with the latest version of Laminar I'm encountering what I think is a deadlock. It happens after a number of hours, and laminard looks like this in htop:

[screenshot: laminard as shown in htop]

The rest of the system is idle:

[screenshot: htop showing the rest of the system idle]

When it is in this state, any laminarc commands that talk via the UNIX socket hang indefinitely, as do any HTTP requests.

Edit: The only solution is to force-reboot the entire host it's running on.

Even a sudo kill -9 $(pgrep laminard) doesn't help.
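
Incidentally, kill -9 having no effect is itself a clue: signals cannot be delivered to a task in uninterruptible sleep (state D), which points at the kernel or block layer rather than a userspace deadlock. A quick way to check (a sketch; it inspects the current shell's own PID as a stand-in):

```shell
# STAT containing "D" means uninterruptible sleep: the task is blocked
# inside the kernel (often on block-layer I/O) and no signal, including
# SIGKILL, can reach it. WCHAN names the kernel function it waits in.
# Using this shell's PID as a stand-in; use $(pgrep laminard) for real.
ps -o pid=,stat=,wchan= -p "$$"
```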

  • Laminar version: laminard version 1.0-1~upstream-debian10
  • uname -a: Linux eldarion 5.10.11-v7l+ #1399 SMP Thu Jan 28 12:09:48 GMT 2021 armv7l GNU/Linux

Additional details:

$ cat /etc/systemd/system/laminar.service.d/override.conf
[Service]
PermissionsStartOnly=true
ExecStartPre=-/bin/mkdir /run/laminar
ExecStartPre=-/bin/chown laminar:laminar /run/laminar
ExecStartPost=-/bin/sh -c 'sleep 3; /bin/chmod 0775 /run/laminar/laminar.sock; echo "Permissions on socket set"; ls -l /run/laminar;'
#ExecStartPost=-/bin/sh -c 'while [ ! -S "/run/laminar/laminar.sock" ]; do sleep 1; done; /bin/chmod 0775 /run/laminar/laminar.sock'
$ lsb_release -a
No LSB modules are available.
Distributor ID:	Raspbian
Description:	Raspbian GNU/Linux 10 (buster)
Release:	10
Codename:	buster
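
The `sleep 3` in the override above can race with laminard creating the socket; the commented-out poll loop avoids that. A minimal sketch of the same idea as a reusable POSIX-sh function (the function name is mine, not part of laminar):

```shell
# Poll for a UNIX socket once per second instead of sleeping a fixed
# interval, so the subsequent chmod cannot race with socket creation.
# Usage: wait_for_socket PATH TIMEOUT_SECONDS
wait_for_socket() {
    _elapsed=0
    while [ "$_elapsed" -lt "$2" ]; do
        [ -S "$1" ] && return 0   # success: PATH exists and is a socket
        sleep 1
        _elapsed=$((_elapsed + 1))
    done
    return 1                      # timed out
}

# e.g.: wait_for_socket /run/laminar/laminar.sock 30 \
#           && chmod 0775 /run/laminar/laminar.sock
```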

I'm now using Consul to monitor Laminar. The service definition looks like this:

services {
	name = "laminar"
	tags = [
		"infrastructure"
	]
	address = "eldarion.node.mooncarrot.space"
	port = 3050
	
	checks {
		id = "laminar-http"
		name = "Laminar CI HTTP"
		http = "http://eldarion.node.mooncarrot.space:3050/"
		method = "GET"
		interval = "60s"
		timeout = "5s"	
	}
	checks {
		id = "laminar-unix-sock"
		name = "Laminar CI Unix Socket"
		args = [ "/usr/bin/ls", "/run/laminar/laminar.sock" ],
		interval = "30s"
		timeout = "2s"
	}
}
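
For what it's worth, the ls check above kept passing while laminard was hung, because the socket file still existed. A script check that fails when laminard enters uninterruptible sleep (the D state the kernel logged below) might catch this state directly; a sketch, with the id and name being placeholders:

```hcl
checks {
	id = "laminar-dstate"
	name = "Laminar CI not in uninterruptible sleep"
	# Fails when laminard's process state contains D (blocked in kernel).
	args = [
		"/bin/sh", "-c",
		"! ps -o stat= -p $(pgrep -o laminard) | grep -q D"
	]
	interval = "30s"
	timeout = "5s"
}
```

Note this check passes when laminard isn't running at all, so it complements rather than replaces the existing socket and HTTP checks.
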
/etc/laminar.conf

###
### LAMINAR_HOME
###
### Root location containing laminar configuration, database,
### build workspaces and archive.
###
### Default: /var/lib/laminar
###
LAMINAR_HOME=/srv/laminar

###
### LAMINAR_BIND_HTTP
###
### Interface on which laminard will bind to serve the Web UI.
### May be of the form IP:PORT, unix:PATH/TO/SOCKET or unix-abstract:NAME
###
### Default: *:8080
###
# We are reverse-proxied by magicbag
# This is ok, because ufw blocks access from anywhere that isn't over the wireguard VPN
LAMINAR_BIND_HTTP=*:3050

###
### LAMINAR_BIND_RPC
###
### Interface on which laminard will bind to accept RPC from laminarc.
### May be of the form IP:PORT, unix:PATH/TO/SOCKET or unix-abstract:NAME
###
### Default: unix-abstract:laminar
###
LAMINAR_BIND_RPC=unix:/run/laminar/laminar.sock

###
### LAMINAR_TITLE
###
### Page title to show in web frontend
###
LAMINAR_TITLE=SBRL CI Service - Laminar

###
### LAMINAR_KEEP_RUNDIRS
###
### Setting this prevents the immediate deletion of job rundirs
### $LAMINAR_HOME/run/$JOB/$RUN. Value should be an integer representing
### the number of rundirs to keep.
###
### Default: 0
###
#LAMINAR_KEEP_RUNDIRS=0

###
### LAMINAR_BASE_URL
###
### Base url for the frontend. This affects the <base href> tag and needs
### to be set if Laminar runs behind a reverse-proxy that hosts Laminar
### within a subfolder (rather than at a subdomain root)
###
#LAMINAR_BASE_URL=/

###
### LAMINAR_ARCHIVE_URL
###
### Base url used to request artifacts. Laminar can serve build
### artifacts (and it will if you leave this unset), but it
### uses a very naive and inefficient method. Best to let a real
### webserver handle serving those requests.
###
LAMINAR_ARCHIVE_URL=https://ci.starbeamrainbowlabs.com/archive/

Relevant entries from /var/log/syslog

Feb 23 21:13:42 eldarion consul[948]:     2021-02-23T21:13:42.103Z [WARN]  agent: Check is now critical: check=laminar-http
Feb 23 21:13:42 eldarion consul[948]:     2021-02-23T21:13:42.518Z [INFO]  agent: Synced check: check=laminar-http
Feb 23 21:14:47 eldarion consul[948]:     2021-02-23T21:14:47.104Z [WARN]  agent: Check is now critical: check=laminar-http
Feb 23 21:15:52 eldarion consul[948]:     2021-02-23T21:15:52.107Z [WARN]  agent: Check is now critical: check=laminar-http
Feb 23 21:15:56 eldarion kernel: [429713.731978] INFO: task laminard:787 blocked for more than 122 seconds.
Feb 23 21:15:56 eldarion kernel: [429713.732027] task:laminard        state:D stack:    0 pid:  787 ppid:     1 flags:0x00000001
Feb 23 21:16:57 eldarion consul[948]:     2021-02-23T21:16:57.108Z [WARN]  agent: Check is now critical: check=laminar-http
Feb 23 21:17:59 eldarion kernel: [429836.614718] INFO: task laminard:787 blocked for more than 245 seconds.
Feb 23 21:17:59 eldarion kernel: [429836.614766] task:laminard        state:D stack:    0 pid:  787 ppid:     1 flags:0x00000001
Feb 23 21:18:02 eldarion consul[948]:     2021-02-23T21:18:02.109Z [WARN]  agent: Check is now critical: check=laminar-http
Feb 23 21:19:07 eldarion consul[948]:     2021-02-23T21:19:07.111Z [WARN]  agent: Check is now critical: check=laminar-http
Feb 23 21:20:01 eldarion kernel: [429959.495982] INFO: task laminard:787 blocked for more than 368 seconds.
Feb 23 21:20:01 eldarion kernel: [429959.496030] task:laminard        state:D stack:    0 pid:  787 ppid:     1 flags:0x00000001
Feb 23 21:20:12 eldarion consul[948]:     2021-02-23T21:20:12.112Z [WARN]  agent: Check is now critical: check=laminar-http
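
Those kernel lines come from the hung-task watchdog: it scans for tasks stuck in uninterruptible (D) sleep and warns once they have been blocked longer than kernel.hung_task_timeout_secs (120s by default), which is why the reported durations climb in roughly two-minute steps. Two related knobs worth knowing while it's wedged (the sysrq dump needs root):

```shell
# Threshold (in seconds) for the "blocked for more than N seconds"
# warnings seen in the syslog above; 0 disables the watchdog.
cat /proc/sys/kernel/hung_task_timeout_secs

# As root, this dumps the kernel stacks of all D-state tasks to dmesg,
# showing exactly which kernel path laminard is stuck in:
#   echo w > /proc/sysrq-trigger
```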

Edit 2: lsof of laminard while deadlocked:

[screenshot: lsof output for the deadlocked laminard]

ohwgiles commented

Hmm. Laminar is single-threaded, so it cannot deadlock in the usual sense. Most likely it is stuck in a system call. It would be interesting if you could launch laminar under strace, but note that this requires patching run.cpp to call the real laminard (usually /usr/sbin/laminard) instead of /proc/self/exe.
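
Once run.cpp is patched, an invocation along these lines keeps the useful context (the flags are standard strace; the output path is arbitrary):

```shell
# Trace all system calls with microsecond timestamps (-tt), following
# forked children (-f, so job processes are covered too), writing to a
# log file (-o). The last lines before the freeze show the blocking call.
# TARGET would be /usr/sbin/laminard; it defaults to /bin/true here just
# so the sketch is runnable as-is.
TARGET=${TARGET:-/bin/true}
strace -f -tt -o /tmp/laminard.strace "$TARGET"
```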

Are you using the SD card on your Pi? They are notoriously unreliable, and if laminar's workdir is mounted there, this might be the cause of the problem: stuck block-layer operations are a typical cause of the kind of freeze that kill -9 won't get you out of.

As well as the strace test, I would suggest trying to mount laminar's home (including the sqlite database) on trustworthy external storage and seeing whether the problem is reproducible.

sbrl commented

Oops, didn't see that you'd replied here! Sorry about the wait.

I'd be happy to switch to a new binary if it were precompiled.

Nope, I'm actually using an external HDD. It passed its latest SMART test too. I store laminar data in /srv/laminar, which is a symlink to a location on the external 1TB WD PiDrive.

ohwgiles commented

laminar_1.0-10-g9b8c3762-dirty-1~upstream-debian10_armhf.zip

Here's a precompiled copy for you that can be run under strace. strace will produce lots of output, but it would be useful to see the last few lines before it locks up.

sbrl commented

Thanks! I'll keep my laminar instance under observation. After your last comment, I discovered an ext4 filesystem corruption issue on the drive I stored my laminar data on. It was very strange though, because:

  • fsck.ext4 came back clean
  • A read-only badblocks test was clean
  • Taking a copy of the disk and uploading it to another machine where I mounted it via loopback worked fine

...so I've copied the laminar directory off and mounted it onto the CI server via NFS, and I'm keeping a close eye on it. If it locks up again, I'll try tracing it with strace using the binary you've provided.

I'll close this issue for now, but if it happens again and I strace it, I'll reopen with more info.

Thanks for the help!