ohwgiles/laminar

Infinite disk deadlock

sbrl opened this issue · 4 comments

sbrl commented

Hello there again!

I've upgraded to a Raspberry Pi 4 (4GB RAM) after my last 3B+ died. Unfortunately, with the latest version of Laminar I'm encountering what I think is a deadlock. It happens after a number of hours, and laminard looks like this in htop:

[screenshot: laminard as shown in htop]

The rest of the system is idle:

[screenshot: htop showing the rest of the system idle]

When it is in this state, any laminarc commands that talk via the UNIX socket hang indefinitely, as do any HTTP requests.

Edit: The only solution is to force-reboot the entire host it's running on.

Even a sudo kill -9 $(pgrep laminard) doesn't help.
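
Incidentally, kill -9 having no effect is itself a clue: signals cannot be delivered to a task in uninterruptible sleep (state D), which points at the kernel or block layer rather than a userspace deadlock. A quick way to check (a sketch; it inspects the current shell's own PID as a stand-in):

```shell
# STAT containing "D" means uninterruptible sleep: the task is blocked
# inside the kernel (often on block-layer I/O) and no signal, including
# SIGKILL, can reach it. WCHAN names the kernel function it waits in.
# Using this shell's PID as a stand-in; use $(pgrep laminard) for real.
ps -o pid=,stat=,wchan= -p "$$"
```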

  • Laminar version: laminard version 1.0-1~upstream-debian10
  • uname -a: Linux eldarion 5.10.11-v7l+ #1399 SMP Thu Jan 28 12:09:48 GMT 2021 armv7l GNU/Linux

Additional details:

$ cat /etc/systemd/system/laminar.service.d/override.conf
[Service]
PermissionsStartOnly=true
ExecStartPre=-/bin/mkdir /run/laminar
ExecStartPre=-/bin/chown laminar:laminar /run/laminar
ExecStartPost=-/bin/sh -c 'sleep 3; /bin/chmod 0775 /run/laminar/laminar.sock; echo "Permissions on socket set"; ls -l /run/laminar;'
#ExecStartPost=-/bin/sh -c 'while [ ! -S "/run/laminar/laminar.sock" ]; do sleep 1; done; /bin/chmod 0775 /run/laminar/laminar.sock'
$ lsb_release -a
No LSB modules are available.
Distributor ID:	Raspbian
Description:	Raspbian GNU/Linux 10 (buster)
Release:	10
Codename:	buster
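
The `sleep 3` in the override above can race with laminard creating the socket; the commented-out poll loop avoids that. A minimal sketch of the same idea as a reusable POSIX-sh function (the function name is mine, not part of laminar):

```shell
# Poll for a UNIX socket once per second instead of sleeping a fixed
# interval, so the subsequent chmod cannot race with socket creation.
# Usage: wait_for_socket PATH TIMEOUT_SECONDS
wait_for_socket() {
    _elapsed=0
    while [ "$_elapsed" -lt "$2" ]; do
        [ -S "$1" ] && return 0   # success: PATH exists and is a socket
        sleep 1
        _elapsed=$((_elapsed + 1))
    done
    return 1                      # timed out
}

# e.g.: wait_for_socket /run/laminar/laminar.sock 30 \
#           && chmod 0775 /run/laminar/laminar.sock
```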

I'm now using Consul to monitor Laminar. The service definition looks like this:

services {
	name = "laminar"
	tags = [
		"infrastructure"
	]
	address = "eldarion.node.mooncarrot.space"
	port = 3050
	
	checks {
		id = "laminar-http"
		name = "Laminar CI HTTP"
		http = "http://eldarion.node.mooncarrot.space:3050/"
		method = "GET"
		interval = "60s"
		timeout = "5s"	
	}
	checks {
		id = "laminar-unix-sock"
		name = "Laminar CI Unix Socket"
		args = [ "/usr/bin/ls", "/run/laminar/laminar.sock" ],
		interval = "30s"
		timeout = "2s"
	}
}
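
For what it's worth, the ls check above kept passing while laminard was hung, because the socket file still existed. A script check that fails when laminard enters uninterruptible sleep (the D state the kernel logged below) might catch this state directly; a sketch, with the id and name being placeholders:

```hcl
checks {
	id = "laminar-dstate"
	name = "Laminar CI not in uninterruptible sleep"
	# Fails when laminard's process state contains D (blocked in kernel).
	args = [
		"/bin/sh", "-c",
		"! ps -o stat= -p $(pgrep -o laminard) | grep -q D"
	]
	interval = "30s"
	timeout = "5s"
}
```

Note this check passes when laminard isn't running at all, so it complements rather than replaces the existing socket and HTTP checks.
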
/etc/laminar.conf

###
### LAMINAR_HOME
###
### Root location containing laminar configuration, database,
### build workspaces and archive.
###
### Default: /var/lib/laminar
###
LAMINAR_HOME=/srv/laminar

###
### LAMINAR_BIND_HTTP
###
### Interface on which laminard will bind to serve the Web UI.
### May be of the form IP:PORT, unix:PATH/TO/SOCKET or unix-abstract:NAME
###
### Default: *:8080
###
# We are reverse-proxied by magicbag
# This is ok, because ufw blocks access from anywhere that isn't over the wireguard VPN
LAMINAR_BIND_HTTP=*:3050

###
### LAMINAR_BIND_RPC
###
### Interface on which laminard will bind to accept RPC from laminarc.
### May be of the form IP:PORT, unix:PATH/TO/SOCKET or unix-abstract:NAME
###
### Default: unix-abstract:laminar
###
LAMINAR_BIND_RPC=unix:/run/laminar/laminar.sock

###
### LAMINAR_TITLE
###
### Page title to show in web frontend
###
LAMINAR_TITLE=SBRL CI Service - Laminar

###
### LAMINAR_KEEP_RUNDIRS
###
### Setting this prevents the immediate deletion of job rundirs
### $LAMINAR_HOME/run/$JOB/$RUN. Value should be an integer representing
### the number of rundirs to keep.
###
### Default: 0
###
#LAMINAR_KEEP_RUNDIRS=0

###
### LAMINAR_BASE_URL
###
### Base url for the frontend. This affects the <base href> tag and needs
### to be set if Laminar runs behind a reverse-proxy that hosts Laminar
### within a subfolder (rather than at a subdomain root)
###
#LAMINAR_BASE_URL=/

###
### LAMINAR_ARCHIVE_URL
###
### Base url used to request artifacts. Laminar can serve build
### artifacts (and it will if you leave this unset), but it
### uses a very naive and inefficient method. Best to let a real
### webserver handle serving those requests.
###
LAMINAR_ARCHIVE_URL=https://ci.starbeamrainbowlabs.com/archive/

Relevant entries from /var/log/syslog

Feb 23 21:13:42 eldarion consul[948]:     2021-02-23T21:13:42.103Z [WARN]  agent: Check is now critical: check=laminar-http
Feb 23 21:13:42 eldarion consul[948]:     2021-02-23T21:13:42.518Z [INFO]  agent: Synced check: check=laminar-http
Feb 23 21:14:47 eldarion consul[948]:     2021-02-23T21:14:47.104Z [WARN]  agent: Check is now critical: check=laminar-http
Feb 23 21:15:52 eldarion consul[948]:     2021-02-23T21:15:52.107Z [WARN]  agent: Check is now critical: check=laminar-http
Feb 23 21:15:56 eldarion kernel: [429713.731978] INFO: task laminard:787 blocked for more than 122 seconds.
Feb 23 21:15:56 eldarion kernel: [429713.732027] task:laminard        state:D stack:    0 pid:  787 ppid:     1 flags:0x00000001
Feb 23 21:16:57 eldarion consul[948]:     2021-02-23T21:16:57.108Z [WARN]  agent: Check is now critical: check=laminar-http
Feb 23 21:17:59 eldarion kernel: [429836.614718] INFO: task laminard:787 blocked for more than 245 seconds.
Feb 23 21:17:59 eldarion kernel: [429836.614766] task:laminard        state:D stack:    0 pid:  787 ppid:     1 flags:0x00000001
Feb 23 21:18:02 eldarion consul[948]:     2021-02-23T21:18:02.109Z [WARN]  agent: Check is now critical: check=laminar-http
Feb 23 21:19:07 eldarion consul[948]:     2021-02-23T21:19:07.111Z [WARN]  agent: Check is now critical: check=laminar-http
Feb 23 21:20:01 eldarion kernel: [429959.495982] INFO: task laminard:787 blocked for more than 368 seconds.
Feb 23 21:20:01 eldarion kernel: [429959.496030] task:laminard        state:D stack:    0 pid:  787 ppid:     1 flags:0x00000001
Feb 23 21:20:12 eldarion consul[948]:     2021-02-23T21:20:12.112Z [WARN]  agent: Check is now critical: check=laminar-http
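
Those kernel lines come from the hung-task watchdog: it scans for tasks stuck in uninterruptible (D) sleep and warns once they have been blocked longer than kernel.hung_task_timeout_secs (120s by default), which is why the reported durations climb in roughly two-minute steps. Two related knobs worth knowing while it's wedged (the sysrq dump needs root):

```shell
# Threshold (in seconds) for the "blocked for more than N seconds"
# warnings seen in the syslog above; 0 disables the watchdog.
cat /proc/sys/kernel/hung_task_timeout_secs

# As root, this dumps the kernel stacks of all D-state tasks to dmesg,
# showing exactly which kernel path laminard is stuck in:
#   echo w > /proc/sysrq-trigger
```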

Edit 2: lsof of laminard while deadlocked:

[screenshot: lsof output for the deadlocked laminard]

ohwgiles commented

Hmm. Laminar is single-threaded, so it cannot deadlock in the usual sense. Most likely it is stuck in a system call. It would be interesting if you could launch laminar under strace, but note that this requires patching run.cpp to call the real laminard (usually /usr/sbin/laminard) instead of /proc/self/exe.
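
Once run.cpp is patched, an invocation along these lines keeps the useful context (the flags are standard strace; the output path is arbitrary):

```shell
# Trace all system calls with microsecond timestamps (-tt), following
# forked children (-f, so job processes are covered too), writing to a
# log file (-o). The last lines before the freeze show the blocking call.
# TARGET would be /usr/sbin/laminard; it defaults to /bin/true here just
# so the sketch is runnable as-is.
TARGET=${TARGET:-/bin/true}
strace -f -tt -o /tmp/laminard.strace "$TARGET"
```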

Are you using the SD card on your Pi? They are notoriously unreliable, and if laminar's workdir is mounted there, this might be the cause of the problem: stuck block-layer operations are a typical cause of the kind of freeze that kill -9 won't get you out of.

As well as the strace test, I would suggest trying to mount laminar's home (including the sqlite database) on trustworthy external storage and seeing whether the problem is reproducible.

sbrl commented

Oops, didn't see that you'd replied here! Sorry about the wait.

I'd be happy to switch to a new binary if it were precompiled.

Nope, I'm actually using an external HDD. It passed its latest SMART test too. I store laminar data in /srv/laminar, which is a symlink to a location on the external 1TB WD PiDrive.

ohwgiles commented

laminar_1.0-10-g9b8c3762-dirty-1~upstream-debian10_armhf.zip

Here's a precompiled copy for you that can be run under strace. strace will produce lots of output, but it would be useful to see the last few lines before it locks up.

sbrl commented

Thanks! I'll keep my laminar instance under observation. After your last comment, I discovered an ext4 filesystem corruption issue on the drive I stored my laminar data on. It was very strange though, because:

  • fsck.ext4 came back clean
  • A read-only badblocks test was clean
  • Taking a copy of the disk and uploading it to another machine where I mounted it via loopback worked fine

...so I've copied the laminar directory off and mounted it onto the CI server via NFS, and I'm keeping a close eye on it. If it locks up again, I'll try tracing it with strace using the binary you've provided.

I'll close this issue for now, but if it happens again and I strace it, I'll reopen with more info.

Thanks for the help!