Infinite disk deadlock
sbrl opened this issue · 4 comments
Hello there again!
I've upgraded to a Raspberry Pi 4 (4GB RAM) after my last 3B+ died. Unfortunately, with the latest version of Laminar I'm encountering what I think is a deadlock. It happens after a number of hours, and laminard looks like this in htop:
The rest of the system is idle:
When it is in this state, any laminarc commands that talk via the UNIX socket hang indefinitely, as do any HTTP requests.
Edit: The only solution is to force-reboot the entire host it's running on.
Even a sudo kill -9 $(pgrep laminard) doesn't help.
- Laminar version: laminard version 1.0-1~upstream-debian10
- uname -a: Linux eldarion 5.10.11-v7l+ #1399 SMP Thu Jan 28 12:09:48 GMT 2021 armv7l GNU/Linux
Additional details:
$ cat /etc/systemd/system/laminar.service.d/override.conf
[Service]
PermissionsStartOnly=true
ExecStartPre=-/bin/mkdir /run/laminar
ExecStartPre=-/bin/chown laminar:laminar /run/laminar
ExecStartPost=-/bin/sh -c 'sleep 3; /bin/chmod 0775 /run/laminar/laminar.sock; echo "Permissions on socket set"; ls -l /run/laminar;'
#ExecStartPost=-/bin/sh -c 'while [ ! -S "/run/laminar/laminar.sock" ]; do sleep 1; done; /bin/chmod 0775 /run/laminar/laminar.sock'
$ lsb_release -a
No LSB modules are available.
Distributor ID: Raspbian
Description: Raspbian GNU/Linux 10 (buster)
Release: 10
Codename: buster
I'm now using Consul to monitor Laminar. The service definition looks like this:
services {
  name = "laminar"
  tags = [
    "infrastructure"
  ]
  address = "eldarion.node.mooncarrot.space"
  port = 3050
  checks {
    id = "laminar-http"
    name = "Laminar CI HTTP"
    http = "http://eldarion.node.mooncarrot.space:3050/"
    method = "GET"
    interval = "60s"
    timeout = "5s"
  }
  checks {
    id = "laminar-unix-sock"
    name = "Laminar CI Unix Socket"
    args = [ "/usr/bin/ls", "/run/laminar/laminar.sock" ],
    interval = "30s"
    timeout = "2s"
  }
}
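For context, the laminar-unix-sock check above only confirms that the socket path exists, so it keeps passing while laminard is hung. A slightly stronger probe could also try to connect. The following is a minimal sketch, not part of the original setup: the function name probe_unix_socket is hypothetical, and the exit codes assume Consul's script-check convention (0 passing, 1 warning, anything else critical). Note that a connect can still succeed while the daemon is blocked, because the kernel queues the connection in the listen backlog, so the HTTP check remains the one that actually catches the hang.

```python
import os
import socket
import stat

# Assumed Consul script-check convention: 0 = passing, other than 0/1 = critical.
PASSING, CRITICAL = 0, 2

def probe_unix_socket(path: str, timeout: float = 2.0) -> int:
    """Return PASSING if `path` is a socket that accepts a connection in time."""
    try:
        if not stat.S_ISSOCK(os.stat(path).st_mode):
            return CRITICAL  # the path exists but is not a socket
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            sock.connect(path)
    except OSError:
        # Covers a missing path, a refused connection, and a timeout.
        return CRITICAL
    return PASSING
```

A wrapper script could pass the result of probe_unix_socket("/run/laminar/laminar.sock") to sys.exit and be referenced from the check's args in place of /usr/bin/ls.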
/etc/laminar.conf
###
### LAMINAR_HOME
###
### Root location containing laminar configuration, database,
### build workspaces and archive.
###
### Default: /var/lib/laminar
###
LAMINAR_HOME=/srv/laminar
### LAMINAR_BIND_HTTP
### Interface on which laminard will bind to serve the Web UI.
### May be of the form IP:PORT, unix:PATH/TO/SOCKET or unix-abstract:NAME
### Default: *:8080
### We are reverse-proxied by magicbag
### This is ok, because ufw blocks access from anywhere that isn't over the wireguard VPN
LAMINAR_BIND_HTTP=*:3050
### LAMINAR_BIND_RPC
### Interface on which laminard will bind to accept RPC from laminarc.
### May be of the form IP:PORT, unix:PATH/TO/SOCKET or unix-abstract:NAME
### Default: unix-abstract:laminar
LAMINAR_BIND_RPC=unix:/run/laminar/laminar.sock
### LAMINAR_TITLE
### Page title to show in web frontend
LAMINAR_TITLE=SBRL CI Service - Laminar
### LAMINAR_KEEP_RUNDIRS
### Setting this prevents the immediate deletion of job rundirs
### $LAMINAR_HOME/run/$JOB/$RUN. Value should be an integer representing
### the number of rundirs to keep.
### Default: 0
#LAMINAR_KEEP_RUNDIRS=0
### LAMINAR_BASE_URL
### Base url for the frontend. This affects the <base href> tag and needs
### to be set if Laminar runs behind a reverse-proxy that hosts Laminar
### within a subfolder (rather than at a subdomain root)
#LAMINAR_BASE_URL=/
### LAMINAR_ARCHIVE_URL
### Base url used to request artifacts. Laminar can serve build
### artifacts (and it will if you leave this unset), but it
### uses a very naive and inefficient method. Best to let a real
### webserver handle serving those requests.
LAMINAR_ARCHIVE_URL=https://ci.starbeamrainbowlabs.com/archive/
Relevant entries from /var/log/syslog
Feb 23 21:13:42 eldarion consul[948]: 2021-02-23T21:13:42.103Z [WARN] agent: Check is now critical: check=laminar-http
Feb 23 21:13:42 eldarion consul[948]: 2021-02-23T21:13:42.518Z [INFO] agent: Synced check: check=laminar-http
Feb 23 21:14:47 eldarion consul[948]: 2021-02-23T21:14:47.104Z [WARN] agent: Check is now critical: check=laminar-http
Feb 23 21:15:52 eldarion consul[948]: 2021-02-23T21:15:52.107Z [WARN] agent: Check is now critical: check=laminar-http
Feb 23 21:15:56 eldarion kernel: [429713.731978] INFO: task laminard:787 blocked for more than 122 seconds.
Feb 23 21:15:56 eldarion kernel: [429713.732027] task:laminard state:D stack: 0 pid: 787 ppid: 1 flags:0x00000001
Feb 23 21:16:57 eldarion consul[948]: 2021-02-23T21:16:57.108Z [WARN] agent: Check is now critical: check=laminar-http
Feb 23 21:17:59 eldarion kernel: [429836.614718] INFO: task laminard:787 blocked for more than 245 seconds.
Feb 23 21:17:59 eldarion kernel: [429836.614766] task:laminard state:D stack: 0 pid: 787 ppid: 1 flags:0x00000001
Feb 23 21:18:02 eldarion consul[948]: 2021-02-23T21:18:02.109Z [WARN] agent: Check is now critical: check=laminar-http
Feb 23 21:19:07 eldarion consul[948]: 2021-02-23T21:19:07.111Z [WARN] agent: Check is now critical: check=laminar-http
Feb 23 21:20:01 eldarion kernel: [429959.495982] INFO: task laminard:787 blocked for more than 368 seconds.
Feb 23 21:20:01 eldarion kernel: [429959.496030] task:laminard state:D stack: 0 pid: 787 ppid: 1 flags:0x00000001
Feb 23 21:20:12 eldarion consul[948]: 2021-02-23T21:20:12.112Z [WARN] agent: Check is now critical: check=laminar-http
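The "blocked for more than N seconds" lines above come from the kernel's hung-task detector, which reports tasks that stay in uninterruptible sleep past a threshold. To pull just those entries out of a long syslog, something like the following sketch could be used. The function name find_hung_tasks is hypothetical, and the pattern assumes the message format shown in the excerpt above.

```python
import re

# Assumes the hung-task message format seen in the syslog excerpt above.
HUNG_TASK = re.compile(
    r"INFO: task (?P<comm>.+?):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)

def find_hung_tasks(lines):
    """Yield (command, pid, seconds_blocked) for each hung-task log line."""
    for line in lines:
        match = HUNG_TASK.search(line)
        if match:
            yield match["comm"], int(match["pid"]), int(match["secs"])
```

Passing an open /var/log/syslog file object as lines would list every such report, which makes it easy to see when the first one appeared.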
Edit 2: lsof of laminard while deadlocked:
Hmm. Laminar is single threaded so cannot deadlock in the usual sense. Most likely it is stuck in a system call. It would be interesting if you could launch laminar under strace, but note that that requires patching run.cpp to call the real laminard (usually /usr/sbin/laminard) instead of /proc/self/exe.
Are you using the SD card on your Pi? They are notoriously unreliable, and if laminar's workdir is mounted there this might be the cause of the problem. Block layer operations are a typical cause of the kind of freeze that kill -9 won't get you out of.
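The state:D in the kernel log above is consistent with this: a task in uninterruptible sleep generally does not act on signals, SIGKILL included, until the blocking operation returns. One way to confirm the state directly is to read it from /proc. This is a minimal, Linux-only sketch; the helper names are hypothetical.

```python
def parse_state(stat_line: str) -> str:
    """Extract the one-letter state from the contents of /proc/<pid>/stat."""
    # The command name sits in parentheses and may itself contain spaces or
    # parentheses, so split after the LAST ')' rather than on whitespace.
    _, _, rest = stat_line.rpartition(")")
    return rest.split()[0]

def process_state(pid: int) -> str:
    """Read the current state letter of `pid` (Linux only)."""
    with open(f"/proc/{pid}/stat") as stat_file:
        return parse_state(stat_file.read())

def is_uninterruptible(pid: int) -> bool:
    """True if the process is in state D (uninterruptible sleep)."""
    return process_state(pid) == "D"
```

Calling is_uninterruptible(787) on the affected host, with the PID taken from the log above, would be expected to return True while laminard is stuck.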
As well as the strace test, I would suggest mounting laminar's home (including the sqlite database) on trustworthy external storage to see whether the problem is still reproducible.
Oops, didn't see that you'd replied here! Sorry about the wait.
I'd be happy to switch to a new binary if it were precompiled.
Nope, I'm actually using an external HDD. It passed its latest SMART test too. I store laminar data in /srv/laminar, which is a symlink to a location on the external 1TB WD PiDrive.
laminar_1.0-10-g9b8c3762-dirty-1~upstream-debian10_armhf.zip
Here's a precompiled copy for you that can be run under strace. strace will produce lots of output, but it would be useful to see the last few lines before it locks up.
Thanks! I'll keep my laminar instance under observation. After your last comment, I discovered an ext4 filesystem corruption issue on the drive I stored my laminar data on. It was very strange though, because:
- fsck.ext4 came back clean
- A read-only badblocks test was clean
- Taking a copy of the disk and uploading it to another machine where I mounted it via loopback worked fine
...so I've copied the laminar directory off and mounted it onto the CI server via NFS, and I'm keeping a close eye on it. If it locks up again, I'll try stracing it with the binary you've provided.
I'll close this issue for now, but if it happens again and I strace it, I'll reopen with more info.
Thanks for the help!