CRITICAL: Hitch crashed production server because of one faulty certificate pem file
Expected Behavior
Expected Hitch to just ignore the faulty pem certificate and run happily.
Current Behavior
Mar 17 12:46:36 web2 hitch[2813]: 20220317T124636.810693 [ 2813] {core} hitch 1.6.1 starting
Mar 17 12:46:36 web2 hitch[2813]: 20220317T124636.812323 [ 2813] {core} Loading certificate pem files (11)
Mar 17 12:46:36 web2 systemd[1]: hitch.service: Main process exited, code=exited, status=1/FAILURE
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ An ExecStart= process belonging to unit hitch.service has exited.
░░
░░ The process' exit code is 'exited' and its exit status is 1.
Mar 17 12:46:36 web2 systemd[1]: hitch.service: Failed with result 'exit-code'.
Possible Solution
Just ignore the faulty pem file but keep on running with the correct ones.
Steps to Reproduce (for bugs)
Put a bogus pem file in the directory the certificates are read from.
Settings in the conf file:
pem-dir = "/lego/certificates"
pem-dir-glob = "*.pem"
Context
Very nasty; all production websites down for a while.
Your Environment
Debian; everything fairly up to date.
hitch 1.6.1 (installed with: sudo apt install hitch )
If this was fixed after version 1.6.1, we sincerely apologise for this bug report and hope Debian will bring its packages more up to date.
Thanks for making such a great piece of software,
Dennis Gaastra
We have had this issue from time to time. A partially-created or missing pem file will cause hitch to crash upon restart. Usually this is followed by a scramble to identify the offending line from the service hitch status output, comment it out of the hitch.conf, and restart hitch.
We have other servers where SSL is terminated with nginx. Running nginx -t checks the configuration files fairly robustly and reports missing or flawed files before we attempt to restart nginx.
The equivalent hitch -t only seems to check that the hitch.conf is syntactically correct. This is only part of the issue. Hitch certainly knows there is a problem when it attempts to restart. Why not offer some kind of dry-run option to prevent problems?
I wrote a small script to at least check that each file mentioned in the pem lines exists.
James D. Keeline
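#!/bin/bash
# Pre-restart check: test hitch.conf, then confirm every pem-file path exists.
HITCH=/etc/hitch/hitch.conf
ERR=0
# Let hitch validate the configuration file itself.
hitch -t --config="$HITCH" || ERR=1
# Take the quoted path from each pem-file line. Matching only pem-file skips
# pem-dir and pem-dir-glob, whose values are not regular files.
while IFS= read -r PEM
do
    if [ ! -f "$PEM" ]; then
        echo "$PEM missing"
        ERR=2
    fi
done < <(grep '^pem-file' "$HITCH" | awk -F'"' '{print $2}')
if [ "$ERR" -gt 0 ]; then
    echo "Errors found [$ERR]. Do not restart hitch."
    exit 1
else
    echo "Scan of $HITCH done. It should be OK to restart hitch."
fi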
Thanks for the script, but we really need the hitch developers to "Just ignore the faulty pem file but keep on running with the correct ones."
Apologies for taking my time in getting back to you here.
I'm sorry to say I'm struggling to reproduce this - even when trying 1.6.1. Adding bogus files to a pem-dir or adding a pem-file entry pointing at a missing file just yields Config reload failed, with the service still running on the previous config.
Any way you could come up with a reproducer?
Hi Dag, thanks for looking into this. We have
pem-dir = "/htdocs/admin/lego/certificates"
pem-dir-glob = "*.pem"
Our PEMs are typically in the following format:
-----BEGIN CERTIFICATE-----
C1...
-----END CERTIFICATE-----
-----BEGIN CERTIFICATE-----
C2...
-----END CERTIFICATE-----
-----BEGIN CERTIFICATE-----
C3...
-----END CERTIFICATE-----
-----BEGIN RSA PRIVATE KEY-----
P1
-----END RSA PRIVATE KEY-----
-----BEGIN DH PARAMETERS-----
D1
-----END DH PARAMETERS-----
-----BEGIN DH PARAMETERS-----
D2
-----END DH PARAMETERS-----
Try leaving out one or more of the sections C1-C3, P1, or D1-D2 and see what happens. I don't remember the bogus PEM in detail, but I will take note of it the next time it happens. Maybe start by leaving P1 out.
Thanks so kindly,
Dennis
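Until there is a reproducer, a structural pre-check along the lines of the PEM layout above might help catch files with a missing section. The following is only a sketch, assuming combined pem files (certificate chain plus private key in one file) as described in this comment; the function names check_pem_file and check_pem_dir are hypothetical, and the check only looks for the BEGIN markers. It does not validate the certificate or key contents, and it is not what hitch itself does when loading files.

```shell
#!/bin/bash
# Structural check only (an assumption-laden sketch, not hitch's own logic):
# does a pem file contain at least one certificate block and one private key block?
check_pem_file() {
    local pem=$1 bad=0
    if ! grep -q -e '-----BEGIN CERTIFICATE-----' "$pem"; then
        echo "$pem: no certificate block"
        bad=1
    fi
    # Matches "BEGIN PRIVATE KEY" as well as prefixed forms such as
    # "BEGIN RSA PRIVATE KEY" or "BEGIN EC PRIVATE KEY".
    if ! grep -q -E -e '-----BEGIN ([A-Z]+ )?PRIVATE KEY-----' "$pem"; then
        echo "$pem: no private key block"
        bad=1
    fi
    return $bad
}

# Run the check on every *.pem file in a directory (mirrors pem-dir-glob = "*.pem").
check_pem_dir() {
    local dir=$1 err=0 pem
    for pem in "$dir"/*.pem; do
        [ -e "$pem" ] || continue   # the glob matched nothing
        check_pem_file "$pem" || err=1
    done
    return $err
}
```

For example, check_pem_dir /htdocs/admin/lego/certificates would print one line per file that lacks a certificate or private key block and return a non-zero status if any file fails.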
Normally, I will run
hitch -t --config=/etc/hitch/hitch.conf
to check all certs before reload/restart
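That habit can be wrapped so a restart only happens when the test passes. This is a minimal sketch, not an official hitch tool: the function name safe_restart and the variables CONF, HITCH_BIN, and RESTART_CMD are hypothetical, and the restart command assumes a systemd unit named hitch. Note that earlier comments in this thread disagree on whether hitch -t catches bad pem files or only config syntax, so the test may not be a complete safeguard.

```shell
#!/bin/bash
# Hypothetical wrapper: restart hitch only if the configuration test succeeds.
CONF=${CONF:-/etc/hitch/hitch.conf}
HITCH_BIN=${HITCH_BIN:-hitch}
# Assumes a systemd unit named "hitch"; adjust for other init systems.
RESTART_CMD=${RESTART_CMD:-systemctl restart hitch}

safe_restart() {
    if "$HITCH_BIN" -t --config="$CONF"; then
        # Intentionally unquoted so the command string splits into words.
        $RESTART_CMD
    else
        echo "hitch -t failed for $CONF; not restarting" >&2
        return 1
    fi
}
```

Calling safe_restart then leaves the running service untouched whenever the test exits with a non-zero status.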