rapier1/hpn-ssh

pthreads segfault on RHEL 8.5

somewhere-or-other opened this issue · 10 comments

I'm building on a RHEL 8.5 image, and keep running into segfaults in the child process after a connection is made and authenticated. I'm not sure whether the problem is in hpn-ssh or in something that has changed with pthreads, so I thought I'd post about it here and see what happens. If I'm doing something wrong, I'm happy to take feedback.

I've encountered this problem with the master branch (as of commit ebf1fee). When I launch sshd (/usr/local/openssh-hpn/master/sbin/sshd -ddd -p 2200 -f /etc/ssh/sshd_config, in this case), it runs and waits for a connection. When I connect from another host, it gets all the way through authentication, and then the child process that it fork()ed off segfaults (backtrace below) and the connection closes.

For reference, this is on RHEL 8.5, with GCC 8.5.0, glibc-2.28-164.el8. I manually ran the configure/make/make install, with the following syntax on the configure line:

./configure --prefix=/usr/local/openssh-hpn/master --sysconfdir=/etc/ssh/ --with-default-path=/usr/local/bin:/bin:/usr/bin --with-superuser-path=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin --with-md5-passwords --with-pam --with-privsep-path=/var/empty/sshd --with-libedit --with-xauth=/usr/bin/xauth --disable-strip

When I use gdb and the core file generated to get a backtrace, here's what I find:

(gdb) bt
#0  __pthread_cancel (th=0) at pthread_cancel.c:33
#1  0x0000561e20178d77 in stop_and_join_pregen_threads (c=c@entry=0x7f77a8ae3010) at cipher-ctr-mt.c:221
#2  0x0000561e20178e8e in ssh_aes_ctr_cleanup (ctx=0x561e21baf280) at cipher-ctr-mt.c:638
#3  0x00007f77b0cee534 in EVP_CIPHER_CTX_reset () from /lib64/libcrypto.so.1.1
#4  0x00007f77b0cee64d in EVP_CIPHER_CTX_free () from /lib64/libcrypto.so.1.1
#5  0x0000561e20178767 in cipher_init (ccp=ccp@entry=0x561e21b91858, cipher=0x561e2040b400 <ciphers+160>, 
    key=0x561e21b86b70 "\301\367\">e\255\273\235\353Q\363b@{,\314\020\314\303\020\365\231\357\324\364\351\036P\274\215\n}", keylen=16, 
    iv=0x561e21bc3e30 "\302\002F\330.>", ivlen=<optimized out>, do_encrypt=1) at cipher.c:357
#6  0x0000561e2017ffb8 in ssh_set_newkeys (ssh=ssh@entry=0x561e21b96540, mode=mode@entry=1) at packet.c:914
#7  0x0000561e201808ef in ssh_packet_send2_wrapped (ssh=ssh@entry=0x561e21b96540) at packet.c:1252
#8  0x0000561e20180988 in ssh_packet_send2 (ssh=0x561e21b96540) at packet.c:1319
#9  0x0000561e2018213b in sshpkt_send (ssh=ssh@entry=0x561e21b96540) at packet.c:2741
#10 0x0000561e20197970 in kex_send_newkeys (ssh=ssh@entry=0x561e21b96540) at kex.c:460
#11 0x0000561e2019ad0c in input_kex_gen_init (type=<optimized out>, seq=<optimized out>, ssh=0x561e21b96540) at kexgen.c:337
#12 0x0000561e2018928a in ssh_dispatch_run (ssh=ssh@entry=0x561e21b96540, mode=mode@entry=1, done=done@entry=0x0) at dispatch.c:113
#13 0x0000561e20189359 in ssh_dispatch_run_fatal (ssh=ssh@entry=0x561e21b96540, mode=mode@entry=1, done=done@entry=0x0) at dispatch.c:133
#14 0x0000561e20136d1f in process_buffered_input_packets (ssh=0x561e21b96540) at serverloop.c:365
#15 server_loop2 (ssh=ssh@entry=0x561e21b96540, authctxt=authctxt@entry=0x561e21b98090) at serverloop.c:365
#16 0x0000561e2014106f in do_authenticated2 (authctxt=0x561e21b98090, ssh=0x561e21b96540) at session.c:2642
#17 do_authenticated (ssh=0x561e21b96540, authctxt=0x561e21b98090) at session.c:365
#18 0x0000561e20127ac1 in main (ac=<optimized out>, av=<optimized out>) at sshd.c:2343
(gdb)
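Frame #0 above shows pthread_cancel being called with th=0, so the local variables in the frames above it are likely to be the most useful extra detail. A minimal sketch of collecting them non-interactively follows. The function name core_backtrace and the GDB override variable are hypothetical helpers, not part of hpn-ssh, and the binary and core paths are whatever applies on the affected node.

```shell
#!/bin/sh
# Sketch: print backtraces from a core file without an interactive session.
# GDB can be overridden (for example, for a dry run); it defaults to "gdb".
core_backtrace() {
    binary="$1"   # the executable that produced the core, e.g. the sshd binary
    core="$2"     # path to the core file

    # "bt full" adds local variables to each frame of the crashing thread;
    # "thread apply all bt" lists the stacks of any other threads as well.
    "${GDB:-gdb}" -batch \
        -ex "bt full" \
        -ex "thread apply all bt" \
        "$binary" "$core"
}
```

Called as core_backtrace /usr/local/openssh-hpn/master/sbin/sshd /path/to/core, the output can be redirected to a file and attached to the issue.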

If there are further debugging steps I can take to help isolate this problem, please let me know. I may be more of a sysadmin than a developer, but I'll do my best to follow instructions.

Lloyd

Lloyd,

I just got RHEL 8.5 running on a VM. This is fresh out of the box with only the updates applied and the necessary libraries (getting libedit-devel was annoying though). I built it with the configuration you gave me. The only thing I did differently was to run autoconf before ./configure.

I wasn't able to recreate the problem. I tried a few different configurations, settings, and ciphers and everything was working as expected. Did you make any other changes?

This is an NFS-rooted image for deployment on a large HPC cluster. There have been several things that I've had to customize, but I can't think of anything in particular that would affect this. Would it make sense to compare version numbers of specific packages? I'm not sure which would be the most relevant, but I'm happy to try that.

I did do a bunch of aclocal/autoconf/automake/etc as well. Sorry I didn't document that. I guess I assumed it went without saying.

I'm re-cloning again from scratch, to see if there's anything I accidentally did in the repository that might've had an effect. I tried building based on at least 2 other git tags before using the master branch, so it's possible there was something residual. I'll get back here shortly with the result.

Hmm. Unfortunately I'm getting the same result, after using this newly-cloned copy of the repository:

git clone https://github.com/rapier1/openssh-portable.git openssh-hpn-2
cd openssh-hpn-2/
aclocal
autoheader
autoconf
./configure --prefix=/usr/local/openssh-hpn/master --sysconfdir=/etc/ssh/ --with-default-path=/usr/local/bin:/bin:/usr/bin --with-superuser-path=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin --with-md5-passwords --with-pam --with-privsep-path=/var/empty/sshd --with-libedit --with-xauth=/usr/bin/xauth --disable-strip
make
make install

It's a long shot, but could it be affected by FIPS mode? I do have fips=1 on my kernel command line. I wouldn't think it's relevant, now that OpenSSH uses OpenSSL for its crypto, but it's possible. I'd think we'd have other symptoms (e.g. kernel errors complaining about invalid/unapproved algorithms), if that were the problem.

There could certainly be others, but here are the versions of all the packages that provide any of the paths in the output of ldd. Can you think of anything else that might be worth comparing?

# for i in `ldd /usr/local/openssh-hpn/master/sbin/sshd | awk '{print $3}'`; do rpm -q --whatprovides "$i"; done | sort -u
audit-libs-3.0-0.17.20191104git1c2f876.el8.x86_64
glibc-2.28-164.el8.x86_64
libcap-ng-0.7.11-1.el8.x86_64
libxcrypt-4.1.1-6.el8.x86_64
openssl-libs-1.1.1k-5.el8_5.x86_64
pam-1.3.1-15.el8.x86_64
zlib-1.2.11-17.el8.x86_64
# 

After I rebooted the node without the fips=1 I no longer see the problem occurring. I'm able to log in normally.
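When comparing a working node with a failing one, it can help to record which FIPS state each is actually in. A minimal sketch, assuming a Linux kernel that exposes /proc/sys/crypto/fips_enabled; the function names fips_state and cmdline_has_fips are hypothetical, and the optional path arguments exist only so the functions can be tried against a test file.

```shell
#!/bin/sh
# Sketch: report whether the kernel is running in FIPS mode.
# Prints "enabled", "disabled", or "unknown" (flag file missing or unreadable).
fips_state() {
    flag_file="${1:-/proc/sys/crypto/fips_enabled}"
    if [ ! -r "$flag_file" ]; then
        echo "unknown"
        return 0
    fi
    if [ "$(cat "$flag_file")" = "1" ]; then
        echo "enabled"
    else
        echo "disabled"
    fi
}

# Sketch: succeed if fips=1 was passed on the kernel command line.
cmdline_has_fips() {
    cmdline_file="${1:-/proc/cmdline}"
    # Split on spaces and match the whole word, so that a different
    # parameter that merely ends in "fips=1" does not count.
    tr ' ' '\n' < "$cmdline_file" | grep -qx 'fips=1'
}
```

On the node booted with fips=1, fips_state would be expected to print enabled and cmdline_has_fips to succeed; after rebooting without the parameter, both should report the opposite.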

I'm going to keep testing, and see if I can figure out anything further about what's going on. For reference, this page is RH's official documentation about how to enable FIPS mode, in case you want to verify my findings.

I know that with RHEL7, which shipped OpenSSH 7.4p1, OpenSSH was included in the list of packages that had to be certified for FIPS mode compliance, but with RHEL8, which shipped OpenSSH 8.0p1, it was no longer included. I had heard that OpenSSH had started using OpenSSL libs exclusively for its crypto setup, which would explain the change between RHEL7 and RHEL8. I had assumed that would still be true with your HPN-modified code, as long as it was based on something >= OpenSSH 8.0, but perhaps that isn't a correct assumption.

I'm not suggesting that you necessarily need to fix this. I'm just trying to understand the situation and what the limitations are. Deciding to explicitly not support FIPS mode is a totally understandable response.

Lloyd

Chris,

Thank you. I can confirm with FIPS mode on, launching using the syntax below, that I can connect successfully with a non-HPN client, which I was not able to do before.

/usr/local/openssh-hpn/master/sbin/sshd -oDisableMTAES=yes -ddd -p 2200 -f /etc/ssh/sshd_config

That will probably be an acceptable workaround for my purposes for now, though I am also curious what happens with your further investigations. But I totally understand about the uncertain timeline.

Lloyd

I am getting what appears to be this same issue on RHEL8 (OpenSSL 1.1.1k, no ability to pull OpenSSL3) FIPS boxes using the latest HPNSSH 18.4.1. Here are my findings in general:

  • The DisableMTAES flag is no longer in the codebase at all. It's mentioned only in manpages. This regression(?) occurred between the tags hpn-9_2_P2 and hpn-9_3_P1.
  • I tried cherry-picking the DisableMTAES commits onto hpn-18.4.1, but there were enough conflicts that I gave up without really attempting to resolve them (though I might take a look if needed).
  • Building hpn-9_2_P2 and using DisableMTAES everywhere (client and server) did not seem to fix the issue, though we didn't test this as extensively as the newer versions.
  • FIPS clients (hpnssh binary, FIPS status confirmed via fips-mode-setup --check) are always broken: typically, I get a login prompt for the remote host, and then the client segfaults immediately upon successful auth.
  • Non-FIPS clients connecting to non-FIPS servers work just fine.
  • Non-FIPS clients connecting to FIPS servers result in successful auth and an immediate Connection closed, but no core dumps in coredumpctl, or really anything looking out of the ordinary in journalctl -xe.
  • All of these FIPS-related issues disappear with --without-openssl, but then only ED25519 key types seem to be supported: hpnssh-keygen -t rsa complains about an unknown key type. Not dynamically linking OpenSSL would be a pretty big negative for our use case, so I'd ideally like to avoid --without-openssl anyway, but if it's strictly required, we still need RSA key support to work somehow.

At this point, I'm a bit lost as to where to continue looking or how to resolve this one (as are various teammates who've been helping debug this), so I'd like to reopen this issue thread for some advice/pointers, and to try to help contribute to a fix if I can. Thanks! Below is a stack trace from GDB if it helps.

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff710db54 in pthread_cancel () from /lib64/libpthread.so.0

(gdb) bt
#0  0x00007ffff710db54 in pthread_cancel () from /lib64/libpthread.so.0
#1  0x00005555555a73d5 in stop_and_join_pregen_threads ()
#2  0x00005555555a77de in ssh_aes_ctr_cleanup ()
#3  0x00007ffff6b2d534 in EVP_CIPHER_CTX_reset () from /lib64/libcrypto.so.1.1
#4  0x00007ffff6b2d64d in EVP_CIPHER_CTX_free () from /lib64/libcrypto.so.1.1
#5  0x00005555555a6dba in cipher_init ()
#6  0x00005555555ae9b3 in ssh_set_newkeys ()
#7  0x00005555555b0d89 in ssh_packet_send2_wrapped ()
#8  0x00005555555b0e78 in ssh_packet_send2 ()
#9  0x00005555555d0a00 in kex_send_newkeys ()
#10 0x00005555555d4692 in input_kex_gen_reply ()
#11 0x00005555555b763a in ssh_dispatch_run ()
#12 0x00005555555b7709 in ssh_dispatch_run_fatal ()
#13 0x00005555555776a2 in client_loop ()
#14 0x000055555556460b in main ()
(gdb)

And for version info:

OpenSSH_9.7p1-hpn18.4.1, OpenSSL 1.1.1k  FIPS 25 Mar 2021

So I forgot about FIPS puking on the multithreaded AES. I've accepted your PR and it's moving through the process of making it into master. Have you seen any problems with the default chacha20 cipher? Just curious as that's threaded as well.