rapier1/hpn-ssh

ssh-hpn v14.16 hangs in multithreaded AES

walterdejong opened this issue · 15 comments

Built hpn-ssh v14.16 on a SLES12 system like this:

./configure  --prefix=$HOME/sw \
        --with-privsep-path=/var/lib/empty \
        --with-pam
make
make tests

The tests run for while, but hang on aes128-ctr:

...
test try ciphers: cipher rijndael-cbc@lysator.liu.se mac umac-64-etm@openssh.com
test try ciphers: cipher rijndael-cbc@lysator.liu.se mac umac-128-etm@openssh.com
test try ciphers: cipher aes128-ctr mac hmac-sha1

Killing the stuck process results the test failing, and then make tests will hang on the next test for aes128-ctr.

gdb backtrace on a stuck process looks like this:

(gdb) bt
#0  0x00007fdce71ceb7c in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fdce71cc910 in pthread_cond_broadcast@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#2  0x000055f1662b4644 in stop_and_join_pregen_threads (c=c@entry=0x7fdce41a0010) at cipher-ctr-mt.c:246
#3  0x000055f1662b4768 in ssh_aes_ctr_cleanup (ctx=0x55f167074700) at cipher-ctr-mt.c:632
#4  0x00007fdce7f71c57 in EVP_CIPHER_CTX_cleanup () from /lib64/libcrypto.so.1.0.0
#5  0x00007fdce7f71d5e in EVP_CIPHER_CTX_free () from /lib64/libcrypto.so.1.0.0
#6  0x000055f1662b41e8 in cipher_free (cc=0x55f167077770) at cipher.c:447
#7  0x000055f1662bb38a in ssh_packet_close_internal (ssh=0x55f167074be0, do_close=do_close@entry=0) at packet.c:634
#8  0x000055f1662bb4cf in ssh_packet_clear_keys (ssh=<optimized out>) at packet.c:655
#9  0x000055f16628424e in do_child (ssh=ssh@entry=0x55f167074be0, s=s@entry=0x55f167079ee0,
    command=command@entry=0x55f167077bb0 "true") at session.c:1545
#10 0x000055f166285d3c in do_exec_no_pty (ssh=0x55f167074be0, s=0x55f167079ee0, command=0x55f167077bb0 "true")
    at session.c:513
#11 0x000055f1662872d5 in do_exec (ssh=ssh@entry=0x55f167074be0, s=s@entry=0x55f167079ee0, command=<optimized out>,
    command@entry=0x55f167077bb0 "true") at session.c:756

So the origin of the bug seems to be the AES-MT patch in ssh-hpn.

System is SLES12 SP3:

$ cat /etc/os-release   
NAME="SLES"
VERSION="12-SP3"
VERSION_ID="12.3"
PRETTY_NAME="SUSE Linux Enterprise Server 12 SP3"
ID="sles"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sles:12:sp3"

Software versions that come with SLES12:

libopenssl-devel-1.0.2j-60.46.1.x86_64
gcc-4.8-6.189.x86_64

Workaround: put disableMTaes=yes in sshd_config or build without the AES-MT patch.
But obviously then you don't get any speedup with the cipher.

In SuSE all packages for doing development are in the SDK:

$ zypper pa |grep pam-devel
i+ | SLE-SDK12-SP3-Updates                  | pam-devel                                       | 1.1.8-24.6.1                                    | x86_64
v  | SLE-SDK12-SP3-Updates                  | pam-devel                                       | 1.1.8-24.3.1                                    | x86_64
v  | SLE-SDK12-SP3-Pool                     | pam-devel                                       | 1.1.8-23.1                                      | x86_64

I have now built without --with-pam, and it hangs. Some debug info:

(gdb) thread apply all bt

Thread 1 (Thread 0x7fc54ffa7700 (LWP 27186)):
#0  0x00007fc54ecc5b7c in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fc54ecc3910 in pthread_cond_broadcast@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#2  0x0000563bff6b7034 in stop_and_join_pregen_threads (c=c@entry=0x7fc54beba010) at cipher-ctr-mt.c:246
#3  0x0000563bff6b7158 in ssh_aes_ctr_cleanup (ctx=0x563bfff27580) at cipher-ctr-mt.c:632
#4  0x00007fc54fa68c57 in EVP_CIPHER_CTX_cleanup () from /lib64/libcrypto.so.1.0.0
#5  0x00007fc54fa68d5e in EVP_CIPHER_CTX_free () from /lib64/libcrypto.so.1.0.0
#6  0x0000563bff6b6bd8 in cipher_free (cc=0x563bfff27220) at cipher.c:447
#7  0x0000563bff6bdd7a in ssh_packet_close_internal (ssh=0x563bfff21bb0, do_close=do_close@entry=0) at packet.c:634
#8  0x0000563bff6bdebf in ssh_packet_clear_keys (ssh=<optimized out>) at packet.c:655
#9  0x0000563bff68a83e in do_child (ssh=ssh@entry=0x563bfff21bb0, s=s@entry=0x563bfff27e80, command=command@entry=0x563bfff218f0 "true") at session.c:1545
#10 0x0000563bff68c2bc in do_exec_no_pty (ssh=0x563bfff21bb0, s=0x563bfff27e80, command=0x563bfff218f0 "true") at session.c:513
#11 0x0000563bff68d7e5 in do_exec (ssh=ssh@entry=0x563bfff21bb0, s=s@entry=0x563bfff27e80, command=<optimized out>, command@entry=0x563bfff218f0 "true")
    at session.c:756
#12 0x0000563bff68dca7 in session_exec_req (s=0x563bfff27e80, ssh=0x563bfff21bb0) at session.c:2053
#13 session_input_channel_req (ssh=ssh@entry=0x563bfff21bb0, c=c@entry=0x563bfff25880, rtype=<optimized out>, rtype@entry=0x563bfff27030 "exec")
    at session.c:2144
#14 0x0000563bff68236e in server_input_channel_req (type=<optimized out>, seq=<optimized out>, ssh=0x563bfff21bb0) at serverloop.c:915
#15 0x0000563bff6c5914 in ssh_dispatch_run (ssh=ssh@entry=0x563bfff21bb0, mode=mode@entry=1, done=done@entry=0x0) at dispatch.c:113
#16 0x0000563bff6c597d in ssh_dispatch_run_fatal (ssh=ssh@entry=0x563bfff21bb0, mode=mode@entry=1, done=done@entry=0x0) at dispatch.c:133
#17 0x0000563bff683abf in process_buffered_input_packets (ssh=0x563bfff21bb0) at serverloop.c:371
#18 server_loop2 (ssh=ssh@entry=0x563bfff21bb0, authctxt=authctxt@entry=0x563bfff23a60) at serverloop.c:427
#19 0x0000563bff68ce8d in do_authenticated2 (authctxt=0x563bfff23a60, ssh=0x563bfff21bb0) at session.c:2563
#20 do_authenticated (ssh=0x563bfff21bb0, authctxt=0x563bfff23a60) at session.c:367
#21 0x0000563bff6773c7 in main (ac=<optimized out>, av=<optimized out>) at sshd.c:2251
(gdb) 

(gdb) up 2
#2  0x0000563bff6b7034 in stop_and_join_pregen_threads (c=c@entry=0x7fc54beba010) at cipher-ctr-mt.c:246
246                     pthread_cond_broadcast(&c->q[i].cond);
(gdb) info locals
i = 0
__FUNCTION__ = "stop_and_join_pregen_threads"
(gdb) info args
c = 0x7fc54beba010

(gdb) print cipher_threads
$1 = 2
(gdb) print numkq
$2 = 4

(gdb) print c->q[i].qstate
$3 = 4
(gdb) print c->q[0].qstate
$4 = 4
121 /* Keystream Queue state */
122 enum {
123     KQINIT,
124     KQEMPTY,
125     KQFILLING,
126     KQFULL,
127     KQDRAINING
128 };

239     /* Cancel pregen threads */
240     for (i = 0; i < cipher_threads; i++) {
241         debug ("Canceled %lu (%d,%d)", c->tid[i], c->struct_id, c->id[i]);
242         pthread_cancel(c->tid[i]);
243     }
244     for (i = 0; i < numkq; i++) {
245         pthread_mutex_lock(&c->q[i].lock);
246         pthread_cond_broadcast(&c->q[i].cond);
247         pthread_mutex_unlock(&c->q[i].lock);
248     }
249     for (i = 0; i < cipher_threads; i++) {
250         if (pthread_kill(c->tid[i], 0) != 0)
251             debug3("AES-CTR MT pthread_join failure: Invalid thread id %lu in %s", c->tid[i], __FUNCTION__);
252         else {
253             debug ("Joining %lu (%d, %d)", c->tid[i], c->struct_id, c->id[i]);
254             pthread_join(c->tid[i], NULL);
255         }
256     }

Hope this helps.

A little bit more debug info;

server side:

$ /home/walter/sw/sbin/sshd -d -D -p 8888

client side:

$ ~/sw/bin/ssh -vvv -p 8888 -c aes128-ctr localhost

Connection actually works and nothing hangs now (?).
Hit Ctrl-D to exit the shell, these debug msgs look interesting:

server side:

Received disconnect from ::1 port 52162:11: disconnected by user
debug1: AES-CTR MT main thread: 0 drains, 0 waits
debug1: Canceled 140206363002624 (1,0)
debug1: Canceled 140206354609920 (1,1)
debug1: Joining 140206363002624 (1, 0)
debug1: AES-CTR MT tid 140206363002624 - 0 fills, 0 skips, 0 waits
debug1: AES-CTR MT tid 140206354609920 - 0 fills, 0 skips, 0 waits
debug1: AES-CTR MT main thread: 0 drains, 0 waits
debug1: Canceled 140206342006528 (2,0)
debug1: AES-CTR MT tid 140206342006528 - 0 fills, 0 skips, 0 waits
debug1: Canceled 140206333613824 (2,1)
debug1: AES-CTR MT tid 140206333613824 - 0 fills, 0 skips, 0 waits
Disconnected from user walter ::1 port 52162
debug1: do_cleanup
debug1: do_cleanup

client side:

debug3: send packet: type 1
debug3: fd 1 is not O_NONBLOCK
Connection to localhost closed.
Transferred: sent 3676, received 6012 bytes, in 10.2 seconds
Bytes per second: sent 360.6, received 589.8
debug1: Exit status 0
debug1: AES-CTR MT main thread: 0 drains, 0 waits
debug1: Canceled 139834912229120 (1,0)
debug1: Canceled 139834903836416 (1,1)
debug1: Joining 139834912229120 (1, 0)
debug1: AES-CTR MT tid 139834912229120 - 0 fills, 0 skips, 0 waits
debug1: AES-CTR MT tid 139834903836416 - 0 fills, 0 skips, 0 waits
debug3: AES-CTR MT pthread_join failure: Invalid thread id 139834903836416 in stop_and_join_pregen_threads
debug1: AES-CTR MT main thread: 0 drains, 0 waits
debug1: Canceled 139834891233024 (2,0)
debug1: Canceled 139834882840320 (2,1)
debug1: Joining 139834891233024 (2, 0)
debug1: AES-CTR MT tid 139834891233024 - 0 fills, 0 skips, 0 waits
debug1: AES-CTR MT tid 139834882840320 - 0 fills, 0 skips, 0 waits
debug3: AES-CTR MT pthread_join failure: Invalid thread id 139834882840320 in stop_and_join_pregen_threads

Note the pthread_join failures.

I started over with a fresh git clone and checkout,

$ git status
HEAD detached at 30ec037c

Gives same result, same gdb trace.

Is there perhaps an easy way to test only the cipher? make tests takes quite a while to get there.

Hi Chris,

It still hangs.
IMO you should never use trylock in combination with a condition variable (?)

stop_and_join_pregen_threads() gets called by ssh_aes_ctr_cleanup().
Are you sure ssh_aes_ctr_cleanup() gets called only by the main thread, or does it get called by all threads? (and is that intended?)
Note that in gdb I traced all threads, but only one active thread is found, so there is no one to broadcast to. Apparently the other thread(s) have already exited.

Also, you cancel the thread and next there is a broadcast. Shouldn't this be the other way around?
I suppose you could also make a state to signal (tell) the thread to exit, and do without pthread_cancel() altogether.

My 2 cents,

chutz commented

It still hangs.
IMO you should never use trylock in combination with a condition variable (?)

Why? The trylock is on a mutex, not a condition variable...

stop_and_join_pregen_threads() gets called by ssh_aes_ctr_cleanup().
Are you sure ssh_aes_ctr_cleanup() gets called only by the main thread, or does it get called by all threads? (and is that intended?)

Here is a patch that should both check if it is called by other threads, and make it a noop if it is, probably more useful for testing than anything else:

--- openssh-7.9p1-hpn.orig/cipher-ctr-mt.c	2018-12-21 10:35:46.473849739 -0800
+++ openssh-7.9p1-hpn/cipher-ctr-mt.c	2018-12-21 11:04:29.411115028 -0800
@@ -146,6 +146,7 @@
 	AES_KEY		aes_ctx;
 	STATS_STRUCT(stats);
 	u_char		aes_counter[AES_BLOCK_SIZE];
+	pthread_t	main_tid;
 	pthread_t	tid[MAX_THREADS];
 	int             id[MAX_THREADS];
 	pthread_rwlock_t tid_lock;
@@ -229,6 +230,11 @@
 {
 	int i;
 
+	if (c->main_tid != pthread_self()) {
+		logit("WARNING: AES-CTR MT stop_and_join_pregen_threads called from worker thread %lu", pthread_self());
+		return;
+	}
+
 #ifdef __APPLE__
 	/* notify threads that they should exit */
 	pthread_rwlock_wrlock(&c->stop_lock);
@@ -564,6 +570,7 @@
 		}
 		c->qidx = 0;
 		c->ridx = 0;
+		c->main_tid = pthread_self();
 
 		/* Start threads */
 		for (i = 0; i < cipher_threads; i++) {

Note that in gdb I traced all threads, but only one active thread is found, so there is no one to broadcast to. Apparently the other thread(s) have already exited.

Also, you cancel the thread and next there is a broadcast. Shouldn't this be the other way around?

The broadcast is to wake up all the threads so they notice they have been cancelled, a thread has to reach a cancellation point before the cancellation happens, the broadcast is to expedite this.

I suppose you could also make a state to signal (tell) the thread to exit, and do without pthread_cancel() altogether.

There is already code in there to do exactly this, gated under #ifdef __APPLE__ since it seems that on OSX, threads sometimes fail to notice that they have been cancelled. Testing a condition variable in the main thread loop does have a performance cost, you have an extra lock to take at every iteration. Locks are expensive and pthread_cancel exists for a reason.

I have applied your patch with main_tid. failed-sshd.log shows:

...
debug2: set_newkeys: mode 1
debug1: set_newkeys: rekeying, input 2760 bytes 52 blocks, output 2696 bytes 90 blocks
debug1: AES-CTR MT spawned a thread with id 140442298963712 in ssh_aes_ctr_init (0, 0)
debug1: AES-CTR MT spawned a thread with id 140442290571008 in ssh_aes_ctr_init (1, 1)
debug1: rekey after 4294967296 blocks
debug1: SSH2_MSG_NEWKEYS sent
debug1: expecting SSH2_MSG_NEWKEYS
debug3: receive packet: type 21
debug1: SSH2_MSG_NEWKEYS received
debug2: set_newkeys: mode 0
debug1: set_newkeys: rekeying, input 2776 bytes 53 blocks, output 2696 bytes 0 blocks
debug1: AES-CTR MT spawned a thread with id 140442277967616 in ssh_aes_ctr_init (2, 0)
debug1: AES-CTR MT spawned a thread with id 140442269574912 in ssh_aes_ctr_init (2, 1)
debug1: rekey after 4294967296 blocks
debug3: receive packet: type 90
debug1: server_input_channel_open: ctype session rchan 0 win 2097152 max 32768
...
Starting session: command for walter from UNKNOWN port 65535 id 0  
debug2: fd 13 setting O_NONBLOCK
debug2: fd 12 setting O_NONBLOCK
debug2: fd 15 setting O_NONBLOCK
debug3: send packet: type 99
debug1: AES-CTR MT main thread: 0 drains, 0 waits
debug1: Canceled 140442298963712 (1,0)
debug1: Canceled 140442290571008 (1,1)
debug3: AES-CTR MT pthread_join failure: Invalid thread id 140442298963712 in stop_and_join_pregen_threads
debug3: AES-CTR MT pthread_join failure: Invalid thread id 140442290571008 in stop_and_join_pregen_threads
debug1: AES-CTR MT main thread: 0 drains, 0 waits
debug1: Canceled 140442277967616 (2,0)
debug1: Canceled 140442269574912 (2,1)
FATAL:
FAIL:

This log is not written until I end it with Ctrl-C.
Complete logs are in the attached .tar.gz

regress_logs.tar.gz

This is very old but I believe this has been fixed with the latest round of updates. These should have been backported to 7.6-8.1 inclusive