liquidprompt/liquidprompt

LP hangs in tmux detection

Farom opened this issue · 16 comments

When having different versions of tmux (old server, new client) the tmux detection can hang for an indefinite period.
This blocks all interaction with shell.

Shell: 5.2.15(1)-release
Operating system: Debian bullseye upgrade to bookworm (Kernel 5.10.0-26-amd64)
Liquid Prompt version: b1ea560 (master on 2023-11-13)

Steps to Reproduce

  1. Start a tmux session with tmux version 3.1c (from debian oldstable/bullseye)
  2. upgrade tmux to version 3.3a (from debian bookworm)
  3. All Shells running with LP hang for an indefinite period. CTRL+C without effect.

Expected Behavior

I expect that at some timeout using the shell is possible.

Current Behavior

It hangs forever. No Interaction with shell possible.

Possible Solution / Workaround

in Line

  (( _LP_ENABLE_TMUX )) && count+=$(tmux list-sessions 2> /dev/null | GREP_OPTIONS='' \grep -cv 'attached')

start grep with timeout 1

This patch works for me:

diff --git a/liquidprompt b/liquidprompt
index 8a6bfbb..3bcc299 100755
--- a/liquidprompt
+++ b/liquidprompt
@@ -2476,7 +2476,7 @@ _lp_detached_sessions() {
 
     local -i count=0
     (( _LP_ENABLE_SCREEN )) && count=$(screen -ls 2> /dev/null | GREP_OPTIONS='' \grep -c '[Dd]etach[^)]*)$')
-    (( _LP_ENABLE_TMUX )) && count+=$(tmux list-sessions 2> /dev/null | GREP_OPTIONS='' \grep -cv 'attached')
+    (( _LP_ENABLE_TMUX )) && count+=$(tmux list-sessions 2> /dev/null | GREP_OPTIONS='' \timeout 1 \grep -cv 'attached')
     lp_detached_sessions=$count
 
     (( lp_detached_sessions ))

timeout comes from coreutils.

Hopefully you still have that tmux session around. Or have a way to reproduce this with your patch intact.

What happens if you run tmux list-sessions? I think this is likely a bug in tmux, where the newer version hangs on a search where it finds an old version still running.

I do not think your patch will be possible to merge. timeout is not something liquidprompt already uses, and I do not want to add more binary dependencies that not all systems will have.

Hey @Rycieos,

we debugged the issue and found out the tmux list-sessions writes the session information to an anonymous pipe that is connected to stdin of the grep process. If the tmux list-sessions exits the pipe is somehow (we haven't looked into that) transferred to the running tmux "server" process. The pipe is listed in /proc/TMUX_SERVER_PID/fd.

The grep process first reads the session information, and then blocks on a read() in stdin. As the pipe is never closed by the tmux "server" process the grep will hang on the read().

If we use old tmux executable /proc/TMUX_SERVER_PID/exe list-sessions the grep is not hanging.

If we use old tmux executable /proc/TMUX_SERVER_PID/exe list-sessions the grep is not hanging.

This is a possibilty from inside tmux. There is TMUX environment var:

TMUX=/tmp/tmux-1044/default,68128,0

2nd field is the pid of the tmux, so /proc/68128/exe list-sessions does the trick.

but:

When you are not inside a tmux sessions, i do not sea how to find the correct executable.

Hopefully you still have that tmux session around. Or have a way to reproduce this with your patch intact.

I probably can break things for reproducing this bug :-)

What happens if you run tmux list-sessions?

tmux shows the correct output.

I think this is likely a bug in tmux, where the newer version hangs on a search where it finds an old version still running.

Unfortunately it is more complicated and i think it is not worth it to debug tmux detection or make this detection much more complicated.

I do not think your patch will be possible to merge. timeout is not something liquidprompt already uses, and I do not want to add more binary dependencies that not all systems will have.

There are bash builtin possibilities:
http://www.bashcookbook.com/bashinfo/source/bash-4.0/examples/scripts/timeout3

Maybe it is a good feature to limit the execution time of LP in general?
(there are some git situations where waiting for the prompt longer than 20 seconds is really annoying)

I do not think your patch will be possible to merge. timeout is not something liquidprompt already uses, and I do not want to add more binary dependencies that not all systems will have.

I just talked to some embedded Linux people and they told me that timeout from the coreutils package is the tool to use, especially if you are already using tools from coreutils like id and uname.

Even busybox has a timeout with all required (short) options.

I hope you overthink your decision.

we debugged the issue and found out the tmux list-sessions writes the session information to an anonymous pipe that is connected to stdin of the grep process. If the tmux list-sessions exits the pipe is somehow (we haven't looked into that) transferred to the running tmux "server" process. The pipe is listed in /proc/TMUX_SERVER_PID/fd.

I... find that hard to believe. The tmux process transfers the file handle for its stdout to the tmux server? That seems error prone. And it is causing an issue here.

I still think this is a bug in tmux then. Why is the file handle not being closed? This would hang with any process attached to the stdout, like tmux list-sessions | cat, or tmux list-sessions | less.

Could you capture the process tree once it hangs? If you are right, we should see a bash process as the parent, and only grep as a child; the tmux process would have closed.

When you are not inside a tmux sessions, i do not sea how to find the correct executable.

Yes, we need a solution that works in both situations.

What happens if you run tmux list-sessions?

tmux shows the correct output.

What happens if you run tmux list-sessions | cat?

Unfortunately it is more complicated and i think it is not worth it to debug tmux detection or make this detection much more complicated.

Unfortunately I (partially) disagree. I would prefer to see this fixed in tmux. The patch you suggested is not possible for me to merge, as I will explain below.

There are bash builtin possibilities: http://www.bashcookbook.com/bashinfo/source/bash-4.0/examples/scripts/timeout3

That linked solution will not work. The method it uses is creating a subshell that runs the command or function, and killing the subshell if it runs too long. But subshells are separate processes, and cannot set variables that will persist to the main shell. There might be some hack that would allow for returning a value on stdout from the subshell, but that would only work for single commands, not the prompt generation as a whole. And we try to avoid subshells as much as possible as they are very slow on some systems.

Maybe it is a good feature to limit the execution time of LP in general? (there are some git situations where waiting for the prompt longer than 20 seconds is really annoying)

I agree. I think preventing prompt generation from being able to lock up your shell is very important. But as it stands, I do not know of any way to timeout in Bash that will work with functions and passing state around. Neither the timeout binary nor the above linked solution would allow for variables to be set in the shell.

I just talked to some embedded Linux people and they told me that timeout from the coreutils package is the tool to use, especially if you are already using tools from coreutils like id and uname.

Even busybox has a timeout with all required (short) options.

That is interesting, and useful information. However, Liquid Prompt supports more platforms than just Linux. We also support multiple forms of BSD, MacOS, and Windows. MacOS does not have timeout (though it can be installed with brew, but we do not required brew). We can have features that are only supported by some platforms, but a fix like this needs to be portable.

I hope you overthink your decision.

I am always open to rethinking my decisions. I have changed my mind on multiple features and bug fixes in Liquid Prompt in the past.

And to be clear, I am open to adding a fix to Liquid Prompt for this issue (that I am pretty sure is a bug in tmux). The problem is only that your suggested fix is not portable.

I... find that hard to believe. The tmux process transfers the file handle for its stdout to the tmux server? That seems error prone. And it is causing an issue here.

ACK - This only happens if we're using the updated tmux client against the old tmux server.

I still think this is a bug in tmux then. Why is the file handle not being closed? This would hang with any process attached to the stdout, like tmux list-sessions | cat, or tmux list-sessions | less.

Yes, with the above scenario (old server, new client) also causes | cat or | less to hang.

That's an strace of the tmux list-sessions

...
32654 12:37:50.102867 dup(1)            = 6
32654 12:37:50.102903 close(1)          = 0
32654 12:37:50.102927 poll([{fd=3, events=POLLIN}, {fd=6, events=POLLOUT}, {fd=5, events=POLLIN|POLLOUT}], 3, -1) = 2 ([{fd=6, revents=POLLOUT}, {fd=5, revents=POLLOUT}])
32654 12:37:50.102956 sendmsg(5, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="1\1\0\0\30\0\0\0\10\0\0\0\377\377\377\377\1\0\0\0\0\0\0\0", iov_len=24}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 24
32654 12:37:50.103150 poll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN}], 2, -1) = 1 ([{fd=5, revents=POLLIN}])
32654 12:37:50.103189 getpid()          = 32654
32654 12:37:50.103212 openat(AT_FDCWD, "/proc/32654/fd", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 1
32654 12:37:50.103240 newfstatat(1, "", {st_mode=S_IFDIR|0500, st_size=0, ...}, AT_EMPTY_PATH) = 0
32654 12:37:50.103268 getdents64(1, 0x5574d0e27dc0 /* 9 entries */, 32768) = 216
32654 12:37:50.103296 getdents64(1, 0x5574d0e27dc0 /* 0 entries */, 32768) = 0
32654 12:37:50.103317 close(1)          = 0
32654 12:37:50.103339 prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=1024, rlim_max=1024*1024}) = 0
32654 12:37:50.103362 recvmsg(5, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="0\1\0\0O\0\0\0\10\0\0\0\377\377\377\377\1\0\0\0000: 9 windows"..., iov_len=65535}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 99
32654 12:37:50.103393 poll([{fd=3, events=POLLIN}, {fd=6, events=POLLOUT}, {fd=5, events=POLLIN}], 3, -1) = 1 ([{fd=6, revents=POLLOUT}])
32654 12:37:50.103419 writev(6, [{iov_base="0: 9 windows (created Sun Nov 12"..., iov_len=59}], 1) = 59
32654 12:37:50.103465 fcntl(0, F_GETFL) = 0x2 (flags O_RDWR)
32654 12:37:50.103514 fcntl(0, F_SETFL, O_RDWR) = 0
32654 12:37:50.103535 fcntl(1, F_GETFL) = -1 EBADF (Bad file descriptor)
32654 12:37:50.103579 fcntl(2, F_GETFL <unfinished ...>
32654 12:37:50.103593 <... fcntl resumed>) = 0x2 (flags O_RDWR)
32654 12:37:50.103604 fcntl(2, F_SETFL, O_RDWR) = 0
32654 12:37:50.103649 exit_group(0)     = ?
32654 12:37:50.103791 +++ exited with 0 +++
32653 12:37:50.103811 <... wait4 resumed>[{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 32654

stdout is dup()ed to 6, it writev()s to 6 and then exit.

Here the relevant part of grep. It blocks on read()....:

32655 12:37:50.097690 read(0,  <unfinished ...>                                                                                                                                                                                                                 
32655 12:37:50.103504 <... read resumed>"0: 9 windows (created Sun Nov 12"..., 98304) = 59                                                                                                                                                                      
32655 12:37:50.103587 read(0,  <unfinished ...>                                                                                                                                                                                                                 
32655 12:38:19.573170 <... read resumed>0x564accd3a03b, 98304) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)                                                                                                                                           
32655 12:38:19.573219 --- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---                                                                                                                                                                                   
32655 12:38:19.573253 read(0,                                                                             

Earlier in tmux list sessions, we see 2 sendmsg() with cmsg_type=SCM_RIGHTS, you can use to transfer file descriptors:

32654 12:37:50.101251 sendmsg(5, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="d\0\0\0\24\0\0\0\10\0\0\0\377\377\377\377\0\0\1\10", iov_len=20}, {iov_base="o\0\0\0\30\0\0\0\10\0\0\0\377\377\377\377\0\0\1\10\0\0\0\0", iov_len=24}, {iov_base="e\0\0\0\37\0\0\0\10\0\0\0\377\377\377\377xterm-256color\0", iov_len=31}, {iov_base="m\0\0\0\24\0\0\0\10\0\0\0\377\377\377\377\0\0\0\0", iov_len=20}, {iov_base="f\0\0\0\33\0\0\0\10\0\0\0\377\377\377\377/dev/pts/5\0", iov_len=27}, {iov_base="l\0\0\0\26\0\0\0\10\0\0\0\377\377\377\377/root\0", iov_len=22}, {iov_base="p\0\0\0J\0\0\0\10\0\0\0\377\377\377\377acsc=``aaffggiij"..., iov_len=74}, {iov_base="p\0\0\0\25\0\0\0\10\0\0\0\377\377\377\377am=1\0", iov_len=21}, {iov_base="p\0\0\0\25\0\0\0\10\0\0\0\377\377\377\377AX=1\0", iov_len=21}, {iov_base="p\0\0\0\26\0\0\0\10\0\0\0\377\377\377\377bce=1\0", iov_len=22}, {iov_base="p\0\0\0\26\0\0\0\10\0\0\0\377\377\377\377bel=\7\0", iov_len=22}, {iov_base="p\0\0\0\33\0\0\0\10\0\0\0\377\377\377\377blink=\33[5m\0", iov_len=27}, {iov_base="p\0\0\0\32\0\0\0\10\0\0\0\377\377\377\377bold=\33[1m\0", iov_len=26}, {iov_base="p\0\0\0\35\0\0\0\10\0\0\0\377\377\377\377civis=\33[?25l\0", iov_len=29}, {iov_base="p\0\0\0\36\0\0\0\10\0\0\0\377\377\377\377clear=\33[H\33[2J\0", iov_len=30}, {iov_base="p\0\0\0#\0\0\0\10\0\0\0\377\377\377\377cnorm=\33[?12l\33[?2"..., iov_len=35}, {iov_base="p\0\0\0\33\0\0\0\10\0\0\0\377\377\377\377colors=256\0", iov_len=27}, {iov_base="p\0\0\0\32\0\0\0\10\0\0\0\377\377\377\377Cr=\33]112\7\0", iov_len=26}, {iov_base="p\0\0\0\37\0\0\0\10\0\0\0\377\377\377\377Cs=\33]12;%p1%s\7\0", iov_len=31}, {iov_base="p\0\0\0%\0\0\0\10\0\0\0\377\377\377\377csr=\33[%i%p1%d;%p"..., iov_len=37}, {iov_base="p\0\0\0\35\0\0\0\10\0\0\0\377\377\377\377cub=\33[%p1%dD\0", iov_len=29}, {iov_base="p\0\0\0\27\0\0\0\10\0\0\0\377\377\377\377cub1=\10\0", iov_len=23}, {iov_base="p\0\0\0\35\0\0\0\10\0\0\0\377\377\377\377cud=\33[%p1%dB\0", iov_len=29}, {iov_base="p\0\0\0\27\0\0\0\10\0\0\0\377\377\377\377cud1=\n\0", iov_len=23}, {iov_base="p\0\0\0\35\0\0\0\10\0\0\0\377\377\377\377cuf=\33[%p1%dC\0", iov_len=29}, {iov_base="p\0\0\0\31\0\0\0\10\0\0\0\377\377\377\377cuf1=\33[C\0", iov_len=25}, {iov_base="p\0\0\0%\0\0\0\10\0\0\0\377\377\377\377cup=\33[%i%p1%d;%p"..., iov_len=37}, {iov_base="p\0\0\0\35\0\0\0\10\0\0\0\377\377\377\377cuu=\33[%p1%dA\0", iov_len=29}, {iov_base="p\0\0\0\31\0\0\0\10\0\0\0\377\377\377\377cuu1=\33[A\0", iov_len=25}, {iov_base="p\0\0\0 \0\0\0\10\0\0\0\377\377\377\377cvvis=\33[?12;25h\0", iov_len=32}, {iov_base="p\0\0\0\35\0\0\0\10\0\0\0\377\377\377\377dch=\33[%p1%dP\0", iov_len=29}, {iov_base="p\0\0\0\31\0\0\0\10\0\0\0\377\377\377\377dch1=\33[P\0", iov_len=25}, ...], msg_iovlen=207, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[6]}], msg_controllen=24, msg_flags=0}, 0) = 5896                                                                                                                                                                                                                                               
32654 12:37:50.101552 sendmsg(5, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="n\0\0\0\20\0\1\0\10\0\0\0\377\377\377\377", iov_len=16}], msg_iovlen=1, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[7]}], msg_controllen=24, msg_flags=0}, 0) = 16 

Unfortunately I (partially) disagree. I would prefer to see this fixed in tmux. The patch you suggested is not possible for me to merge, as I will explain below.

This particular problem cannot be fixed in tmux, as it's during a Debian update, which updates tmux from some version to another. Both versions are already released.

This particular problem cannot be fixed in tmux, as it's during a Debian update, which updates tmux from some version to another. Both versions are already released.

Sure, but a newer version of tmux could be released to prevent this lockup. I just don't know where else this could be fixed. I am open to ideas.

Surely this is not an uncommon scenario to run into since tmux sessions can be so long running.

Even after sleeping on it for a few nights, I only have one idea how to improve the situation: Disable tmux support in root shells in general or use timeout when on linux.

I only have one idea how to improve the situation: Disable tmux support in root shells in general

Wait, what do root shells have to do with this? Does this bug only appear when tmux is run as root? Are you running the tmux session as root, then hanging in a normal user shell? Or the other way around; running tmux session as a normal user, then hanging in a root shell?

I only have one idea how to improve the situation: Disable tmux support in root shells in general

Wait, what do root shells have to do with this? Does this bug only appear when tmux is run as root?

No this always happens.
But the worst thing is when the root shells no longer work in the middle of something.