cea-hpc/clustershell

Fatal Python error: could not acquire lock for <_io.BufferedReader name='<stdin>'> at interpreter shutdown, possibly due to daemon threads

brianjmurrell opened this issue · 15 comments

I'm seeing this with python3-clustershell-1.8.3-5.fc33.noarch/python3-3.9.2-1.fc33.x86_64:

$ clush -B -l vagrant -R ssh -S -w vm1,vm2,vm3,vm4,vm5,vm6,vm7,vm8,vm9 'set -ex
declare -a ftest_mounts
mapfile -t ftest_mounts < <(grep '\''added by ftest.sh'\'' /etc/fstab)
for n_mnt in "${ftest_mounts[@]}"; do
    mpnt=("${n_mnt}")
    sudo umount "${mpnt[1]}"
done
sudo sed -i -e "/added by ftest.sh/d" /etc/fstab'
...
Fatal Python error: could not acquire lock for <_io.BufferedReader name='<stdin>'> at interpreter shutdown, possibly due to daemon threads

Thread 0x00007fda5b020700 (most recent call first):
  File "/usr/local/lib/python3.6/site-packages/ClusterShell/CLI/Clush.py", line 620 in _stdin_thread_start
  File "/usr/lib64/python3.6/threading.py", line 864 in run
  File "/usr/lib64/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/usr/lib64/python3.6/threading.py", line 884 in _bootstrap

Current thread 0x00007fda67de3740 (most recent call first):
/usr/lib/daos/TESTING/ftest/ftest.sh: line 69: 10381 Aborted                 clush "${CLUSH_ARGS[@]}" -B -l "${REMOTE_ACCT:-jenkins}" -R ssh -S -w "$(IFS=','; echo "${nodes[*]}")" "$(sed -e '1,/^$/d' "$SCRIPT_LOC"/pre_clean_nodes.sh)"

I found this: https://bugs.python.org/issue26037

Any thoughts?

As a workaround, you probably won't hit this with --nostdin; that limits some usage, but quite a few commands would still work.

If the problem is that we can't be blocked in a read() when clustershell exits, then a proper fix is going to be annoying... switching to asyncio as they suggest might be the least painful solution, but it sounds like overkill to me...

I think the work-around works. But as you say, it limits what can be done.

Is this something that can/will be ultimately fixed in a way that the work-around is not needed?

This definitely needs to get fixed, yes; just not sure of how yet so it might take some time.

I've also got a machine with the same versions (python3-3.9.2-1.fc33.x86_64 / python3-clustershell-1.8.3-5.fc33.noarch) but cannot reproduce the problem. How often does this happen?
You did provide a command, but maybe there are conditions around how it's started (e.g. from a script, non-interactively?) or something else that would help narrow this down.

It is running in a CI framework (i.e. so from a script).

It was happening 100% of the time before I used the work-around.

Hi Brian,

I remember catching a strange issue when using Jenkins. The kind of file descriptor/tty Jenkins (Java?) simulates can sometimes have unexpected behavior.

Is it a problem that appeared when switching to Fedora 33, or with this specific Python version?

  • You could try sending something on stdin, like an empty file, to see if that makes a difference (compared to using --nostdin).
  • --nostdin changes the way Clush manages threads; it is actually much simpler when --nostdin is used. That could explain why the bug happens only when 2 threads are used.

Ah, thanks for the hint (weird file for jenkins), I could reproduce with a fifo:

$ mkfifo /tmp/fifo
(need to open the fifo for writing or open for reading will block)
$ cat > /tmp/fifo &
$ clush -w localhost uname -r < /tmp/fifo > /dev/null
Fatal Python error: _enter_buffered_busy: could not acquire lock for <_io.BufferedReader name='<stdin>'> at interpreter shutdown, possibly due to daemon threads
Python runtime state: finalizing (tstate=0x96b3c0)

Current thread 0x00007fa76c36d740 (most recent call first):
<no Python frame>
Aborted (core dumped)

I guess this won't be the first time fifos don't work like normal pipes..

That's great news that you were able to reproduce. Does that make it easier to create and land a PR to fix it?

A 2 node 100% reproducible fail case for me:

clush -bw 'system[5-6]' "clush -bw 'system[5-6]' echo 1"
system6: Fatal Python error: could not acquire lock for <_io.BufferedReader name=''> at interpreter shutdown, possibly due to daemon threads
system6:
system6: Thread 0x00007f2ce883f700 (most recent call first):
system6: File "/usr/lib/python3.6/site-packages/ClusterShell/CLI/Clush.py", line 620 in _stdin_thread_start
system6: File "/usr/lib64/python3.6/threading.py", line 864 in run
system6: File "/usr/lib64/python3.6/threading.py", line 916 in _bootstrap_inner
system6: File "/usr/lib64/python3.6/threading.py", line 884 in _bootstrap
system6:
system6: Current thread 0x00007f2cee7beb80 (most recent call first):
system5: Fatal Python error: could not acquire lock for <_io.BufferedReader name=''> at interpreter shutdown, possibly due to daemon threads
system5:
system5: Thread 0x00007fb4d199d700 (most recent call first):
system5: File "/usr/lib/python3.6/site-packages/ClusterShell/CLI/Clush.py", line 620 in _stdin_thread_start
system5: File "/usr/lib64/python3.6/threading.py", line 864 in run
system5: File "/usr/lib64/python3.6/threading.py", line 916 in _bootstrap_inner
system5: File "/usr/lib64/python3.6/threading.py", line 884 in _bootstrap
system5:
system5: Current thread 0x00007fb4d7956b80 (most recent call first):
clush: system[5-6] (2): exited with exit code 255
---------------
system[5-6] (2)
---------------
---------------
system[5-6] (2)
---------------
1

dnf list \*clustershell
Last metadata expiration check: 0:23:59 ago on Fri 06 Aug 2021 01:54:28 PM EDT.
Installed Packages
clustershell.noarch                          1.8.3-2.el8                   @epel
python3-clustershell.noarch                  1.8.3-2.el8                   @epel

Works 100% of the time on Centos7/ python2 config:

clush -bw 'system[1-2]' "clush -bw 'system[1-2]' echo 1"

---------------
system[1-2] (2)
---------------
---------------
system[1-2] (2)
---------------
1
clustershell.noarch          1.8.3-1.el7
python2-clustershell.noarch  1.8.3-1.el7

(Found trying more useful commands for cluster config and key exchange+text, and simplified to above to reproduce)

Any possibility of getting this fixed?

Hello,

Looks like this has been open for a long time now.

Wouldn't it be possible to fix it with an ugly try/except around the exception, just to avoid showing anything, at least until something better is found?

My 2 cents.

Seems related to Python bug https://bugs.python.org/issue42717
I don't think this is easily catchable, as it is triggered when the Python interpreter exits.
clush's stdin thread is properly set as a daemon thread and thus should exit fine when the main thread exits.
Now, the Python 3 documentation says:
Note Daemon threads are abruptly stopped at shutdown. Their resources (such as open files, database transactions, etc.) may not be released properly. If you want your threads to stop gracefully, make them non-daemonic and use a suitable signalling mechanism such as an Event.

IIUC, when the main thread exits, if the daemon thread is still writing, this bug is triggered. Any possibility to signal the daemon thread to stop when the main thread wants to exit, and only then exit?

@degremont s/writing/reading/?

The problem is that we have a thread reading stdin that's just stuck in the read() call itself... I guess we could remember its TID somewhere and send it a signal to wake it up, but that doesn't strike me as a particularly good way of solving this.
In my opinion, stdin should "just" be added to the main epoll loop and processed along with the ssh file descriptors, but that's a bit of an overhaul...

@martinetd right, "reading".

We talked about adding the read FD to the main epoll loop a long time ago. I don't remember what the outcome was or why we ended up not doing this.
MilkCheck watches stdin from epoll: https://github.com/cea-hpc/milkcheck/blob/master/lib/MilkCheck/UI/Cli.py#L591 and has a way to sync the thread at exit.

Ah, I remember now: epoll behaves differently if stdin is a tty, a regular file, or a named pipe... IIRC, if we pass a named pipe as stdin, epoll always returns immediately saying there's something to read? ...testing happens... oh, a named pipe (fifo) is actually fine; the problem is regular files: epoll_ctl(ADD) fails with "Operation not permitted" for an fd that refers to a regular file.

So we'd need to epoll it, or if it's a regular file assume it's always ready to read? That's annoying...