ploxiln/fab-classic

Load flakiness retry logic mismatches paramiko-ng

timsnyder-siv opened this issue · 1 comments

# If we get SSHExceptionError and the exception message indicates
# SSH protocol banner read failures, assume it's caused by the
# server load and try again.
#
# If we are using a gateway, we will get a ChannelException if
# connection to the downstream host fails. We should retry.
if (e.__class__ is ssh.SSHException and msg == 'Error reading SSH protocol banner') \
   or e.__class__ is ssh.ChannelException:
    if _tried_enough(tries):
        raise NetworkError(msg, e)
    continue

Specifically, the msg == 'Error reading SSH protocol banner' comparison seems to check the message too strictly.

I had a long-running @parallel Fabric task crash with this stack trace:

Exception: Error reading SSH protocol banner
Traceback (most recent call last):
  File ".conda-env/lib/python3.9/site-packages/paramiko/transport.py", line 2049, in _check_banner
    buf = self.packetizer.readline(timeout)
  File ".conda-env/lib/python3.9/site-packages/paramiko/packet.py", line 360, in readline
    buf += self._read_timeout(timeout)
  File "conda-env/lib/python3.9/site-packages/paramiko/packet.py", line 575, in _read_timeout
    raise EOFError()
EOFError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File ".conda-env/lib/python3.9/site-packages/paramiko/transport.py", line 1904, in run
    self._check_banner()
  File ".conda-env/lib/python3.9/site-packages/paramiko/transport.py", line 2053, in _check_banner
    raise SSHException(
paramiko.ssh_exception.SSHException: Error reading SSH protocol banner
Exception: Error reading SSH protocol banner[Errno 104] Connection reset by peer
Traceback (most recent call last):
  File ".conda-env/lib/python3.9/site-packages/paramiko/transport.py", line 2049, in _check_banner
    buf = self.packetizer.readline(timeout)
  File ".conda-env/lib/python3.9/site-packages/paramiko/packet.py", line 360, in readline
    buf += self._read_timeout(timeout)
  File ".conda-env/lib/python3.9/site-packages/paramiko/packet.py", line 573, in _read_timeout
    x = self.__socket.recv(128)
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/centos/src/project_data/federation_pit-1175/firesim/.conda-env/lib/python3.9/site-packages/paramiko/transport.py", line 1904, in run
    self._check_banner()
  File "/home/centos/src/project_data/federation_pit-1175/firesim/.conda-env/lib/python3.9/site-packages/paramiko/transport.py", line 2053, in _check_banner
    raise SSHException(
paramiko.ssh_exception.SSHException: Error reading SSH protocol banner[Errno 104] Connection reset by peer
Fatal error: Needed to prompt for a connection or sudo password (host: 10.2.0.5), but input would be ambiguous in parallel mode
Aborting.

I have env.connection_attempts = 10 and I only see three nested exceptions. I'm also using key-based auth. The last one:

paramiko.ssh_exception.SSHException: Error reading SSH protocol banner[Errno 104] Connection reset by peer

I'm wondering whether the SSHException message is ending up with extra text appended to it. If we changed the referenced code to check 'Error reading SSH protocol banner' in msg, it would correctly retry in this case. @ploxiln would you consider this a 'bug fix' or am I reaching too far?
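
To make the difference concrete, here is a minimal sketch comparing the current equality check with the proposed substring check. The exception classes are local stand-ins so it runs without paramiko installed, and the two should_retry_* helpers are hypothetical names for illustration, not functions in fab-classic:

```python
BANNER_ERROR = 'Error reading SSH protocol banner'


# Stand-ins for ssh.SSHException / ssh.ChannelException (assumption:
# only the class identity and str(e) matter for the retry decision).
class SSHException(Exception):
    pass


class ChannelException(SSHException):
    pass


def should_retry_strict(e):
    """Mirror of the current check: message must match exactly."""
    msg = str(e)
    return (e.__class__ is SSHException and msg == BANNER_ERROR) \
        or e.__class__ is ChannelException


def should_retry_relaxed(e):
    """Proposed check: message only has to contain the banner text."""
    msg = str(e)
    return (e.__class__ is SSHException and BANNER_ERROR in msg) \
        or e.__class__ is ChannelException


# The two messages seen in the stack trace above.
plain = SSHException('Error reading SSH protocol banner')
reset = SSHException(
    'Error reading SSH protocol banner[Errno 104] Connection reset by peer')

print(should_retry_strict(plain), should_retry_strict(reset))    # True False
print(should_retry_relaxed(plain), should_retry_relaxed(reset))  # True True
```

With the strict version the second exception falls through the retry branch, which would explain why I end up at the password prompt instead of another connection attempt.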

Looking at the paramiko-ng code in question (https://github.com/ploxiln/paramiko-ng/blob/b2322db80f55c9b07e518e555ece6284d4577cf0/paramiko/transport.py#L2039-L2041), it does seem to stringify the underlying exception and append it, so the message will only start with 'Error reading SSH protocol banner'.
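
A small sketch of why the two failures in my trace produce different messages. It assumes the message is built as the fixed prefix plus str() of the underlying exception with no separator, which is what the two tracebacks suggest; banner_error_message is a hypothetical helper, not part of paramiko-ng:

```python
def banner_error_message(underlying):
    # Assumed construction, inferred from the tracebacks:
    # fixed prefix + str(underlying exception), no separator.
    return 'Error reading SSH protocol banner' + str(underlying)


# EOFError() stringifies to '', so the message is exactly the prefix
# and the existing equality check matches.
eof = banner_error_message(EOFError())

# An OSError subclass built from (errno, strerror) stringifies to
# '[Errno 104] Connection reset by peer', so the message gets a suffix
# and the equality check no longer matches.
reset = banner_error_message(
    ConnectionResetError(104, 'Connection reset by peer'))

print(repr(eof))
print(repr(reset))
```

So the EOFError case is retried today, while the connection-reset case is not, even though both are banner read failures.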