kquick/Thespian

Logging differences based on startup network state

bryang-spindance opened this issue · 2 comments

Hi again. Sorry to keep posting new issues for you.

Environment

Tested on Debian 10 (buster) with an ARM Cortex A8 (1 core)
Also tested on an Ubuntu 20.04 LTS VM with 4 cores allocated to test if this was a threading issue.

Python Versions: 3.7.3 3.8.10, 3.9.5

Manually set the following environment variables:
export THESPLOG_FILE=./thespian.log
export THESPLOG_THRESHOLD=DEBUG

Problem

The Thespian logging system has unexpected behavior when network connectivity is lost on the host machine, specifically when network connectivity is lost in the form of the network adapter becoming disabled or disconnected. This is an issue since if WiFi connection is lost or an ethernet cable becomes unplugged, the network adapter is disabled.

There are two main scenarios that I've discovered when testing this issue. The two files below are used in both scenarios.

start.py

This is the startup script for this test. It sets up the actorsystem with logging then creates the top-level actor. It then enters a loop where it waits for a Ctrl+C (SIGINT) signal. When this signal is caught, the actor system is shut down and the program exits.

import signal
from thespian.actors import ActorSystem
import time
import top

import logging
class actorLogFilter(logging.Filter):
    def filter(self, logrecord):
        return 'actorAddress' in logrecord.__dict__

class notActorLogFilter(logging.Filter):
    def filter(self, logrecord):
        return 'actorAddress' not in logrecord.__dict__

logcfg = {
    'version': 1,
    'formatters': {
        'normal': {'format': '%(levelname)-8s %(message)s'}, 'actor': {'format': '%(levelname)-8s %(actorAddress)s => %(message)s'}
    },
    'filters': {'isActorLog': {'()': actorLogFilter}, 'notActorLog': {'()': notActorLogFilter}},
    'handlers': {
        'h1': {
            'class': 'logging.FileHandler',
            'filename': 'actor_log.log',
            'formatter': 'normal',
            'filters': ['notActorLog'],
            'level': logging.DEBUG
        },
        'h2': {
            'class': 'logging.FileHandler',
            'filename': 'actor_log.log',
            'formatter': 'actor',
            'filters': ['isActorLog'],
            'level': logging.DEBUG}
        },
    'loggers' : {'': {'handlers': ['h1', 'h2'], 'level': logging.DEBUG}
    }
}

def handler(signum: int, frame):
    ActorSystem('multiprocTCPBase').shutdown()
    exit(1)

if __name__ == "__main__":
    asys = ActorSystem('multiprocTCPBase', logDefs=logcfg)
    # Signal handling for stopping the actor system using Ctrl+C
    signal.signal(signal.SIGINT, handler)

    top_actor = asys.createActor(top.Top, globalName="top")
    asys.tell(top_actor, top.Initialize())

    while True:
        time.sleep(1)

top.py

This is the top level actor. It has two wakeup loops that we are using to figure out what is happening in the logs. The first wakeup loop is print which just prints out the number of times that it has woken up. We use this to ensure that the system is still running while network is disconnected. The second loop is network which checks if we have network connectivity every second and prints out "CONNECTED" or "DISCONNECTED" whenever the network state changes.

import logging as log
import os
import http.client as httplib
from thespian.actors import ActorTypeDispatcher

class Initialize:
    def __init__(self) -> None:
        pass

def ping(ip, timeout=0.5):
    conn = httplib.HTTPSConnection(ip, timeout=timeout)
    try:
        conn.request("HEAD", "/")
        return True
    except Exception:
        return False
    finally:
        conn.close()

def is_online(ips=None):
    ips = ["1.1.1.1", "8.8.8.8", "8.8.4.4", "4.2.2.4"] if ips is None else ips
    return any([ping(ip) for ip in ips])

class Top(ActorTypeDispatcher):
    def __init__(self, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        self.wakeup_count = 0
        self.online = False

    def receiveMsg_Initialize(self, msg, sender):
        log.debug(f"Top {self.myAddress} (P|:{os.getpid()}) received Initialize command from {sender}")
        self.wakeupAfter(timePeriod=1, payload='network')
        self.wakeupAfter(timePeriod=5, payload='print')

    def receiveMsg_WakeupMessage(self, msg, sender):
        if msg.payload == 'network':
            log.debug(f'CHECKING NETWORK STATUS')
            prev_online = self.online
            self.online = is_online()
            if self.online != prev_online:
                if self.online:
                    log.critical(f'CONNECTED!')
                else:
                    log.critical(f'DISCONNECTED!')
            self.wakeupAfter(timePeriod=1, payload='network')
            return None

        if msg.payload == 'print':
            log.debug(f'WAKEUP COUNT: {self.wakeup_count}')
            self.wakeup_count += 1
            self.wakeupAfter(timePeriod=5, payload='print')
            return None

Scenario 1: Startup Online

In this scenario, the Thespian ActorSystem is created and initialized while there is an active network connection on the host machine (connected both to the local network and able to ping out to the internet).

Log Files: actor_log, thesplog

  1. Ensure network is connected and we can ping out to the internet (like 8.8.8.8)
  2. Run python start.py
  3. After logs start coming in to actor_log.log, disable network (disconnect from wifi, unplug ethernet cable, disable network adapter, etc.)
    • At this point, we notice that actor_log.log is no longer receiving any updates while thespian.log continues showing the top actor continuing execution as expected.
  4. Enable/reconnect network whenever you want.
    • After a short time after reconnecting to the network, actor_log.log will receive all of the queued messages that it missed while the network was disconnected.

In this scenario, it seems that the ActorSystem is queueing all messages to the outward-facing logging actor and waiting for the network connection to reestablish. However this seems odd to me since I would have thought that thespian would use the localhost (127.0.0.1) for all internal operations such as allowing the logging actor to write to its output file. I suppose I don't fully understand the inner workings of the TCP/UDP base.

My main concern with this is that the thespian internal logging may continue queueing these messages during long periods of network loss and it will cause a system crash. I noticed that the longer I waited before reconnecting the network, the longer it took for the queued messages to get written to the actor_log.log file.

Scenario 2: Startup Offline

In this scenario, the Thespian ActorSystem is created and initialized while there is not an active network connection on the host machine (network cable unplugged, wifi not connected, adapter disabled, etc.).

Log Files: actor_log, thesplog

  1. Ensure network is not connected so that we cannot ping out to the internet.
  2. Run python start.py
  3. We notice that logs will start coming in to actor_log.log.
  4. Connect to network.
    • We will see a "CONNECTED" message in actor_log.log.
  5. Disable/disconnect network connection.
    • We will see a "DISCONNECTED" message in actor_log.log.

This scenario is how I was expecting the ActorSystem's logging to operate in the event of network loss (such as in scenario 1's case).

Additional Notes

With multiprocUDPBase, Scenario 1 just errored out when I lost network connectivity. actor_log, thesplog

With multiprocUDPBase, Scenario 2 operated the same as with multiprocTCPBase.

Gists

Here is a Gist containing the log files that were mentioned in this problem: https://gist.github.com/bryang-spindance/9bfbe442f04aa588f5af20305c187d61

Thank you for posting issues: they all help to improve Thespian in some way!

You've encountered a bit of a tricky support contortions in the networking support: namely that each Actor should self-identify with an address that is reachable by other actors. For example, if an actor simply identified itself as being at ("127.0.0.1", 28429) (aka. port 28429 on localhost) then it couldn't send that address to an actor running on another system because that remote actor couldn't use that address to send messages back. This is further complicated by the common situation of having multiple network connections available on a system, each with its own address range. Because of that, there is some special code here (https://github.com/kquick/Thespian/blob/master/thespian/system/transport/IPBase.py#L9-L251) that attempts to determine the best address for an Actor to self-identify with. One of the main elements used by this code is the Convention Leeader's specified address: the assumption is that the Convention Leader will indicate which network the Actor Convention should exist on and that an Actor should identify itself with an address on that network.

Note that Thespian currently does not support an Actor Convention that spans multiple networks; I'm not sure it ever will (a VPN would probably be a better approach to that scenario anyhow). This isn't your situation, but I thought I should mention this limitation while on the subject.

In your case, you aren't using a Convention, so you are reasonably expecting the use of localhost. However, the Thespian code above was written to support eventual conventions, so in the case where there is no Convention Leader specified, it will default to using Google's DNS server (at 8.8.8.8) to identify the network most likely to provide general access to the current system. This is (obviously) why shutting down the network on your system causes logging output to be suspended: the network providing the address routing is down.

Also, in case I haven't adequately explained this elsewhere, here's how logging works:

       Actors ----------------> Admin --------------> Logger  [ ----------> Remote Logger ]

Each of these nodes uses the same transport functionality for communication. You hadn't noticed the effects yet (primarily because it's for internal support) in your network disconnect situation, but your TopActor also lost communication to the Admin when your network went down. You would also have lost connectivity between Actors themselves, had you been running multiple Actors.

When there is a delivery problem, the effects depend on which system base is used:

  • multiprocUDPBase; UDP has no concept of delivery verification, thus it has no way of confirming delivery or not. The network stack will eventually simply drop the undeliverable UDP message, but the sendmsg() call has already completed, so there's no way to indicate that failure to the application.

  • multiprocTCPBase: this base uses TCP sockets, which do have delivery information. It uses asynchronous processing for efficiency, so this means that Thespian and the socket library are able to detect delivery failures (network connectivity problems) and retry transmission.

    On the Thespian side of this, that means that outbound messages get queued internally. Once the queue reaches a certain threshold, no new messages will be accepted (the receiveMessage will never be invoked); this causes back-propagation of flow control, eventually blocking all Actors from running as their queues get full. Once connectivity is restored, the queues will drain and normal operation will resume. This prevents Thespian from simply consuming all available memory and causing a system crash (one of your concerns from Scenario 1).

I believe the above correctly describes the effects of your testing in the different scenarios. Let me know if I've overlooked something somewhere though.

Solution

In your case, the solution is to indicate to Thespian which network you'd like it to use by specifying the Convention Admin address (even though you won't be running a Convention); see https://thespianpy.com/doc/using.html#hH-9d33a877-b4f0-4012-9510-442d81b0837c. Something like:

if __name__ == "__main__":
    asys = ActorSystem('multiprocTCPBase', logDefs=logcfg, capabilities={'Convention Admin.IPv4': '127.0.0.1:1900'})

This should be enough to inform Thespian that it should use localhost for the addresses of Actors (and the logger).

Let me know if this resolves your issues and questions. I'm also going to add some better description to the Thespian documentation to provide a better explanation.

Kevin,

I tried your solution but ended up with the same result as before. I also tested a slightly different setup after testing the change you suggested.

The results:

  • network connected -> start actor system -> network disconnected -> logging freezes whenever network disconnected

  • network disconnected -> start actor system -> logging works fine no matter how many times i plug/unplug the network cable

Here is my current setup that I used as an additional test. I have a Raspberry Pi with the following network adapters:

Network A: 192.168.1.1/24

  • Physical network adapter.
  • Plugged in to local network and connected to internet.

Network B: 192.168.2.1/24

  • Virtual network adapter used to provide a connection to the Pi via USB
  • Not used for anything other than SSH access.
  • Doesn't change if network A is unplugged/disabled.

To test network connection, I am simply unplugging the cable going to Network Card A.
The ActorSystem is set up with the following capabilities:

{'Convention Admin.IPv4': '192.168.2.1:1900'}

I set up the convention address as the address of Network Card B. I did this because unplugging card A's network cable should have no effect on card B. However, even with this setup, unplugging card A from the network still freezes all logging in my child actor. This almost seems like the specified convention admin IP isn't being recognized but there is no difference in the thesplog.log file. I even tried manually updating the capabilities using the following code to see if anything changed in the thesplog but I didn't see any difference in the logs aside from the lines in the logs below the code.

asys = ActorSystem('multiprocTCPBase', logDefs=logcfg, capabilities={'Convention Admin.IPv4': '192.168.2.1:1900'})
asys.updateCapability('Convention Admin.IPv4', '192.168.2.1')
asys.updateCapability('Admin Port', 1901)
2022-03-22 20:59:30.789706 p19657 dbg  actualTransmit of TransportIntent(ActorAddr-(T|:1900)-pending-ExpiresIn_0:04:59.998861-<class 'thespian.system.messages.admin.CapabilityUpdate'>-CapUpdate(Convention Admin.IPv4 = 192.168.2.1)-quit_0:04:59.998548)
2022-03-22 20:59:30.811901 p19677 dbg  Admin of ReceiveEnvelope(from: ActorAddr-(T|:32871), <class 'thespian.system.messages.admin.CapabilityUpdate'> msg: CapUpdate(Convention Admin.IPv4 = 192.168.2.1))
2022-03-22 20:59:30.817491 p19657 dbg  actualTransmit of TransportIntent(ActorAddr-(T|:1900)-pending-ExpiresIn_0:04:59.999137-<class 'thespian.system.messages.admin.CapabilityUpdate'>-CapUpdate(Admin Port = 1901)-quit_0:04:59.998844)
2022-03-22 20:59:30.830468 p19677 dbg  Admin of ReceiveEnvelope(from: ActorAddr-(T|:32871), <class 'thespian.system.messages.admin.CapabilityUpdate'> msg: CapUpdate(Admin Port = 1901))
2022-03-22 20:59:30.871168 p19657 dbg  actualTransmit of TransportIntent(ActorAddr-(T|:1900)-pending-ExpiresIn_0:04:59.982420-<class 'thespian.system.messages.admin.PendingActor'>-PendingActor#1_of_Noneis"top"-quit_0:04:59.982094)
2022-03-22 20:59:30.895807 p19677 dbg  Admin of ReceiveEnvelope(from: ActorAddr-(T|:38407), <class 'thespian.system.messages.admin.PendingActor'> msg: PendingActor#1_of_Noneis"top")
2022-03-22 20:59:30.906039 p19677 I    Pending Actor request received for top.Top reqs None from ActorAddr-(T|:38407)
2022-03-22 20:59:30.998343 p19680 I    Starting Actor top.Top at ActorAddr-(T|:38421) (parent ActorAddr-(T|:1900), admin ActorAddr-(T|:1900), srcHash None)

I was expecting the parent (admin) address in the last line of the log to show ActorAddr-(T|:1901) instead of 1900. I was testing port 1901 in this case just because that port actually shows up in the thesplog so I was hoping that making changes to the capabilities would be easy to see in the logs.

However... after initializing the Top actor, I ran the following lines

top_status = asys.ask(top_actor, Thespian_StatusReq(), 10)
res = asys.ask(top_status.adminAddress, Thespian_StatusReq(), 10)
print(res.capabilities)

This printed out the capabilities containing the correct IP and Admin Port values.