tango-controls/pytango

How to connect DeviceProxy to device launched via MultiDeviceTestContext

DrewDevereux opened this issue · 10 comments

Tests that use MultiDeviceTestContext will sometimes pass, sometimes error out with a message objecting that environment variable TANGO_HOST is not defined.

If you set that environment variable, MultiDeviceTestContext will try to connect to a database at that location. And if you stand up a database there, it will talk to that database!

I can see in the code that the tango database is supposed to be mocked out with a text file. It seems that that mocking isn't always working.

I haven't found a way to fix the problem, but I can make the problem happen every time by putting a time.sleep(5) at the top of the post init callback. This makes me think that there is some kind of race condition happening.

The issue occurs regardless of whether your asynchronous context is threads or processes, but it is worse for threads.

@DrewDevereux Thanks for the bug report. I think the TANGO_HOST should only be needed if a DeviceProxy (or similar) object is instantiated with a Tango name that doesn't include #dbase=no. If you could post some example code or a link to yours, that would help with the investigation.

See for example the four tests in https://gitlab.com/ska-telescope/lfaa-lmc-prototype/-/blob/master/tests/test_integration.py, all currently skipped. If run, they might succeed, but far more likely they will error out with something like the following. It is the unpredictability that makes me think it is a race condition -- for example I have had tests passing locally, so I push, and then they all fail in CI.

self = <tests.test_integration.TestMccsIntegration object at 0x7f6c054454a8>

    def test_master_enable_subarray(self):
        """
        Test that a MccsMaster device can enable an MccsSubarray device.
        """
        devices_info = [
            {
                "class": MccsMaster,
                "devices": (
                    {
                        "name": "low/elt/master",
                        "properties": {
                            "MccsSubarrays": ["low/elt/subarray_1",
                                              "low/elt/subarray_2"],
                            "MccsStations": ["low/elt/station_1"],
                            "MccsTiles": ["low/elt/tile_47"],
                        }
                    },
                )
            },
            {
                "class": MccsSubarray,
                "devices": [
                    {
                        "name": "low/elt/subarray_1",
                        "properties": {
                        }
                    },
                    {
                        "name": "low/elt/subarray_2",
                        "properties": {
                        }
                    },
                ]
            }
        ]
    
        with MultiDeviceTestContext(devices_info) as context:
            master = context.get_device("low/elt/master")
            subarray_1 = context.get_device("low/elt/subarray_1")
            subarray_2 = context.get_device("low/elt/subarray_2")
    
            # check both subarrays are disabled
            assert subarray_1.adminMode == AdminMode.OFFLINE
            assert subarray_1.State() == DevState.DISABLE
    
            assert subarray_2.adminMode == AdminMode.OFFLINE
            assert subarray_2.State() == DevState.DISABLE
    
            # enable subarray 1
>           master.EnableSubarray(1)

tests/test_integration.py:66: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.tox/py37/lib/python3.7/site-packages/tango/device_proxy.py:244: in f
    return dp.command_inout(name, *args, **kwds)
.tox/py37/lib/python3.7/site-packages/tango/green.py:195: in greener
    return executor.run(fn, args, kwargs, wait=wait, timeout=timeout)
.tox/py37/lib/python3.7/site-packages/tango/green.py:109: in run
    return fn(*args, **kwargs)
.tox/py37/lib/python3.7/site-packages/tango/connection.py:108: in __Connection__command_inout
    r = Connection.command_inout_raw(self, name, *args, **kwds)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = Device(low/elt/master), cmd_name = 'EnableSubarray', cmd_param = 1

    def __Connection__command_inout_raw(self, cmd_name, cmd_param=None):
        """
        command_inout_raw( self, cmd_name, cmd_param=None) -> DeviceData
    
                Execute a command on a device.
    
            Parameters :
                    - cmd_name  : (str) Command name.
                    - cmd_param : (any) It should be a value of the type expected by the
                                  command or a DeviceData object with this value inserted.
                                  It can be ommited if the command should not get any argument.
            Return     : A DeviceData object.
    
            Throws     : ConnectionFailed, CommunicationFailed, DeviceUnlocked, DevFailed from device
        """
        param = __get_command_inout_param(self, cmd_name, cmd_param)
>       return self.__command_inout(cmd_name, param)
E       PyTango.DevFailed: DevFailed[
E       DevError[
E           desc = TANGO_HOST env. variable not set, set it and retry (e.g. TANGO_HOST=<host>:<port>)
E         origin = Database::Database
E         reason = API_TangoHostNotSet
E       severity = ERR]
E       
E       DevError[
E           desc = Failed to execute command_inout on device low/elt/master, command EnableSubarray
E         origin = Connection::command_inout()
E         reason = API_CommandFailed
E       severity = ERR]
E       ]

Thanks, I think I see what is happening. The code that fails is instantiating a DeviceProxy, but the FQDN it is using is the short form, low/elt/subarray_1. If we use the short form, then the DeviceProxy has to connect to a Tango database to find the hostname and port connection details for the device. Similarly, an FQDN like tango://mydatabase:10000/low/elt/subarray_1 will connect to the Tango database first.

The DeviceProxy objects are not aware that we are running some devices via the MultiDeviceTestContext. To connect without a database, we need an FQDN like tango://172.17.0.3:48493/low/elt/subarray_1#dbase=no. In this case we are providing the IP (or hostname) and port for the device itself.

We need to modify the FQDNs that the device under test will use. Conveniently your code gets them from device properties, so we can modify them. However, we have to know what they are before starting the device, since the device only get the properties during init_device.

The MultiDeviceTestContext has a method that builds up the nodb device names we would need:

def get_device_access(self, device_name):
"""Return the full device name."""
form = 'tango://{0}:{1}/{2}#{3}'
return form.format(self.host, self.port, device_name, self.nodb)

but we have a "chicken-and-egg" problem. We need an instance of the test context to use the method it, but we need to provide the properties before creating an instance. Thus we end up re-implementing it. We also need to know the hostname/IP and port before instantiating. By default, the test context figures that out and Tango allocates a free port automatically. However, we can provide these details as kwargs.

It will be clearer with example code:

import socket
import tango.test_context


# taken from https://github.com/tango-controls/pytango/blob/df925b55/tests/test_event.py#L30
def get_open_port():
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("", 0))
    s.listen(1)
    port = s.getsockname()[1]
    s.close()
    return port


def test_example():
    host = tango.test_context.get_host_ip()
    port = get_open_port()

    def get_nodb_device_access(device_name):
        form = 'tango://{0}:{1}/{2}#dbase=no'
        return form.format(host, port, device_name)

    devices_info = [
        ...
                    "name": "low/elt/master",
                    "properties": {
                        "MccsSubarrays": [get_nodb_fqdn("low/elt/subarray_1"),
                                          get_nodb_fqdn("low/elt/subarray_2")],
        ...
    ]

    with MultiDeviceTestContext(devices_info, host=host, port=port) as context:
        master = context.get_device("low/elt/master")
        ...

Flakiness / race condition

As to why it appears flaky, my only guess it that in you local environment you have the TANGO_HOST variable defined, and sometimes you have a databaseds and a real subarray device running. The test connects to the real device instead of the one launched as by MultiDeviceContext.

Alternative testing strategy

We could also mock out the DeviceProxy objects and not launch all the dependent devices, which is the approach used here:
https://gitlab.com/ska-telescope/tmc-prototype/-/blob/ceedf59b/tmcprototype/centralnode/test/unit/central_node_test.py

@ajoubertza Interesting!

Another alternative would be to support #dbase=no in the TANGO_HOST variable.

For instance, TANGO_HOST=172.17.0.3:48493#dbase=no would resolve low/elt/subarray_1 to tango://172.17.0.3:48493/low/elt/subarray_1#dbase=no. Then the DeviceTestContext could set the proper tango host through os.environ["TANGO_HOST"] for the duration of the test context.

Also note that pytango might "easily" support this before cpptango using the following check in the __DeviceProxy__init__ method:

tango_host = os.environ.get("TANGO_HOST", "")
tango_host, suffix = tango_host.split("#") if "#" in tango_host else (tango_host, "")
if tango_host and suffix and not is_resolved(device_name): 
    device_name = resolve(device_name, tango_host) + "#" + suffix

Thanks @vxgmichel - that's a cool idea! It would make testing a lot simpler.

What would the is_resolved and resolve methods do? I'm guessing resolve is just sanitising user input to build up the FQDN reliably. I'm not sure why we would need an is_resolved check.

Thanks very much @ajoubertza for this code. I confirm that I can get my tests to pass this way. I am uncertain whether this is intended as a workaround until the code is "fixed", or instructions for how to correctly use MultiDeviceTestContext. If the latter, I think an update to the class autodoc is needed, because the autodoc example makes it look very much more straightforward than this.

Re "flakiness", my experience was not just that the tests were sometimes passing, sometimes failing. There were also times when tests 1, 2 and 4 would pass but 3 would fail, for example. And putting a sleep into the post-init callback would cause all four to fail. So I feel there is still a mystery there... but maybe it will have to remain a mystery for now.

Thanks also for the link to a mocking example. I want to treat the example I gave as a lightweight integration test, so no mocking there. But mocking will come in very useful for unit testing.

@ajoubertza

What would the is_resolved and resolve methods do? I'm guessing resolve is just sanitising user input to build up the FQDN reliably. I'm not sure why we would need an is_resolved check.

This is roughly what I had in mind:

def is_resolved(device_name):
    return device_name.startswith("tango://")

def resolve(device_name, tango_host):
    return f"tango://{tango_host}/{device_name}"

Although I suspect the actual implementation is going to be a bit trickier than that :)

Things got very messy for me trying to implement this solution into a system where device A passes device B the FQDNs of devices that device B should be talking to... until I realised that I only need the long form (so far) when creating a DeviceProxy.

So I created a pytest fixture in conftest.py that monkey-patches tango.DeviceProxy:

class MyDeviceProxy(tango.DeviceProxy):
    def _get_open_port():
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.bind(("", 0))
        s.listen(1)
        port = s.getsockname()[1]
        s.close()
        return port

    HOST = get_host_ip()
    PORT = _get_open_port()

    def _get_nodb_fqdn(self, device_name):
        form = 'tango://{0}:{1}/{2}#dbase=no'
        return form.format(self.HOST, self.PORT, device_name)

    def __init__(self, device_name, *args, **kwargs):
        super().__init__(self._get_nodb_fqdn(device_name), *args, **kwargs)

tango.DeviceProxy = MyDeviceProxy  # monkey-patch

@pytest.fixture(scope="class")
def tango_multidevice_context():
    """
    Creates and returns a TANGO MultiDeviceTestContext object.
    """

    devices_info = [
        {
            "class": MccsMaster,
            "devices": (
                {
                    "name": "low/elt/master",
                    "properties": {
                        "MccsSubarrays": ["low/elt/subarray_1",
                                          "low/elt/subarray_2"],
                        "MccsStations": ["low/elt/station_1",
                                         "low/elt/station_2"],
                        "MccsTiles": ["low/elt/tile_47",
                                      "low/elt/tile_129"],
                    }
                },
            )
        },
        {
            "class": MccsSubarray,
            "devices": [
                {
                    "name": "low/elt/subarray_1",
                    "properties": {
                    }
                },
                {
                    "name": "low/elt/subarray_2",
                    "properties": {
                    }
                }
            ]
        },
        {
            "class": MccsStation,
            "devices": [
                {
                    "name": "low/elt/station_1",
                    "properties": {
                    }
                },
                {
                    "name": "low/elt/station_2",
                    "properties": {
                    }
                },
            ]
        },
        {
            "class": MccsTile,
            "devices": [
                {
                    "name": "low/elt/tile_47",
                    "properties": {
                    }
                },
                {
                    "name": "low/elt/tile_129",
                    "properties": {
                    }
                },
            ]
        },
    ]

    with MultiDeviceTestContext(devices_info, host=MyDeviceProxy.HOST, port=MyDeviceProxy.PORT) as context:
        yield context

and now my tests are lovely and clean again!

I wanted to share this with you in case it leads to another solution.

Thanks for the further updates @DrewDevereux. For now, it is a workaround, but yes, the docstring can be improved. Your mocking approach is nice. You could submit something like that as an example in this repo (see the examples folder). That would make it even clearer than we can do in a docstring.

In future, I'd would like it to work automatically, which is where @vxgmichel's idea comes in handy. Then you wouldn't need the mock.

Hi @ajoubertza. As suggested, I have made a pull request with a MultiDeviceTestContext pytest fixture, and an example of its use in testing.