ros/ros_comm

xmlrpc.XmlRpcNode consumes 100%+ cpu after being shutdown

drjsmith opened this issue ยท 7 comments

TLDR: calling shutdown() on a XmlRpcNode does not fully shut it down and leaves a thread running.
I am currently on Noetic/Ubuntu 20.04/python 3.8.

What is going on:

  • Calling start() results in a new thread running self.server.serve_forever(), where server is of type ThreadingXMLRPCServer which has a base class of socketserver.BaseServer.
  • That function, defined for BaseServer in socketserver.py:215, will not exit until the loop's condition is satisfied: while not self.__shutdown_request:
  • self.__shutdown_request can be set by calling shutdown() on BaseServer
  • However, XmlRpcNode's shutdown() method never calls self.server.shutdown(), only server.socket.close() and server.server_close(). I'm pretty sure those last 2 functions do the same thing, since:
class TCPServer(BaseServer):
    ...
    def server_close(self):
        self.socket.close()

The thread is then stuck in an infinite loop. Before shutting down XmlRpcNode, the thread not use a significant amount of cpu; afterwards, however, the thread consumes 100% of a single core (as shown by top).

It seems to me that the solution may be as simple as adding self.server.shutdown() to XmlRpcNode's shutdown() function, though I'm not certain whether it should be before or after closing the socket or if there are other considerations.

Why this matters to me:
I'm guessing this isn't a common problem, since generally the process would be ending shortly afterwards. However, I am using roslaunch.parent.ROSLaunchParent to launch and shutdown nodes as part of a python-based robot navigation benchmarking codebase. I wrote the code and used it extensively on Kinetic/Ubuntu 16.04/python 2.7 and did not encounter this issue at that time: starting and stopping hundreds of instances of roslaunch.parent.ROSLaunchParent did not cause any noticeable issues. As soon as I switched to Noetic/Ubuntu 20.04/python 3.8, I noticed that after running a few experiments the python process was at times using ~166% cpu. Usage hits 100% if there is only 1 'stopped' instance; ~166% if there are 2 or more.
The cpu usage of the process drops very low as soon as I start a new instance; I'm not familiar enough with the Selector-related code to understand why this is so.

Minimum example:
The script starts a roscore and then repeatedly creates, starts, and stops roslaunch.parent.ROSLaunchParent instances; run top to see how the process' cpu usage changes.

import os
import time
import roslaunch

class RoslaunchTest(object):

    def __init__(self, is_core=True):
        os.environ["ROS_MASTER_URI"]="http://localhost:11312"
        self.is_core = is_core

    def start(self):
        print("Starting...")
        uuid = roslaunch.rlutil.get_or_generate_uuid(None, not self.is_core)

        self.roslaunch_object = roslaunch.parent.ROSLaunchParent(
            run_id=uuid, roslaunch_files=[],
            is_core=self.is_core, port=11312
        )
        self.roslaunch_object.start()
        print("Started!")

    def stop(self):
        print("Stopping...")
        self.roslaunch_object.shutdown()
        print("Stopped!")

def demo():
    core = RoslaunchTest(is_core=True)
    core.start()

    test = RoslaunchTest(is_core=False)
    def cycle():
        test.start()
        time.sleep(5)
        test.stop()
        time.sleep(10)

    for _ in range(4):
        cycle()

if __name__ == '__main__':
    demo()

I've noticed the same issues on the ROS Melodic/Ubuntu 18 but only with the Python 3.6.9. On the other hand this is not the case with the Python 2.7.

Same for me. Unusable at the moment. We want to switch between two driving modes with different set of nodes. we want to avoid running all at the time when it is not necessary. for that we have a launch_handler with service which should start the corresponding launch file. when killing one each time a new instance of my handler appears in my process list, which generates 100% cpu load. for each on/off it generates a new one.

@TobiMiller @fgrcar Here is how I worked around the problem:

#Class that fixes the problem
class RoslaunchShutdownWrapper(roslaunch.parent.ROSLaunchParent):

    def shutdown(self):
        server = self.server.server if self.server is not None and self.server.server is not None else None
        super(RoslaunchShutdownWrapper, self).shutdown()
        if server:
            server.shutdown()

    def __del__(self):
        print("Shutting down now!!!!!!")
        self.shutdown()

#How to use it:
roslaunch_object = RoslaunchShutdownWrapper(
                                run_id=..., roslaunch_files=...,
                                is_core=..., port=...,
                                sigterm_timeout=...
                            )
roslaunch_object.start()

Let me know if it works for you.

@drjsmith Thanks so much for posting what helped for you !

I'm surprised this hasn't been addressed sooner, it's really a pain. I made a QT user Interface using the roslaunch API basically to launch and shutdown nodes of our stack. But I'm seeing the same CPU error you mentioned as I shutdown the nodes, which makes our Qt interface extremely slow and impossible to use as some nodes are being shut.

Your proposition seems to work for me so far though! Based on it, I wrote a quick class that the Qt interface calls to launch and shutdown nodes :

class RoslaunchWrapperObject(roslaunch.parent.ROSLaunchParent):

    def start(self):
        super(RoslaunchWrapperObject, self).start()
        print(self.server)

    def stop(self):
        print("Stopping...")
        print(self.server)
        server = self.server.server if self.server is not None and self.server.server is not None else None
        super(RoslaunchWrapperObject, self).shutdown()
        if server:
           server.shutdown()

To launch a node from the QT :

def launch(parameter):
    uuid = roslaunch.rlutil.get_or_generate_uuid(None, False)
    roslaunch.configure_logging(uuid)
    cli_args = parameter
    roslaunch_args = cli_args[1:]
    roslaunch_file = [(roslaunch.rlutil.resolve_launch_arguments(cli_args)[0], roslaunch_args)]
    launcher = RoslaunchWrapperObject(run_id = uuid, roslaunch_files = roslaunch_file)
    return launcher

and the calls are made like this :

cli_args = ['PATH/online.launch','simulation:=true']
        self.launch = launch(cli_args)
        self.launch.start() # to start it 
        self.launch.stop() # to stop it 

Thank you so much for this, it really helped ! I do hope they address the issue and find a proper way to fix it so we can reuse the API just like on previous ROS versions.

@RodolpheCyber You're welcome!

is this issue still present on melodic?

@adipotnis I discovered the problem when upgrading from kinetic to noetic, so I haven't personally tested it on melodic. @fgrcar reported encountering the problem on melodic but only when using python 3. I wouldn't be surprised if the root bug has been present for many versions and has only been exposed due to the transition to python 3, though I haven't verified this.