ros-visualization/rqt_rviz

RViz within rqt environment crashing

dirk-thomas opened this issue · 15 comments

From @sromberg on May 9, 2014 8:14

Commit #95 very likely causes RViz to crash whenever it is used from the rqt environment. However, RViz does not crash when run as standalone application. This has happend to my experience when drawing OccupancyGrid maps and when drawing Markers as path. Both plugins seem to use ogre::Texture::createInternalResources internally which in the end cause it to crash.

While the application eventually causes the graphics driver to crash, it is not related to a certain GPU driver, version or vendor. I experienced this issue on different machines with drivers from various vendors and e.g. in VirtualBox as well. Instead, this issue seems very much related to threading and the wrong thread accessing the resources that have been/are being created within the Ogre engine.

Summary:
rqt tags/0.2.13 works
rqt tags/0.2.14 crashes

See #95 for a detailed description.

Copied from original issue: ros-visualization/rqt#96

As of #95 rqt runs an async spinner for C++ plugins. This is necessary in order to have a dedicated thread dealing with incoming messages for the nodelets.

rviz is not designed to be run as a nodelet in the first place therefore I assume it is running its own spinner. With two different spinners running the context / thread in which objects are created / used is not deterministic anymore which will likely be the problem.

Since the recent change is necessary for rqt C++ plugins I think the only possible solution will be to enhance the rqt_rviz plugin to prevent the embedded rviz instance from creating its own spinner.

From @jpardeiro on January 6, 2016 9:16

I tried the last commit but this is not fixing the issue for me. Any other update regarding to this problem?

No updates on this. The referenced commit is not related to this problem as far as I can see.

From @mikepurvis on July 20, 2016 17:45

What is the rationale for not reverting #95, given the severity of this regression? We're pinned to rqt 0.2.13 due to this issue.

#95 is necessary to enable rqt plugins to use common features, e.g. ROS timers. So reverting the change would imply breaking these. Using an async spinner is the "right" way for nodelets. But as mentioned above the problem is that rviz is not designed to run in a nodelet.

Since the option to run rqt in multi-process mode is not available in Qt 5 I think the only way forward is to make rviz compatible to work as a nodelet.

From @mikepurvis on July 20, 2016 18:38

I can accept that rqt_rviz is a community-maintained plugin, and therefore you're not going to block forward progress on its account, but it's frustrating that this has been broken for over two years. If OSRF do not have the resources to directly fix this issue, could you provide us with design guidance and review in order to get it fixed ourselves in a timely manner?

I understand that the broken state is frustrating. It is simply a use case rviz was not designed for. But I don't expect that anyone at OSRF will have time to work on a fix. Simply because we are aware of the problem since over two years and didn't get to it until now (and none of our projects requests / relies on this functionality which could justify the effort).

I can't speak much about the internals of rviz and therefore can't describe the steps necessary to make it work. The rviz API needs to support instantiating the UI in a widget which gets added to the main application of rqt without being "in control" of the process. It will probably involve using the spinner and / or the queues created externally rather then creating them within rviz. @wjwwood knows much more about the internals of rviz. Maybe he can make more specific suggestions and help on the way?

From @mikepurvis on July 21, 2016 17:47

@wjwwood Should we open a new ticket on the rviz tracker to discuss this? We're interested in fixing the issue here, but we can't invest effort in it unless OSRF will support the process.

From @wjwwood on July 21, 2016 18:58

The problem is that even I don't know what needs to be done to fix the problem. It's going to take some enterprising individual becoming sufficiently familiar with rviz's runtime logic so that they can figure out what needs to change and why, and then communicate the rational to the maintainers (like me).

You can create a new ticket if you like, or continue the discussion on the existing issues and pull requests. I'll try to help you where I can, but honestly I don't have any extraordinary understanding of this part of rviz's code.

I have marked the plugin "experimental" (ros-visualization/rqt_robot_plugins@7d53f4f) since this issue has been long standing and that gives users at least a more accurate expectation.

From @mikepurvis on December 1, 2016 1:16

I have also added an admonition to the rqt_rviz wiki page:

http://wiki.ros.org/rqt_rviz

This issue has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/new-ros1-package-rqt-embed-window/19701/1

ros-visualization/rviz#1635 mitigates this issue / ros-visualization/rviz#1042.
In a few tests, I haven't observed an rqt segfault anymore running both, synchronously and asynchronously updating displays.
However, the fundamental issue remains: There are two spinners (the rqt thread introduced in ros-visualization/rqt#95 and rviz' Qt timer) operating on the global callback queue.
I think the correct solution would be to use a separate callback queue by rqt_rviz because it is important that GUI updates due to incoming ROS messages should happen in the main GUI thread.

I thought I'd also post here. I made rqt_embed_window which allows to embed 'any' window into rqt, including Rviz. This way Rviz hasn't crashed for me. It is running as a different process so I wouldn't see why it would run into this issue.

rviz is not designed to be run as a nodelet in the first place therefore I assume it is running its own spinner. With two different spinners running the context / thread in which objects are created / used is not deterministic anymore which will likely be the problem.

The main issue is that RViz runs its own spinner in the Gui thread, whereas the rqt spinner runs not in the Gui thread but in separate thread. Now it can happen that e.g. a RViz Ros callback is called in this rqt spinner thread or the opposite that an rqt Ros callback is called in the Gui thread. Obviously both can lead to big trouble, as RViz callbacks may update gui/ogre stuff (which is only allowed in the gui thread).

One example where RViz is not thread safe is the interactive_markers. Let's assume you start rqt_rviz, add an interactive marker object with the topic /simple_marker/update and global frame is /base_link and you run rosrun interactive_marker_tutorials simple_marker and you publish the following

$ rostopic pub /simple_marker/update visualization_msgs/InteractiveMarkerUpdate "server_id: '/simple_marker'
seq_num: 5
type: 0
markers: []
poses: []
erases: []" -r 10

Now the interactive marker has an issue with this message here or here and runs this errorReset(), which calls here the resetCb which is a function of RViz here resp. here. This reset deletes the markers. BUT since the Ros callback (and therefore the reset function) was called in a non gui thread and at the same time the interactive marker does some periodic update (which runs always in the gui thread) and the whole thing is not thread safe at all, RViz/interactive marker crashes. This could be solve with signal/slots for the callback functions is InteractiveMarkerDisplay but I don't know if this is the only issue in RViz.