robotpy/robotpy-commands-v1

Segfault if scheduled command falls out of scope

Closed this issue · 10 comments

Consider the following program:

from commandbased import CommandBasedRobot
from wpilib._impl.main import run
from wpilib.command import Command

class TestCommand(Command):
    def __init__(self):
        super().__init__('Test Command')

    def isFinished(self):
        return False


class TinyBot(CommandBasedRobot):
    def robotInit(self):
        cmd = TestCommand()
        cmd.start()


if __name__ == '__main__':
    run(TinyBot)

When run with python3 robot.py sim, it results in the following crash:

07:03:05:148 INFO    : wpilib              : WPILib version 2020.1.2.3
07:03:05:148 INFO    : wpilib              : HAL version 2020.1.2.3
07:03:05:148 INFO    : wpilib              : Running with simulated HAL.
07:03:05:149 INFO    : faulthandler        : registered SIGUSR2 for PID 1918
07:03:05:151 INFO    : halsim_gui          : WPILib HAL Simulation 2020.1.2.2
HAL Extensions: Attempting to load: libhalsim_gui
Simulator GUI Initializing.
Simulator GUI Initialized!
HAL Extensions: Successfully loaded extension
Not loading CameraServerShared
Default DisabledInit() method... Override me!
Default RobotPeriodic() method... Override me!
Fatal Python error: Segmentation fault

Current thread 0x00007fbdb9ac2700 (most recent call first):
  File "/home/frc/tinybot/.venv/lib/python3.8/site-packages/commandbased/commandbasedrobot.py", line 24 in commandPeriodic
  File "/home/frc/tinybot/.venv/lib/python3.8/site-packages/commandbased/commandbasedrobot.py", line 15 in startCompetition
  File "/home/frc/tinybot/.venv/lib/python3.8/site-packages/wpilib/_impl/start.py", line 94 in start
  File "/home/frc/tinybot/.venv/lib/python3.8/site-packages/wpilib/_impl/start.py", line 31 in _start
  File "/usr/lib/python3.8/threading.py", line 870 in run
  File "/usr/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap

Thread 0x00007fbdc2205740 (most recent call first):
  File "/home/frc/tinybot/.venv/lib/python3.8/site-packages/wpilib/_impl/start.py", line 37 in run
  File "/home/frc/tinybot/.venv/lib/python3.8/site-packages/halsim_gui/main.py", line 34 in run
  File "/home/frc/tinybot/.venv/lib/python3.8/site-packages/wpilib/_impl/main.py", line 186 in run
  File "./robot.py", line 22 in <module>
Segmentation fault (core dumped)

Commenting out the cmd.start() line allows the simulator to start.

With the following change this does not segfault:

class TinyBot(CommandBasedRobot):
    def robotInit(self):
        self.cmd = TestCommand()
        self.cmd.start()

Whilst segfaulting is not great... is there a particular reason why you're not holding a reference to the command instance you create?

Ooh, that's nasty. We need to find anywhere in the code that Scheduler::GetInstance() is called and make sure we keep references around..

I've added some notes about this particular situation at https://robotpy.readthedocs.io/en/stable/2020_notes.html#my-code-segfaulted-and-there-s-no-python-stack-trace ... we can definitely fix it, but unless someone else does it, I won't get to it until later in the weekend, as I'd like to focus on getting the new trajectory stuff out.

From the Chief Delphi poll, it seems most teams aren't having trouble with this, but needing to keep a reference to every command in a command group is proving frustrating. Can you point me to what I would need to do/understand to add the "keepalive" behavior? I looked at the commit for issue #1, but I couldn't find a good explanation online for the syntax of "keepalive" as used in the yml files.

@restouffer the best reference for what keepalive does would be https://pybind11.readthedocs.io/en/stable/advanced/functions.html#keep-alive. I'll add a reference to that from robotpy-build's internal "docs".

The challenge isn't actually keepalive in the normal send -- that's fairly straightforward.

The problem here is that the command internally passes itself to the scheduler, and pybind11 isn't aware of that reference being passed -- so we need some callback that lets us know that the scheduler is taking a reference to something, and a callback when it's released. This is similar to the SmartDashboard.putData problem.

Internally, the command is notified that it was removed from the scheduler (https://github.com/wpilibsuite/allwpilib/blob/29c82527a5a16126abe3e5dfb4e18ca432c29437/wpilibOldCommands/src/main/native/cpp/commands/Scheduler.cpp#L224), but that callback is private.

A dumb workaround is to make every command nodelete -- so they would never die. However, I have to imagine such a memory leak would be a very bad idea, particularly for the way that you're using it.

Another alternative is copy all of the source code for the command framework here, and change everything to use shared_ptr -- as I don't think wpilib would accept such a change, certainly not until next season. Then the reference counting would be pretty much automatic.

Another alternative is go back to a pure python implementation, but my understanding was that some teams ran into performance problems?

If I'm understanding correctly, the real issue is calling start() on the Command itself, since at that point there is no other object to "keepalive" it to, because adding it to the Scheduler is handled entirely inside C++. Making Commands nodelete would not be a significant issue for us. All the Commands we create are meant to exist throughout the run-time of the program, either to be called on button events, as subsystem default commands, or in command groups.

Yes, that is the issue.

Please give 2020.2.2.2 a try, it should address this issue. Sorry for the delay.

That works. Thanks!