open-rmf/rmf_ros2

Failover functionality for fleet adapters


Currently, restarting a fleet adapter after a crash does not restore the tasks assigned to the fleet. Ideally, the fleet adapter would be aware of exactly which phase was active for each robot and able to command its robots to pick up where they left off before the crash. This requires:

  • The ability to serialize the state of the fleet adapter: its robots, the task queue per robot, the current phase of each task, etc. This information can be logged into a YAML file (a hypothetical sketch follows this list).
  • A backup node that can take over after the crash and is initialized using the serialized state of the previous instance.
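
To make the first point concrete, the serialized state could look something like the following. This is a purely illustrative sketch; none of the field names below are a settled schema.

```yaml
# Hypothetical fleet adapter backup file; all fields are illustrative.
fleet_name: tinyRobot
robots:
  tinyRobot1:
    task_queue:
      - task_id: delivery_001
        type: Delivery
    active_task:
      task_id: delivery_001
      current_phase: 2    # e.g. the "pickup" phase
  tinyRobot2:
    task_queue: []
    active_task: ~        # idle / responsive-wait
```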

The serialization of task state may be achieved by open-rmf/rmf_task#32
Recovery/continuation of a phase upon restart may be dictated by the error handling protocol for that particular phase: open-rmf/rmf_task#33

Briefly thinking through the possible scenarios where the fleet adapter could crash and need different error handling:

  • Currently executing a task / responsive-wait
  • Currently negotiating a conflict
  • Just sent out a bid for a task, waiting for response

Hmm, looking at this, it's pretty similar to what we are doing for our schedule node. The way that works is essentially to serialize the state whenever there is a change in the state of the participant registry. I think a similar strategy will be sufficient. That being said, in the current implementation, if there is a change in the definition of a task we would have to rewrite it. Part of me wonders if there is a better abstraction that can be reused across all nodes rather than implementing this individually, node by node. For failover we have @marcoag's stubborn_buddies, but the serialization is still pretty manual.
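
For illustration, the write-on-change strategy boils down to something like the sketch below, assuming the registry exposes a change callback; every name here is made up, not the actual schedule node code.

```cpp
#include <fstream>
#include <functional>
#include <string>

// Stand-in for the participant registry; all names are illustrative.
struct ParticipantRegistryStub
{
  // Fired with the full serialized state whenever the registry changes.
  std::function<void(const std::string&)> on_change;
};

// Persist every state change so a restarted node can reload the last
// known-good snapshot from disk.
void attach_backup(ParticipantRegistryStub& registry, const std::string& path)
{
  registry.on_change = [path](const std::string& serialized_state)
  {
    std::ofstream file(path, std::ios::trunc);
    file << serialized_state;
  };
}
```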

I'll go ahead and start listing out the pieces of information that need to be serialized as part of the fleet adapter state. This list might grow as I think of more things:

  • Task request assignments for the fleet adapter
  • Progress of active tasks

For "Progress of active tasks" the redesign that I'm working on will introduce a backup() function whose role is to provide a serialized representation of the task's current state. These backups will be issued by the tasks whenever a new task milestone is met that would be relevant for restoring the task to its current state if a process or network loss occurs. The new API also has a restore(~) function to restore the state of the task using the backup information.

I would recommend we take a two-pronged approach to backup and restoring (a rough sketch follows the list):

  1. We have topics that publish the serialized states so that failover nodes on other machines can listen in and pick things back up where they left off when a crash or disconnection occurs.
  2. We save the serialized states to the filesystem of the fleet adapter so that if the process crashes we have the choice of restarting the fleet adapter on the same machine without the need for a separate listener.
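
A minimal sketch of how both prongs could hang off one call, assuming the serialized state arrives as a string; the node, topic, and file names here are made up:

```cpp
#include <rclcpp/rclcpp.hpp>
#include <std_msgs/msg/string.hpp>
#include <fstream>

// Hypothetical helper, not an existing rmf_fleet_adapter class.
class BackupBroadcaster : public rclcpp::Node
{
public:
  BackupBroadcaster()
  : Node("fleet_adapter_backup"),
    // transient_local QoS lets a late-joining failover node receive the
    // most recent backup without waiting for the next state change.
    pub_(create_publisher<std_msgs::msg::String>(
      "fleet_state_backup", rclcpp::QoS(1).transient_local()))
  {}

  void save(const std::string& serialized_state)
  {
    // Prong 1: broadcast so failover nodes on other machines stay in sync.
    std_msgs::msg::String msg;
    msg.data = serialized_state;
    pub_->publish(msg);

    // Prong 2: persist locally so a restarted process can reload it.
    std::ofstream file("fleet_adapter_backup.yaml", std::ios::trunc);
    file << serialized_state;
  }

private:
  rclcpp::Publisher<std_msgs::msg::String>::SharedPtr pub_;
};
```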

Here's a question that I think is worth posing:

In the redesign, tasks are going to have logs which track important events that occur while the task is being executed. Should the log information be included in the backup and get restored when a task is restored? Or should the backup data of a task be just the minimal amount of information needed to resume the task from where it left off? I'm leaning towards the latter, since the log information should be getting published live as it's produced, so there shouldn't be much value in saving and restoring it as far as I can figure.

I'm leaning towards the latter as well; it might be more organized if each log corresponds exactly to the lifetime of a fleet adapter process.

I'm wondering if there may be components from these classes,

https://github.com/open-rmf/rmf_ros2/blob/main/rmf_traffic_ros2/src/rmf_traffic_ros2/schedule/YamlLogger.cpp
https://github.com/open-rmf/rmf_ros2/blob/main/rmf_traffic_ros2/src/rmf_traffic_ros2/schedule/YamlSerialization.cpp

which might be duplicated in the backup and restore functions in the new API (and in other possible failover capabilities in the dispatcher class as well)?

I wrote those very specifically for the scheduler node, so I don't know how useful they will be outside of it. One thing to note about my design is that we could swap out logging backends rather than being YAML-specific (this is useful for writing tests). The abstract classes are here:
https://github.com/open-rmf/rmf_ros2/blob/fcb3123707cbe6baead1e7bafd9fe7aeadc25183/rmf_traffic_ros2/include/rmf_traffic_ros2/schedule/ParticipantRegistry.hpp
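
For reference, the swappable-backend idea reduces to something like this minimal sketch; the interface and class names below are illustrative, not the actual ParticipantRegistry code:

```cpp
#include <string>

// Abstract backend so the storage medium can be swapped out.
class AbstractBackupWriter
{
public:
  virtual void write(const std::string& serialized_state) = 0;
  virtual std::string read() const = 0;
  virtual ~AbstractBackupWriter() = default;
};

// A test backend keeps everything in memory, so unit tests never have to
// touch the filesystem; a production backend would write YAML to disk.
class InMemoryBackupWriter : public AbstractBackupWriter
{
public:
  void write(const std::string& serialized_state) override
  {
    state_ = serialized_state;
  }

  std::string read() const override
  {
    return state_;
  }

private:
  std::string state_;
};
```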

I don't know if we actually need or want that, since backup will already generate a string. Another thing to note is that in its current state, backup will likely make yaml-cpp (or whatever other serialization library) a dependency of the task library.

I guess the only useful thing would be the way YamlLogger loads individual AtomicOperations without writing them back to file, while requiring minimal changes to rmf_traffic or the node itself.

I agree with @arjo129. I think the YamlLogger/Serialization for the traffic schedule will make a great reference point, but the implementation is different enough from what we need that I wouldn't worry about trying to borrow it directly.

Since these are all likely to be implementation details (not part of the public API), we can start by prototyping the backup system independently, and then consider refactoring/redesigning if we discover there is more overlap than we originally anticipated.

I am understanding that we will be going ahead with this prototyping, making it work for the current API, while keeping in mind to make it easy to migrate to the new backup() and restore(~) API functions. Do let me know if I am mistaken 🙇

I am considering doing the following to start prototyping (heavily inspired by the schedule logging system, thanks):

  • Add a YamlLogger.cpp in rmf_fleet_adapter/src/rmf_fleet_adapter/tasks that handles interfacing with the local serialized task "database file" as mentioned in point 2 here.

  • Add a file internal_YamlSerialization.hpp in rmf_fleet_adapter/src/rmf_fleet_adapter/tasks with function declarations such as the following (a hedged sketch appears after this list):

    • rmf_fleet_adapter::tasks::[ChargeBattery|Loop|...] task(const YAML::Node& node);
    • YAML::Node serialize(const rmf_fleet_adapter::tasks::[ChargeBattery|Loop|...]& task);
  • Add a YamlSerialization.cpp in rmf_fleet_adapter/src/rmf_fleet_adapter/tasks that implements internal_YamlSerialization.hpp. This part is probably something like implementing backup() in the new API for each of the existing Tasks we already have.

  • Add a separate private ROS2 node, perhaps SerializedStatePublisher.[hpp|cpp], in rmf_fleet_adapter/src/rmf_fleet_adapter/tasks that listens for updates to the local serialized task database and publishes the file's YAML contents over a ROS2 topic, as mentioned in point 1 here.

  • Add code to initialize a YamlLogger and SerializedStatePublisher in Adapter::Implementation.

  • To handle migrating to the new API, we can rewrite the implementation in YamlSerialization.cpp to simply call the rmf_fleet_adapter::Task::[backup|restore] functions internally.
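
For the internal_YamlSerialization.hpp bullet above, here is a hedged sketch of what those declarations might expand into. The task type and field names are stand-ins for illustration, not the real task classes:

```cpp
#include <cstddef>
#include <string>
#include <yaml-cpp/yaml.h>

namespace rmf_fleet_adapter {
namespace tasks {

// Stand-in for an actual task type; the real classes will differ.
struct LoopTaskState
{
  std::string start_waypoint;
  std::string finish_waypoint;
  std::size_t remaining_loops;
};

// Rebuild a task's state from a YAML node previously written by serialize().
inline LoopTaskState loop_task(const YAML::Node& node)
{
  LoopTaskState state;
  state.start_waypoint = node["start"].as<std::string>();
  state.finish_waypoint = node["finish"].as<std::string>();
  state.remaining_loops = node["remaining_loops"].as<std::size_t>();
  return state;
}

// Produce the YAML representation consumed by loop_task() above.
inline YAML::Node serialize(const LoopTaskState& task)
{
  YAML::Node node;
  node["start"] = task.start_waypoint;
  node["finish"] = task.finish_waypoint;
  node["remaining_loops"] = task.remaining_loops;
  return node;
}

}  // namespace tasks
}  // namespace rmf_fleet_adapter
```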

I am not sure if it makes sense to build an abstraction similar to the work done in ParticipantRegistry/AtomicOperations for schedule node logging, as it would likely be (re-)implemented in the new API.

I have almost definitely missed many important considerations, but hopefully I can pick your brains if this is the general direction to move towards 🕺

I think with how extreme the upcoming redesign is, it doesn't necessarily make sense to try to target a prototype at the current fleet adapter implementation and then try to pivot it to the new API after it's available.

Instead, I think it would be more productive to target your effort at the new API with total disregard for the current fleet adapter implementation. Then, when we get around to reimplementing the fleet adapter for the new design, it will be with the benefit of the effort that you will have put into the failover system.

I understand it's harder to hit a moving target (i.e. the redesign that I'm doing) than a fixed target (i.e. the existing, soon-to-be-deprecated fleet adapter implementation), but I suspect targeting the fleet adapter as-is will end up generating a lot of work that will need to be discarded because it won't fit into the redesign.

Feel free to reach out to me if you'd like some guidance on where to dig into this issue and what kind of effort could be beneficial.