-
Hi, can the open-rmf core, supporting nodes, and the adapters be made highly available or load balanced? Or is this already implemented to a certain extent?
-
We are currently undergoing a push to add redundancy and robustness-to-failure to OpenRMF. Some of the nodes used by the core are already redundant in the…
-
We certainly care a lot about ensuring high availability, since loss of services will have a huge negative impact on business operations. We recently merged a feature to support seamless failover for the traffic schedule node, which is a critical single-point-of-failure component for the traffic management system. There's still a little more work to be done to cover all possible failure modes for the traffic schedule node, but that effort is already under way. As long as that component can fail over gracefully, there shouldn't be anything else that can bring down the traffic management system.

We also have plans to work on graceful failover for fleet adapters so that task management can operate without any risk of interruption, but that effort has not started yet. For any other sub-systems in RMF, we'll need to look at them on a case-by-case basis to determine a good failover strategy. There are many different kinds of systems at play, so I don't expect to find a one-size-fits-all solution.

Regarding load balancing, we make a point of keeping the communication and processing extremely efficient, so I expect it to scale well enough that explicit load balancing of specific services won't be needed, at least for the traffic management system. For one thing, the calculation of motion plans is already distributed across many processes by design, and the negotiation to resolve traffic conflicts is done peer-to-peer rather than bottlenecked by a single process.

There is one potential bottleneck I can think of in the system: the traffic schedule node's conflict detection. The traffic schedule node takes in the traffic plans of all the robots and compares them against each other to identify upcoming traffic conflicts. However, this isn't as bad as it might sound, because that conflict detection takes place in its own thread, and after each cycle of checking is finished, it grabs a snapshot of the latest state of the whole schedule to do another round of checking. So even if it gets flooded with changes in between conflict-detection cycles, it will simply jump to the latest schedule version; even if it falls behind to some degree, it will always catch back up.

It may be plausible to load balance this responsibility by creating multiple traffic schedule nodes and designating each node to check conflicts for a subset of the robots, but I don't think that will be necessary in the near future, and it can be added later without any modification to any of the RMF APIs or specifications. So I don't plan on pursuing that until benchmarks and user requirements start to indicate that it will be worthwhile.
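To make the snapshot-based conflict detection a bit more concrete, here is a minimal, self-contained C++ sketch of that pattern. It is not the actual rmf_traffic implementation or API; all names here (`ScheduleDatabase`, `ScheduleSnapshot`, `detect_conflicts`, and so on) are hypothetical placeholders used only to illustrate how a dedicated detector thread can always catch up by jumping straight to the latest published schedule version.

```cpp
// Hypothetical sketch of a snapshot-based conflict-detection loop.
// None of these names come from rmf_traffic; they only illustrate the idea.
#include <atomic>
#include <chrono>
#include <cstdint>
#include <iterator>
#include <map>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>

struct Trajectory { /* planned motion for one robot (placeholder) */ };

struct ScheduleSnapshot
{
  std::uint64_t version = 0;
  std::map<int, Trajectory> plans;  // robot id -> latest plan
};

class ScheduleDatabase
{
public:
  // Called whenever a robot submits or changes a plan.
  void update(int robot, Trajectory plan)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    auto next = std::make_shared<ScheduleSnapshot>(*latest_);
    next->plans[robot] = std::move(plan);
    next->version = latest_->version + 1;
    latest_ = std::move(next);  // publish a new immutable snapshot
  }

  std::shared_ptr<const ScheduleSnapshot> snapshot() const
  {
    std::lock_guard<std::mutex> lock(mutex_);
    return latest_;
  }

private:
  mutable std::mutex mutex_;
  std::shared_ptr<const ScheduleSnapshot> latest_ =
    std::make_shared<ScheduleSnapshot>();
};

// Compare all pairs of plans in one snapshot (placeholder check).
void detect_conflicts(const ScheduleSnapshot& s)
{
  for (auto i = s.plans.begin(); i != s.plans.end(); ++i)
    for (auto j = std::next(i); j != s.plans.end(); ++j)
    { /* a geometric/temporal overlap test would go here */ }
}

int main()
{
  ScheduleDatabase db;
  std::atomic<bool> running{true};

  // Dedicated conflict-detection thread: each cycle grabs the latest
  // snapshot, so even if many updates arrive mid-cycle, the next cycle
  // jumps to the newest schedule version instead of replaying every
  // intermediate one.
  std::thread detector([&] {
    std::uint64_t last_checked = 0;
    while (running)
    {
      auto snap = db.snapshot();
      if (snap->version != last_checked)
      {
        detect_conflicts(*snap);
        last_checked = snap->version;
      }
      std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
  });

  // Simulate a burst of plan updates from many robots between cycles.
  for (int robot = 0; robot < 100; ++robot)
    db.update(robot, Trajectory{});

  std::this_thread::sleep_for(std::chrono::milliseconds(100));
  running = false;
  detector.join();
}
```

The key design point the sketch tries to show is that the detector never queues per-update work; it only ever reads the most recent immutable snapshot, which is why a flood of schedule changes can delay a cycle but can never cause the detector to fall permanently behind.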