-
Hi, can the open-rmf core, supporting nodes, and the adapters be made highly available or load balanced? Or is this already implemented to a certain extent?
-
We are currently undergoing a push to add redundancy and robustness-to-failure to OpenRMF. Some of the nodes used by the core are already redundant in the…
-
We certainly care a lot about ensuring high availability, since loss of services will have a huge negative impact on business operations. We recently merged a feature to support seamless failover for the traffic schedule node, which is a critical single-point-of-failure component for the traffic management system. There's still a little more work to be done to cover all possible failure modes for the traffic schedule node, but that effort is already under way. As long as that component can fail over gracefully, there shouldn't be anything else that can bring down the traffic management system.

We also have plans to work on graceful failover for fleet adapters so that task management can operate without any risk of interruption, but that effort has not started yet. For any other sub-systems in RMF, we'll need to look at them on a case-by-case basis to determine a good failover strategy. There are many different kinds of systems at play, so I don't expect to find a one-size-fits-all solution.

Regarding load balancing, we make a point of keeping the communication and processing extremely efficient, so I expect it to scale well enough that explicit load balancing of specific services won't be needed, at least for the traffic management system. For one thing, the calculation of motion plans is already distributed across many processes by design, and the negotiation to resolve traffic conflicts is done peer-to-peer rather than bottlenecked by a single process.

There is one potential bottleneck I can think of in the system: the traffic schedule node's conflict detection. The traffic schedule node takes in the traffic plans of all the robots and compares them against each other to identify upcoming traffic conflicts. However, this isn't as bad as it might sound, because that conflict detection takes place in its own thread, and after each cycle of checking is finished, it grabs a snapshot of the latest state of the whole schedule to do another round of checking. So even if it gets flooded with changes in between conflict-detection cycles, it will simply jump to the latest schedule version; even if it falls behind to some degree, it will always catch back up.

It may be plausible to load balance this responsibility by creating multiple traffic schedule nodes and designating each node to check conflicts for a subset of the robots, but I don't think that will be necessary in the near future, and it can be added later without any modification to any of the RMF APIs or specifications. So I don't plan on pursuing that until benchmarks and user requirements start to indicate that it will be worthwhile.
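To make the snapshot-based conflict detection a bit more concrete, here is a minimal, self-contained C++ sketch of that pattern. It is not the actual rmf_traffic implementation or API; all names here (`ScheduleDatabase`, `ScheduleSnapshot`, `detect_conflicts`, and so on) are hypothetical placeholders used only to illustrate how a dedicated detector thread can always catch up by jumping straight to the latest published schedule version.

```cpp
// Hypothetical sketch of a snapshot-based conflict-detection loop.
// None of these names come from rmf_traffic; they only illustrate the idea.
#include <atomic>
#include <chrono>
#include <cstdint>
#include <iterator>
#include <map>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>

struct Trajectory { /* planned motion for one robot (placeholder) */ };

struct ScheduleSnapshot
{
  std::uint64_t version = 0;
  std::map<int, Trajectory> plans;  // robot id -> latest plan
};

class ScheduleDatabase
{
public:
  // Called whenever a robot submits or changes a plan.
  void update(int robot, Trajectory plan)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    auto next = std::make_shared<ScheduleSnapshot>(*latest_);
    next->plans[robot] = std::move(plan);
    next->version = latest_->version + 1;
    latest_ = std::move(next);  // publish a new immutable snapshot
  }

  std::shared_ptr<const ScheduleSnapshot> snapshot() const
  {
    std::lock_guard<std::mutex> lock(mutex_);
    return latest_;
  }

private:
  mutable std::mutex mutex_;
  std::shared_ptr<const ScheduleSnapshot> latest_ =
    std::make_shared<ScheduleSnapshot>();
};

// Compare all pairs of plans in one snapshot (placeholder check).
void detect_conflicts(const ScheduleSnapshot& s)
{
  for (auto i = s.plans.begin(); i != s.plans.end(); ++i)
    for (auto j = std::next(i); j != s.plans.end(); ++j)
    { /* a geometric/temporal overlap test would go here */ }
}

int main()
{
  ScheduleDatabase db;
  std::atomic<bool> running{true};

  // Dedicated conflict-detection thread: each cycle grabs the latest
  // snapshot, so even if many updates arrive mid-cycle, the next cycle
  // jumps to the newest schedule version instead of replaying every
  // intermediate one.
  std::thread detector([&] {
    std::uint64_t last_checked = 0;
    while (running)
    {
      auto snap = db.snapshot();
      if (snap->version != last_checked)
      {
        detect_conflicts(*snap);
        last_checked = snap->version;
      }
      std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
  });

  // Simulate a burst of plan updates from many robots between cycles.
  for (int robot = 0; robot < 100; ++robot)
    db.update(robot, Trajectory{});

  std::this_thread::sleep_for(std::chrono::milliseconds(100));
  running = false;
  detector.join();
}
```

The key design point the sketch tries to show is that the detector never queues per-update work; it only ever reads the most recent immutable snapshot, which is why a flood of schedule changes can delay a cycle but can never cause the detector to fall permanently behind.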