Rescheduling: Add safety checks #216

BenjaminLudwigSAP · 2023-04-12T12:45:04Z

Sanity checks for rescheduling a load balancer are currently handled in the client script. They should move into Octavia. To fail as early as possible, they should already run in the API, i.e. in the arbiter. This issue replaces #190, although that PR can still serve as inspiration for the implementation.

API (simple checks only!)

- LB is not deleted
  - Is this already checked? => yes
- Target host exists
- Target host is not source host
- All certificates are still fetchable
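The API-level checks above could be sketched roughly as follows. This is only an illustration: `RescheduleRequest`, `api_precheck`, `known_hosts` and `fetch_certificate` are hypothetical names, not existing Octavia or driver APIs, and the certificate lookup is passed in as a callable so the sketch does not assume any particular Barbican client.

```python
from dataclasses import dataclass
from typing import Callable, Collection, List


@dataclass
class RescheduleRequest:
    """Hypothetical container for the data the API checks need."""
    lb_is_deleted: bool
    source_host: str
    target_host: str
    certificate_refs: List[str]


def api_precheck(
    req: RescheduleRequest,
    known_hosts: Collection[str],
    fetch_certificate: Callable[[str], object],
) -> List[str]:
    """Run the simple checks and return a list of error messages.

    An empty list means the request passed all checks. All errors are
    collected instead of stopping at the first one, so the caller sees
    every problem at once.
    """
    errors: List[str] = []
    if req.lb_is_deleted:
        errors.append("load balancer is deleted")
    if req.target_host not in known_hosts:
        errors.append(f"target host {req.target_host} does not exist")
    if req.target_host == req.source_host:
        errors.append("target host is the same as the source host")
    for ref in req.certificate_refs:
        try:
            fetch_certificate(ref)
        except Exception as exc:  # assumption: any failure means "not fetchable"
            errors.append(f"certificate {ref} is not fetchable: {exc}")
    return errors
```

Returning a list of messages rather than raising on the first failure is a design choice of this sketch; an actual implementation might instead map each failure to an API error response.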

Screw

- The target host has the same AZ awareness as the source host (read-only, parallelizable)
  - I.e. either both are cross-AZ or both are in the same AZ.
- The VIP ports of all LBs exist
- Current AS3 declarations work (parallelizable by host)
  - Test that the current AS3 declaration works by re-sending it to the old device.
  - We had cases in which the declaration didn't work at all but the LBs were ACTIVE anyway, so we only found out that they didn't work when they failed on the new worker and then again after rollback to the old worker. This can happen, for example, when a Barbican secret has been deleted since the last update.
  - Once rescheduling can be done network-wise, this check would only have to be done once per network.
- AS3 declaration on the target host was successful (so far we check this by e.g. updating the LB without changing anything; if the AS3 declaration doesn't work, we notice because the LB gets stuck in PENDING_UPDATE)
- There is enough quota for SelfIPs (read-only, parallelizable)
  - Each project needs free port quota for at least two additional ports, and for at most sum_by_subnet(number of devices that don't have LBs for this subnet) additional ports.
- There are enough free IP addresses for new SelfIP ports (no IP exhaustion) (read-only, parallelizable)
  - Otherwise Neutron errors out with: Error creating selfips for network <NETWORK_ID>: RetryError[<Future at <ADDRESS> state=finished raised IpAddressGenerationFailureClient>].
  - Rescheduling can still be done in this case if SelfIPs for the same subnet already exist on another device. Rescheduling then has to target that device.
- Security groups on pool members allow whole health monitor subnets (read-only, parallelizable)
  - As opposed to single health monitor IPs.
  - Possible solutions are discussed in #237 (Proposal for mitigating one of the rescheduling risks (security groups / monitor IPs)).
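The SelfIP port quota bounds described above can be illustrated with a small sketch. It assumes that "devices" means all devices a load balancer could be rescheduled to and that the set of devices already serving each subnet is known; both the input shapes and the names `max_additional_ports` and `quota_verdict` are hypothetical and only serve the example.

```python
from typing import Dict, Iterable, Set

# Lower bound stated in the issue: at least two additional ports.
MIN_ADDITIONAL_PORTS = 2


def max_additional_ports(
    devices: Iterable[str],
    devices_with_lbs_by_subnet: Dict[str, Set[str]],
) -> int:
    """Upper bound: for each subnet, count devices without LBs for it, then sum."""
    all_devices = set(devices)
    return sum(
        len(all_devices - present)
        for present in devices_with_lbs_by_subnet.values()
    )


def quota_verdict(
    free_port_quota: int,
    devices: Iterable[str],
    devices_with_lbs_by_subnet: Dict[str, Set[str]],
) -> str:
    """Classify a project's free port quota against both bounds."""
    if free_port_quota < MIN_ADDITIONAL_PORTS:
        return "insufficient"
    upper = max_additional_ports(devices, devices_with_lbs_by_subnet)
    if free_port_quota >= upper:
        return "sufficient"
    # Between the bounds the outcome depends on the actual target device.
    return "uncertain"
```

For example, with four devices and two subnets that are already served by two devices and one device respectively, the upper bound is 2 + 3 = 5 additional ports.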

BenjaminLudwigSAP added the enhancement (New feature or request), good first issue (Good for newcomers), and rescheduling (Relevant for rescheduling semantics in some way) labels Apr 12, 2023

BenjaminLudwigSAP self-assigned this Apr 12, 2023

BenjaminLudwigSAP mentioned this issue Apr 12, 2023

[rescheduling] Add sanity check #190

Closed

BenjaminLudwigSAP changed the title ~~Rescheduling: Add sanity checks~~ Rescheduling: Add sanity/safety checks Apr 12, 2023

BenjaminLudwigSAP changed the title ~~Rescheduling: Add sanity/safety checks~~ Rescheduling: Add safety checks Apr 12, 2023

m-kratochvil mentioned this issue Sep 24, 2023

Proposal for mitigating one of the rescheduling risks (security groups / monitor IPs) #237

Open
