Rescheduling: Add safety checks #216

Open
1 of 11 tasks
BenjaminLudwigSAP opened this issue Apr 12, 2023 · 0 comments

BenjaminLudwigSAP commented Apr 12, 2023

Sanity checks for rescheduling a load balancer are currently handled in the client script. They should move into Octavia. To fail as early as possible, they should already happen in the API, i.e. in the arbiter. This issue replaces #190, although that PR can still serve as inspiration for the implementation.

API (simple checks only!)

  • LB is not deleted
    Is this already checked? => yes
  • Target host exists
  • Target host is not source host
  • All certificates are still fetchable
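
The simple API-side checks listed above could be sketched roughly as follows. This is only an illustration, not Octavia code: `LoadBalancer`, `precheck_reschedule`, `known_hosts`, and `fetch_certificate` are hypothetical names, and the `"DELETED"` status string is an assumption about how a deleted LB is marked.

```python
from dataclasses import dataclass, field
from typing import Callable, Collection, List


@dataclass
class LoadBalancer:
    """Hypothetical, minimal view of a load balancer record."""
    id: str
    host: str
    provisioning_status: str
    certificate_refs: List[str] = field(default_factory=list)


def precheck_reschedule(
    lb: LoadBalancer,
    target_host: str,
    known_hosts: Collection[str],
    fetch_certificate: Callable[[str], object],
) -> List[str]:
    """Return a list of problems; an empty list means all checks passed."""
    problems: List[str] = []
    # Assumption: a deleted LB carries the status string "DELETED".
    if lb.provisioning_status == "DELETED":
        problems.append(f"load balancer {lb.id} is deleted")
    if target_host not in known_hosts:
        problems.append(f"target host {target_host} does not exist")
    if target_host == lb.host:
        problems.append("target host is the source host")
    # fetch_certificate is a hypothetical callable that raises on failure.
    for ref in lb.certificate_refs:
        try:
            fetch_certificate(ref)
        except Exception as exc:
            problems.append(f"certificate {ref} is not fetchable: {exc}")
    return problems
```

Collecting all problems instead of raising on the first one lets the API report everything that is wrong in a single response.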

Screw

  • Target host has the same AZ awareness as the source host (read-only, parallelizable)
    I.e. either both are cross-AZ or both are in the same AZ.
  • The VIP ports of all LBs exist
  • Current AS3 declarations work (parallelizable by host)
    Test that the current AS3 declaration works by re-sending it to the old device.
    We had cases in which the declaration didn't work at all but the LBs were ACTIVE anyway, so we only found out that they were broken when they failed on the new worker and then failed again after rollback to the old worker. This can happen, for example, when a Barbican secret has been deleted since the last update.
    When rescheduling can be done network-wise, this check would only have to be done once per network.
  • AS3 declaration on target host was successful (so far we check this by, e.g., updating the LB without changing anything; if the AS3 declaration doesn't work, we notice because the LB gets stuck in PENDING_UPDATE)
  • There is enough quota for SelfIPs (read-only, parallelizable)
    Each project needs at least two additional ports of quota, and at most sum_by_subnet(number of devices that don't have LBs for this subnet) additional ports.
  • There are enough free IP addresses for new SelfIP ports (no IP exhaustion) (read-only, parallelizable)
    Otherwise Neutron errors out with: Error creating selfips for network <NETWORK_ID>: RetryError[<Future at <ADDRESS> state=finished raised IpAddressGenerationFailureClient>].
    Rescheduling can still be done in this case if SelfIPs exist for another device for the same subnet. Rescheduling then has to be done towards that device.
  • Security Groups on pool members allow for whole health monitor subnets (read-only, parallelizable)
    As opposed to single health monitor IPs.
    Possible solutions are discussed in "Proposal for mitigating one of the rescheduling risks (security groups / monitor IPs)" (#237).
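
The upper bound from the SelfIP quota check could be computed along these lines. This is a minimal sketch under stated assumptions: the function names and the shape of the inputs (a set of devices and a mapping from subnet to the devices that already have LBs there) are hypothetical, and treating a negative limit as "unlimited" is an assumption about how the quota is reported.

```python
from typing import Dict, Iterable, Set


def max_additional_selfip_ports(
    subnets: Iterable[str],
    devices: Set[str],
    devices_with_lbs: Dict[str, Set[str]],
) -> int:
    """sum_by_subnet(number of devices that don't have LBs for this subnet)."""
    return sum(
        len(devices - devices_with_lbs.get(subnet, set()))
        for subnet in subnets
    )


def has_port_headroom(limit: int, used: int, needed: int) -> bool:
    """Check whether `needed` more ports fit into the project's port quota."""
    # Assumption: a negative limit means the quota is unlimited.
    if limit < 0:
        return True
    return limit - used >= needed
```

For example, with three devices and two subnets, where one subnet already has LBs on a single device and the other on all three, the upper bound is 2 additional ports.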

BenjaminLudwigSAP added the enhancement, good first issue, and rescheduling labels on Apr 12, 2023
BenjaminLudwigSAP self-assigned this on Apr 12, 2023
BenjaminLudwigSAP changed the title from "Rescheduling: Add sanity checks" to "Rescheduling: Add sanity/safety checks" on Apr 12, 2023
BenjaminLudwigSAP changed the title from "Rescheduling: Add sanity/safety checks" to "Rescheduling: Add safety checks" on Apr 12, 2023