Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix rescheduling: SelfIPs and subnet routes not cleaned up before creation #214

Merged
merged 10 commits into from
Apr 14, 2023

Conversation

BenjaminLudwigSAP
Copy link
Collaborator

@BenjaminLudwigSAP BenjaminLudwigSAP commented Apr 12, 2023

ControllerWorker.add_loadbalancer calls ensure_l2_flow unconditionally.
If a load balancer of the same network (but a different subnet) already exists on this device there will be a subnet route (the one for the subnet of the LB-to-add) that needs cleaning up before the SelfIP for the subnet of the LB-to-add can be created.
Also, the whole ensure_l2_flow isn't needed in this case. So execute only sync_l2_selfips_and_subnet_routes_flow in this case.
Same for ControllerWorker.remove_loadbalancer.

Closes #215

…already here

ensure_l2_flow is called from ControllerWorker.add_loadbalancer
unconditionally. If a load balancer of the same network (but a
different subnet) already exists on this device there will be a subnet
route (the one for the subnet of the LB-to-add) that needs cleaning up
before the SelfIP for the subnet of the LB-to-add can be
created. Also, the whole ensure_l2_flow isn't needed in this case.

So execute only sync_l2_selfips_and_subnet_routes_flow in this case.
@BenjaminLudwigSAP BenjaminLudwigSAP added the rescheduling Relevant for rescheduling semantics in some way label Apr 12, 2023
Copy link
Collaborator

@notandy notandy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR includes refactoring that doesn't help reviewing this PR. Could you put it extra, pls.

octavia_f5/controller/worker/controller_worker.py Outdated Show resolved Hide resolved
octavia_f5/controller/worker/controller_worker.py Outdated Show resolved Hide resolved
Also: Refactor: lb.server_group_id == CONF.host is not needed, since
LoadBalancerRepository.get_all_by_network() already filters by host.
Copy link
Collaborator

@notandy notandy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the commit msg, it fixes #215. But in my test, selfips still not cleaned up.

https://github.com/sapcc/octavia-f5-provider-driver/pull/214/files#diff-a8118ed2a24f98b55a7b209da50d45df02e002a7cf1ef5f1ef462893e7af0e47R646-R657

        loadbalancers = self._get_all_loadbalancer(network_id)
        loadbalancers = [_lb for _lb in loadbalancers if _lb.id != load_balancer_id]
        selfips = list(chain.from_iterable(
            self.network_driver.ensure_selfips(loadbalancers, CONF.host, cleanup_orphans=False)))
        if loadbalancers:
            self.l2sync.sync_l2_selfips_and_subnet_routes_flow(selfips, network_id)
            self.sync.tenant_update(network_id, selfips=selfips, loadbalancers=loadbalancers).raise_for_status()
        else:
            self.sync.tenant_delete(network_id).raise_for_status()
            self.l2sync.remove_l2_flow(network_id)
            self.network_driver.cleanup_selfips(selfips)

So let's assume there is only one loadbalancer left, it's removed from the the list of loadbalancers in line L639, thus ensure_selfips in L642 gets an empty array, and doesn't yield any selfips.

This makes L649, cleanup_selfips a no-op for all possible cases.

I am still testing out multiple combinations.

@BenjaminLudwigSAP
Copy link
Collaborator Author

Good find. But please make sure that you don't run into the problem of #209, where the SelfIP is not deleted due to it being in the same subnet as the default route.
I will test the scenario you described.

@notandy
Copy link
Collaborator

notandy commented Apr 13, 2023

Good find. But please make sure that you don't run into the problem of #209, where the SelfIP is not deleted due to it being in the same subnet as the default route.

Oh I absolutely want to run into that problem and see how severe it is ;)

…eated

This unit test should succeed if bug #215 is fixed
@notandy
Copy link
Collaborator

notandy commented Apr 13, 2023

So I tried multiple constellations, and only found issues with following setup causing #209 selfips not deleted:

Steps to reproduce

Have two loadbalancers in the same network but different subnets. Let say

  • Loadbalancer 1, uses Subnet 1
    • GW 10.0.0.1
  • Loadbalancer 2, uses Subnet 2
    • GW 192.168.0.1

Default Route: 10.0.0.1

-> Reschedule loadbalancer 1 away

DEBUG octavia_f5.controller.worker.l2_sync_manager [-] <taskflow.engines.action_engine.engine.ParallelActionEngine object at 0x10d21a970> has moved task 'octavia_f5.controller.worker.tasks.f5_tasks.EnsureSubnetRoutes' (d5832532-19bc-4f8f-b46c-737968e45b31) into state 'FAILURE' from state 'RUNNING' with result '

Failure: octavia_f5.utils.exceptions.IControlRestException: HTTP 400 for POST https://None:???@bigip-ve-2.cc.qa-de-1.cloud.sap/mgmt/tm/net/route: 01070666:3: Static route duplicates Self IP 10.0.0.0%2022 / 255.255.255.0 implied route' (failure=True) _dump /Users/d072895/.virtualenvs/yoga_python3.8.10/lib/python3.8/site-packages/taskflow/listeners/logging.py:56

Issue

This is because sync_l2_selfips_and_subnet_routes_flow is not updating the default route at all.

'{"code":400,"message":"01070604:3: Cannot delete IP 10.0.0.152%2022 because it would leave a route unreachable.","errorStack":[],"apiError":3}'

_do_sync_l2_selfips_and_subnet_routes_flow Flow is trying to remove the selfip of the loadbalancer 1, but the default route needs this selfip to reach the network.

Probable solution: Update default route in sync_l2_selfips_and_subnet_routes_flow as the first thing (beware, it cannot be done in parallel for both devies since routes are synced automatically), then do the rest, cleanup and sync of selfips and static routes.

Risk assessment

Do we need to fix it for saturday? probably not, the selfips will be not used anymore (and shouldn't clash with anything - since they have been allocated for this specific device/subnet combination).
The default gateway is also not important as long it's a valid default gateway.

a1510e4 only switched out ensure_l2_flow for
sync_l2_selfips_and_subnet_routes_flow in the remove_loadbalancer
function, however one issue remained[1]:

When the LB to be removed is the last one on this device, the
shadowing loadbalancers variable is set to an empty array, which is
then supplied as argument to ensure_selfips, which in turn yields no
SelfIP ports. This means that when the else-branch is executed,
selfips will be [], so cleanup_selfips(selfips) won't delete any
ports. This is due to the loadbalancers variable being shadowed. This
error was introduced in 101fa5f with the first rescheduling
implementation.

The fix:
Instead, a new variable loadbalancers_remaining has to be created,
which contains all load balancers on this device except the one that
is to be removed.
- If there are load balancers remaining and the one to be removed was
  the last of its subnet, that subnet's SelfIP ports will be removed
  by ensure_selfips(..., cleanup_orphans=True), before the remaining
  SelfIPs (and subnet routes) are synced.
- If there are no load balancers remaining, then selfips contains only
  the SelfIP ports for the load balancer to be removed, which is then
  cleaned up with everything else L2.

[1] #214 (review)
- Prevent ControllerWorker() from immediately spawning threads upon
  initialization
- Fix: ensure_selfips is not called with a list of LB IDs, but a list
  of LB model objects.
@BenjaminLudwigSAP
Copy link
Collaborator Author

Very good observation about #215 not being fixed. I corrected that in another commit and also fixed the unit test in yet another commit.

@BenjaminLudwigSAP
Copy link
Collaborator Author

BenjaminLudwigSAP commented Apr 14, 2023

Right now I am testing for the three possible scenarios (1. LBs remain in same subnet, 2. LBs remain but all from other subnet(s), 3. no LBs remain) and I am seeing strange behavior for the second scenario. The SelfIP for the subnet is deleted from the source F5, but the corresponding SelfIP port is not. I pushed the commits anyway so you can have a look already.

Edit: Apparently the reason for the behavior I see is the semantics of the cleanup_orphans parameter on the ensure_selfips function. A SelfIP port is considered orphaned when e.g. its subnet is not in the list of needed subnets. However, only the SelfIPs for the subnets of the load_balancers argument are considered. Since I don't supply the LB-to-remove in that argument, the SelfIP port from its subnet isn't even considered and thus not checked whether it's orphaned. Were I to supply the LB-to-remove, then it would count as needing the SelfIP port, so the port would still not be deleted. Ideally, the function should know that there is a subnet it has to consider, but that the subnet does not have LBs.

If the subnet of the LB to be removed is now empty, remove the
unneeded SelfIP ports. Since they are not considered orphaned (they
aren't even considered by ensure_selfips because the subnet is gone)
they need to be determined by comparing SelfIP ports for all LBs with
those of the remaining LBs.
Copy link
Contributor

@m-kratochvil m-kratochvil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving to unblock roll-out. PR was discussed earlier in the Octavia round.

@notandy notandy self-requested a review April 14, 2023 17:16
@BenjaminLudwigSAP BenjaminLudwigSAP merged commit 903cc55 into stable/yoga-m3 Apr 14, 2023
@BenjaminLudwigSAP BenjaminLudwigSAP deleted the fix-rescheduling branch April 14, 2023 17:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
rescheduling Relevant for rescheduling semantics in some way
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rescheduling: SelfIPs not deleted, only created
3 participants