-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix rescheduling: SelfIPs and subnet routes not cleaned up before creation #214
Conversation
…already here ensure_l2_flow is called from ControllerWorker.add_loadbalancer unconditionally. If a load balancer of the same network (but a different subnet) already exists on this device there will be a subnet route (the one for the subnet of the LB-to-add) that needs cleaning up before the SelfIP for the subnet of the LB-to-add can be created. Also, the whole ensure_l2_flow isn't needed in this case. So execute only sync_l2_selfips_and_subnet_routes_flow in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR includes refactoring that doesn't help reviewing this PR. Could you put it extra, pls.
Also: Refactor: lb.server_group_id == CONF.host is not needed, since LoadBalancerRepository.get_all_by_network() already filters by host.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the commit msg, it fixes #215. But in my test, selfips still not cleaned up.
loadbalancers = self._get_all_loadbalancer(network_id)
loadbalancers = [_lb for _lb in loadbalancers if _lb.id != load_balancer_id]
selfips = list(chain.from_iterable(
self.network_driver.ensure_selfips(loadbalancers, CONF.host, cleanup_orphans=False)))
if loadbalancers:
self.l2sync.sync_l2_selfips_and_subnet_routes_flow(selfips, network_id)
self.sync.tenant_update(network_id, selfips=selfips, loadbalancers=loadbalancers).raise_for_status()
else:
self.sync.tenant_delete(network_id).raise_for_status()
self.l2sync.remove_l2_flow(network_id)
self.network_driver.cleanup_selfips(selfips)
So let's assume there is only one loadbalancer left, it's removed from the the list of loadbalancers in line L639, thus ensure_selfips
in L642 gets an empty array, and doesn't yield any selfips.
This makes L649, cleanup_selfips
a no-op for all possible cases.
I am still testing out multiple combinations.
Good find. But please make sure that you don't run into the problem of #209, where the SelfIP is not deleted due to it being in the same subnet as the default route. |
Oh I absolutely want to run into that problem and see how severe it is ;) |
…eated This unit test should succeed if bug #215 is fixed
So I tried multiple constellations, and only found issues with following setup causing #209 selfips not deleted: Steps to reproduceHave two loadbalancers in the same network but different subnets. Let say
Default Route: 10.0.0.1 -> Reschedule loadbalancer 1 away DEBUG octavia_f5.controller.worker.l2_sync_manager [-] <taskflow.engines.action_engine.engine.ParallelActionEngine object at 0x10d21a970> has moved task 'octavia_f5.controller.worker.tasks.f5_tasks.EnsureSubnetRoutes' (d5832532-19bc-4f8f-b46c-737968e45b31) into state 'FAILURE' from state 'RUNNING' with result '
Failure: octavia_f5.utils.exceptions.IControlRestException: HTTP 400 for POST https://None:???@bigip-ve-2.cc.qa-de-1.cloud.sap/mgmt/tm/net/route: 01070666:3: Static route duplicates Self IP 10.0.0.0%2022 / 255.255.255.0 implied route' (failure=True) _dump /Users/d072895/.virtualenvs/yoga_python3.8.10/lib/python3.8/site-packages/taskflow/listeners/logging.py:56 IssueThis is because
Probable solution: Update default route in Risk assessmentDo we need to fix it for saturday? probably not, the selfips will be not used anymore (and shouldn't clash with anything - since they have been allocated for this specific device/subnet combination). |
a1510e4 only switched out ensure_l2_flow for sync_l2_selfips_and_subnet_routes_flow in the remove_loadbalancer function, however one issue remained[1]: When the LB to be removed is the last one on this device, the shadowing loadbalancers variable is set to an empty array, which is then supplied as argument to ensure_selfips, which in turn yields no SelfIP ports. This means that when the else-branch is executed, selfips will be [], so cleanup_selfips(selfips) won't delete any ports. This is due to the loadbalancers variable being shadowed. This error was introduced in 101fa5f with the first rescheduling implementation. The fix: Instead, a new variable loadbalancers_remaining has to be created, which contains all load balancers on this device except the one that is to be removed. - If there are load balancers remaining and the one to be removed was the last of its subnet, that subnet's SelfIP ports will be removed by ensure_selfips(..., cleanup_orphans=True), before the remaining SelfIPs (and subnet routes) are synced. - If there are no load balancers remaining, then selfips contains only the SelfIP ports for the load balancer to be removed, which is then cleaned up with everything else L2. [1] #214 (review)
- Prevent ControllerWorker() from immediately spawning threads upon initialization - Fix: ensure_selfips is not called with a list of LB IDs, but a list of LB model objects.
Very good observation about #215 not being fixed. I corrected that in another commit and also fixed the unit test in yet another commit. |
Right now I am testing for the three possible scenarios (1. LBs remain in same subnet, 2. LBs remain but all from other subnet(s), 3. no LBs remain) and I am seeing strange behavior for the second scenario. The SelfIP for the subnet is deleted from the source F5, but the corresponding SelfIP port is not. I pushed the commits anyway so you can have a look already. Edit: Apparently the reason for the behavior I see is the semantics of the |
If the subnet of the LB to be removed is now empty, remove the unneeded SelfIP ports. Since they are not considered orphaned (they aren't even considered by ensure_selfips because the subnet is gone) they need to be determined by comparing SelfIP ports for all LBs with those of the remaining LBs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving to unblock roll-out. PR was discussed earlier in the Octavia round.
ControllerWorker.add_loadbalancer
callsensure_l2_flow
unconditionally.If a load balancer of the same network (but a different subnet) already exists on this device there will be a subnet route (the one for the subnet of the LB-to-add) that needs cleaning up before the SelfIP for the subnet of the LB-to-add can be created.
Also, the whole
ensure_l2_flow
isn't needed in this case. So execute only sync_l2_selfips_and_subnet_routes_flow in this case.Same for
ControllerWorker.remove_loadbalancer
.Closes #215