Blueprint execution stuck on sled-agent failed requests #7373
In a racklet environment, I ran into blueprint_executor errors after expunging a sled (note: prior to this, I expunged some disks on a different sled but executed blueprint updates without issues).
Enabled the new blueprint and saw that the executor failed to complete it:
(The complete error message and other debug info are captured in the following comments due to the character-length limit on issue comments.)
Logged into the sled in question. It's a scrimlet and its sled-agent was up and running:
There are, however, many warnings in the log file related to DNS and client disconnection (the latter corresponding to the blueprint_executor error):
At this point, there are 4 internal_dns zones. The one being expunged:
The newly provisioned internal_dns zone (not the one being expunged) appears to be the problematic DNS peer that's returning the DNS lookup errors.
The complete log file of the internal_dns zone at startup time:
The problematic internal_dns zone:
The sled generally appears healthy, and the disk space of the dataset used for provisioning new zones seems adequate:
I've also copied the logs to
Haven't dug deeply into this, but I think there are multiple things going on here, some of which we know and some of which are new:
Digging into this one a little bit:
It does look like the internal DNS propagation background task is unhappy:
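For a rough sense of what one activation of a propagation task like this does, here is a minimal runnable sketch with stub types; the names are illustrative, not Nexus's actual API. The shape is: push the versioned DNS config to every known internal DNS server, and treat any per-server failure as a failed pass.

```rust
// Minimal sketch, not omicron code: stub types stand in for the real
// propagation task, which pushes versioned DNS config to each server.
use std::collections::HashMap;

/// Stand-in for one internal DNS server's config interface.
struct DnsServer {
    addr: &'static str,
    reachable: bool, // models the timeouts seen from three of the sleds
}

impl DnsServer {
    /// Push the full config at a given generation; Err models a timeout.
    fn put_config(&self, generation: u64) -> Result<(), String> {
        if self.reachable {
            println!("{}: accepted generation {}", self.addr, generation);
            Ok(())
        } else {
            Err("request timed out".to_string())
        }
    }
}

fn main() {
    // Placeholder addresses; not values from this issue.
    let servers = [
        DnsServer { addr: "[fd00::a]:53", reachable: true },
        DnsServer { addr: "[fd00::b]:53", reachable: true },
        DnsServer { addr: "[fd00::c]:53", reachable: false },
    ];
    let generation = 42;

    // One activation: attempt every server and collect per-server failures.
    // A single unreachable server marks the whole pass as failed, which is
    // what "unhappy" looks like here even when other servers are fine.
    let mut failures: HashMap<&str, String> = HashMap::new();
    for s in &servers {
        if let Err(e) = s.put_config(generation) {
            failures.insert(s.addr, e);
        }
    }
    if failures.is_empty() {
        println!("propagation complete at generation {generation}");
    } else {
        println!("propagation failed: {failures:?}");
    }
}
```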
From your blueprint above:
Trying to track down why the DNS propagation RPW is failing: I think there's a sled-agent bug here (will open a separate issue once I'm more sure). From sled 17, where the new internal DNS zone is running, we can get to the new zone's underlay address:
But those same requests appear to time out from the other three sleds' gzs. There is a Nexus zone running on sled 17, and if we specifically check it, we see that its DNS propagation task has succeeded:
which means all three of the DNS servers should have records to serve. But from within the switch zone, we get three different results when querying the three DNS servers.
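To make that comparison concrete, here's a hedged, standard-library-only sketch of the kind of probe being described: a hand-rolled DNS query that distinguishes "timed out" from "answered with N records". The addresses and name below are placeholders, not values from this issue.

```rust
use std::net::UdpSocket;
use std::time::Duration;

/// Hand-encode a single-question DNS query (QTYPE=AAAA, QCLASS=IN).
fn build_query(name: &str) -> Vec<u8> {
    let mut buf = vec![
        0x12, 0x34, // ID (arbitrary)
        0x01, 0x00, // flags: standard query, recursion desired
        0x00, 0x01, // QDCOUNT = 1
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // ANCOUNT/NSCOUNT/ARCOUNT = 0
    ];
    for label in name.split('.') {
        buf.push(label.len() as u8);
        buf.extend_from_slice(label.as_bytes());
    }
    buf.push(0); // root label terminates the QNAME
    buf.extend_from_slice(&[0x00, 0x1c, 0x00, 0x01]); // QTYPE=AAAA, QCLASS=IN
    buf
}

/// One probe: distinguishes "timed out" from "answered with N records".
fn probe(server: &str, name: &str) -> std::io::Result<()> {
    let sock = UdpSocket::bind("[::]:0")?;
    sock.set_read_timeout(Some(Duration::from_secs(5)))?;
    sock.send_to(&build_query(name), server)?;
    let mut resp = [0u8; 512];
    match sock.recv_from(&mut resp) {
        // ANCOUNT is header bytes 6..8; zero from a reachable server means
        // "up, but serving no records".
        Ok((n, _)) if n >= 12 => {
            let ancount = u16::from_be_bytes([resp[6], resp[7]]);
            println!("{server}: {ancount} answer record(s)");
        }
        Ok(_) => println!("{server}: short response"),
        Err(e) => println!("{server}: no response ({e})"), // timeouts land here
    }
    Ok(())
}

fn main() {
    // Placeholder underlay addresses and name; not values from this issue.
    for server in ["[fd00::1]:53", "[fd00::2]:53", "[fd00::3]:53"] {
        if let Err(e) = probe(server, "nexus.control-plane.oxide.internal") {
            println!("{server}: probe error ({e})");
        }
    }
}
```

A zero ANCOUNT from a reachable server matches the never-synced new zone described below, while timeouts point at routing rather than the DNS server itself.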
Why can only sled 17 get to the new internal DNS zone's address?
sled-agent tells maghemite to advertise the zone's underlay prefix when starting an internal DNS zone (omicron/sled-agent/src/services.rs, lines 2283 to 2285 in cdf48c8), but there's no corresponding equivalent to tell maghemite to withdraw that advertisement when shutting down an internal DNS zone, AFAICT. (Off the top of my head I'm not entirely sure what that would look like, since we'd have to be careful not to withdraw a prefix if we ourselves had already started a different zone with that same prefix? I'm not sure whether that's a possible scenario.) Maybe we're seeing two sleds with that prefix because the sled that was running the original zone never withdrew it.
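To make the gap concrete, here's a minimal runnable sketch of that asymmetry; the client type and method names below are stand-ins invented for illustration, not maghemite's real admin API.

```rust
// Sketch of the asymmetry described above. DdmClientStub stands in for the
// real DDM admin client; these method names are hypothetical.
use std::net::Ipv6Addr;

struct DdmClientStub;

impl DdmClientStub {
    /// Announce an underlay prefix so other sleds can route to the zone.
    fn advertise_prefix(&self, prefix: Ipv6Addr) {
        println!("advertising {prefix}/64");
    }
    /// The missing half: retract the announcement when the zone goes away.
    fn withdraw_prefix(&self, prefix: Ipv6Addr) {
        println!("withdrawing {prefix}/64");
    }
}

fn main() {
    let ddm = DdmClientStub;
    // Example underlay prefix, not a value taken from this issue.
    let dns_prefix: Ipv6Addr = "fd00:1122:3344:2::".parse().unwrap();

    // Starting an internal DNS zone advertises its prefix...
    ddm.advertise_prefix(dns_prefix);

    // ...but zone shutdown has no corresponding call today, so the stale
    // route lingers, and a second sled reusing the prefix leaves peers with
    // two advertisements for the same /64. A fix would need something like:
    ddm.withdraw_prefix(dns_prefix);
}
```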
Things look even worse when we look at the records each DNS server is actually serving:
The DNS server on sled 14 (the new one) is the one returning no records:
and the one that should be expunged on sled 15 has records:
Presumably all three Nexus zones are only getting to sled 15? That would explain why sled 14 has no records, I think (it was just started and never got a sync from any Nexus). I'm not sure how to confirm that at the maghemite layer, but using "does the DNS server have records" as a proxy, it looks like that's right: all three Nexus instances get successful responses, so presumably they are talking to the should-be-expunged internal DNS zone on sled 15:
We met to go over this and believe we understand all of the issues above.
Given the set of more specific issues, I'm going to close this one. Thanks again @askfongjojo!
I just wanted to add a little more detail from my recollection. An individual internal DNS zone was expunged from a sled as part of a physical disk expungement, but as John mentioned, the system never withdrew the advertisement (#7377). Then the same internal DNS IP was put onto another sled. In terms of DNS propagation: the Nexus instances trying to propagate DNS to the new zone were either timing out (because the packets were being sent to the wrong place? I don't actually fully understand this because both zones were running) or succeeding incorrectly (because they were reaching the first zone). The result was that the newly-deployed zone wasn't getting the DNS config. This caused queries to fail with the error mentioned above. Some of those queries were happening in the "zone boot" code path, via opte_ports_needed() -- this is where #7381 comes in.
FWIW: they were succeeding incorrectly (i.e., they were successfully propagating records to the zone on the "expunged" sled and not talking to the newly-started zone at all).