re-assign sagas from expunged Nexus instances #6215
Conversation
I've successfully tested this on a4x2 by:
I'll do another run of this and post the full details.

In terms of automated tests: the underlying datastore function is tested, but there isn't really an end-to-end test here, and I can't figure out a great way to write one with the infrastructure we have today. I spent most of today trying to write a test within the ControlPlaneTestContext framework that would insert a saga owned by a different Nexus and then execute a blueprint that shows that Nexus as expunged. I couldn't get to the point where it could successfully execute the blueprint containing the expunged Nexus: the immediate problem is that the external networking step (the first step in blueprint execution) expects to find the expunged Nexus's external IP in the database, and that's after working around a bunch of other problems in gross ways. I'd welcome better ways to test this, especially since I did find a real bug during the manual testing, but I think the better investment is probably in automation that sets up a real environment (either on real hardware or a4x2) and provides primitives for operations like the above (e.g., set up the rack, add a sled, expunge a sled, make requests to the internal or external APIs, etc.).
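For concreteness, here is a rough sketch of the test shape described above. Everything inside the function body uses hypothetical helper names (none of these helpers exist in omicron today), and the second step is exactly where this approach currently gets stuck:

```rust
// Not working code -- just the shape of the test described above, with
// hypothetical helpers (`insert_unfinished_saga`, `make_blueprint_with_expunged_nexus`,
// `execute_blueprint`, and `assert_saga_owned_by_this_nexus` don't exist).
#[nexus_test]
async fn test_sagas_reassigned_from_expunged_nexus(
    cptestctx: &ControlPlaneTestContext,
) {
    // A saga row whose `current_sec` is some Nexus other than the one running
    // in this test.
    let other_nexus = SecId(Uuid::new_v4());
    insert_unfinished_saga(cptestctx, other_nexus).await;

    // A blueprint in which that Nexus zone is expunged, made the target and
    // executed.  This is where things currently fall over: the external
    // networking step expects the expunged Nexus's external IP to be in the
    // database.
    let blueprint =
        make_blueprint_with_expunged_nexus(cptestctx, other_nexus).await;
    execute_blueprint(cptestctx, &blueprint).await;

    // If execution succeeded, the saga should now be owned by this Nexus.
    assert_saga_owned_by_this_nexus(cptestctx).await;
}
```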
///
/// To reiterate, we are *not* considering the case where several SECs try
/// to update the same saga. That will be a future enhancement.
/// It's conceivable that multiple SECs do try to update the same saga
I removed some of the paranoia here because it's going to be too complicated to make it continue to work and it does seem unnecessary at this point.
Awesome! When I looked at this approximately 2 months ago I had the same exact thought! I was hoping it wouldn't be a bone of contention (pardon the pun).
Here's the testing on a4x2. I set up a4x2 from commit befc6c6 (on this PR branch) with THREE_NODE_RSS=true. Here's the initial blueprint:
Now let's add that fourth sled:
A few seconds later over on the new sled I see the new internal NTP zone:
To add a new Nexus zone, I'm going to use
A few seconds later, over on the new sled:
Great -- we have the new Nexus zone. Based on the blueprint diff output we can expect its Nexus internal URL is
Great. Now let's expunge that sled. First I'll power it off from its console:
then do the expungement. I'm overriding the warnings because I don't want to wait for an inventory collection. (It might also be necessary to specify a different Nexus URL explicitly, if DNS happens to pick the one we just powered off. It wasn't a problem for me.)
All of that is just the setup. The key is that having done this, we should now see that our demo saga moved over to the Nexus that first executed that latest blueprint, which is probably the one on which I just set it as the target. Let's look:
Great! Now we should be able to complete it on the new Nexus:
Success!
Looks great!
let now = chrono::Utc::now();
let conn = self.pool_connection_authorized(opctx).await?;

// It would be more robust to do this in batches. However, Diesel does
Looking at diesel side-eyed again 🙄
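For context on the truncated comment above: the reassignment boils down to one bulk UPDATE over the saga table. A rough Diesel sketch of what such a statement could look like, under assumed column and type names (the real code may differ, and it also bumps the adoption generation):

```rust
use async_bb8_diesel::AsyncRunQueryDsl;
use db::schema::saga::dsl;
use diesel::prelude::*;

// Sketch only: column names are taken from the visible fragments where
// possible and assumed otherwise.  Adopt every unfinished saga currently
// owned by any of the given (expunged) Nexus instances.
let count = diesel::update(dsl::saga)
    .filter(dsl::current_sec.is_not_null())
    .filter(dsl::current_sec.eq_any(
        nexus_zone_ids.iter().map(|id| Some(*id)).collect::<Vec<_>>(),
    ))
    .filter(dsl::saga_state.ne(db::saga_types::SagaCachedState(
        steno::SagaCachedState::Done,
    )))
    .set((dsl::current_sec.eq(Some(new_sec_id)), dsl::adopt_time.eq(now)))
    .execute_async(&*conn)
    .await?;
```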
pub async fn sagas_reassign_sec(
    &self,
    opctx: &OpContext,
    nexus_zone_ids: &[db::saga_types::SecId],
Why call them `nexus_zone_ids` and not `current_sec_ids` or similar?
Yeah, this was a little self-inconsistent. I've changed this in 87e5718 and updated the comment so it's less flip-flopping between Nexus ids and SEC ids.
/// activate the saga recovery background task after calling this function
/// to immediately resume the newly-assigned sagas.
///
/// **Warning:** This operation is only safe if the Nexus instances
Is it worth checking that the Nexus zone is expunged here to prevent an issue? On the one hand this seems like a good safety check. On the other hand, it seems like we may also end up assigning sagas to another Nexus of the same version during a rolling upgrade if we don't want to wait to drain them.
That would be nice but I think it's way beyond the scope of this layer. The datastore doesn't know anything about blueprints, expungement, etc. Its caller could, but it's so simple that I don't think an extra check would help:
https://github.com/oxidecomputer/omicron/pull/6215/files#diff-5a1edb7bfa049a98af5ba0e05a8c2455bb637d63d8e59ed738a8e78bd6342ed7R38-R51
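For illustration, here is roughly the shape of that caller. The blueprint filter, zone accessors, and the extra parameters to `sagas_reassign_sec` (whose signature is truncated above) are assumptions, not the actual API; the point is just that the set of SEC ids comes straight from the blueprint's expunged Nexus zones, so the datastore itself never needs to know about expungement:

```rust
// Illustrative only: filter/type names are approximations, and `my_sec_id`
// stands for this Nexus's own SEC id.
let nexus_zone_ids: Vec<SecId> = blueprint
    .all_omicron_zones(BlueprintZoneFilter::Expunged)
    .filter(|(_sled_id, zone)| zone.zone_type.is_nexus())
    .map(|(_sled_id, zone)| SecId(zone.id))
    .collect();

// Reassign unfinished sagas owned by any of those zones to ourselves, then
// activate the saga recovery background task so the adopted sagas resume
// promptly (as the doc comment above says to do).
datastore.sagas_reassign_sec(opctx, &nexus_zone_ids, my_sec_id).await?;
saga_recovery_task.activate();
```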
let params = steno::SagaCreateParams {
    id: steno::SagaId(Uuid::new_v4()),
    name: steno::SagaName::new("test saga"),
    dag: serde_json::value::Value::Null,
I never thought about how handy a `Null` dag would be :)
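For anyone following along, a hedged sketch of how such params can become a saga row owned by a different SEC (the constructor and datastore method names here are assumed, not confirmed from this diff):

```rust
// Because the DAG is just `Null`, the saga never has to be runnable -- it
// only needs to exist in the database so that reassignment has something to
// move.
let fake_sec = db::saga_types::SecId(Uuid::new_v4());
let saga = db::saga_types::Saga::new(fake_sec, params);
datastore.saga_create(&saga).await.expect("inserting test saga");
```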
}
println!("sagas to insert: {:?}", sagas_to_insert);

// These two sets are complements, but we write out the conditions to
This is great. It makes the test really clear beyond the description above!
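To make the property concrete, here is an illustrative (entirely hypothetical) version of the complementary-set check being described:

```rust
// Every inserted saga must land in exactly one of the two expected sets, and
// that set determines whether its SEC id should have changed.  Names here are
// hypothetical.
for saga in &all_sagas {
    let expected_moved = expected_reassigned.contains(&saga.id);
    let expected_stayed = expected_untouched.contains(&saga.id);
    assert!(
        expected_moved ^ expected_stayed,
        "saga {} must appear in exactly one expected set",
        saga.id
    );
}
```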
Fixes #5136.