-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a boot device field to Instance, forward along as appropriate #6585
Conversation
nexus/db-model/src/instance.rs
Outdated
@@ -63,6 +63,10 @@ pub struct Instance { | |||
#[diesel(column_name = auto_restart_policy)] | |||
pub auto_restart_policy: Option<InstanceAutoRestart>, | |||
|
|||
/// The primary boot device for this instance. | |||
#[diesel(column_name = boot_device)] | |||
pub boot_device: Option<String>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if this were just the field it is, it should be Name
, since it is .. the name of some other boot device. but on the Propolis side we'll actually accept something more flexible.
@ahl @askfongjojo @karencfv and @david-crespo i'd like your thoughts on how much we should carry that through: recording a BootOrder
struct which we can then extend in the future is certainly nice for internal purposes, but i'm less sure how to judge the complexity/extensibility tradeoff for the more user-facing API here.
if we have just "boot device" here it seems straightforward in all uses today, but allowing a future boot order list, or "first boot only"-style settings on boot entries, means we'd need a new field and presumably to allow only one of {boot_device, future_more_complex_list}
.
i'm inclined towards keeping just the single optional boot_device
string to avoid the extra layers in user-facing APIs, but i don't know if there's other context i be aware of.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Due to names being mutable, I'd prefer the ID of the boot device. For users who rely on automation to manage their infrastructure, and immutable identifier is always preferable. Like how we have project_id
as part of this struct :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm inclined towards keeping just the single optional boot_device string to avoid the extra layers in user-facing APIs, but i don't know if there's other context i be aware of.
I agree with this, it seems more future proof as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, ID would be much more typical. Often we allow something to be set by NameOrId, but we resolve it to an ID immediately and store it as an ID.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh! i wasn't sure that they were mutable. in that case, yes, the ID would be much better. it turns out the name is what gets sent out to Propolis, so i'd just stuck with that here, but perhaps Propolis should be getting IDs as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd definitely advocate for changing the Propolis API to take a UUID, if at all possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when we're creating the instance, we require that InstanceDiskAttach
specify a name, and not an ID. we probably ought to take NameOrId
there. otherwise allowing both here seems to be of limited utility.. the use cases afaict look like:
- if you're creating new disks along with the instance creation, you don't have IDs yet, so you couldn't provide an ID to
boot_disk
. simple enough, strong indicator we should at least take Name here. - if you're attaching existing disks, you must use their Name, so passing a disk ID here seems like net-more work than just reusing the name you'd have put in
disks[]
already. - if you're updating the
boot_disk
of a stopped instance, that seems like the most likely case for an ID on its own to be especially helpful.
i don't want to go change InstanceDiskAttach
here too (basically just because i don't know if that is laden with secret cans of worms), but if folks agree it seems like it would be nice to allow NameOrId sooner than later for disk attachment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we did intend to change that, but I never got around to wrapping it up after Justin left:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but if folks agree it seems like it would be nice to allow NameOrId sooner than later for disk attachment.
Definitely on board with this!
one more note: i do expect to land this with a mechanism to update the boot order on a stopped instance. after reading #6321 this morning it looks like there's consensus around this kind of reconfiguration being a PUT to |
3a65608
to
1b21005
Compare
additionally, add a general-purpose instance update endpoint to change the boot device for an existing instance, as well as tests that the constraints around what we can do with boot devices and the boot device field are checked.
if the disk is not in the same project, it will not be attached, so it will not be eligible to be the boot device.
also no longer need authz_project_disk
might return to this later, but as-is i've gotten something wrong with the query..
(and fix a few test references to now-boot_device_id)
"you gotta run cargo xtask openapi generate"
@ahl @david-crespo @karencfv 8b22d56 is the change to have this PR accept an from an API perspective, the difference would be something like: current PR:
vs with 8b22d56:
where since so the biggest wrinkle with @ahl mentions that most of the control plane requests come from other software we maintain (CLI, terraform, etc) so for the time being we can be pretty OK with breaking changes. from @karencfv it sounds that validating the Terraform provider at this point might be more work than we expected a breaking change to be. either way i'm inclined to pick the commit but i'm less confident about changing |
Thanks @iximeow for putting in all the effort to make this a much intuitive API to work with! I'm 100% on board with
It's much more intuitive this way. I'd assume that in this scenario the user would get an error if they set the same disk in Regarding:
If it's just changing the name of the field and none of the functionality, I think it wouldn't be as bad as I thought from the provider side. But there is the issue where we would break all of the terraform configuration files of the customers as they'd have to migrate their HCL config files to use |
perhaps we can think about it a bit more and hold the breaking change for a later release? |
I'm on the fence. I like the I'm concerned about it being annoying to customers for any existing scripts to break as soon as the rack comes back up with after the v11 upgrade. How do we minimize that pain? It feels weird to say this because we break APIs all the time, but I don't have a good feel for how that goes for customers. Do we just warn them hey, make sure you upgrade to the latest CLI and SDKs and Terraform (the release notes always say this at the top), and they do it and it's fine? I tried to game out whether we gain something by waiting until next release to break the API, but in most scenarios I could think of, we end up incurring the same pain later as now, so we might as well do it now. For example: if most instance creates in the real world use one disk, and it's the boot disk, in theory people can move that disk attachment over to |
Yeah, generally I list any breaking change as part of the changelog and release notes. I might have not phrased it correctly, but it wasn't so much saying that we shouldn't change it because it's a breaking change, but rather just point out that in this case it would be a breaking change the customers would see and have to take action.
Sorry, didn't mean to imply the pain would be less, but rather that perhaps we may want to give ourselves time to think if this is a breaking change we think is worth making or not. Mostly because in this case the customers would feel it and the terraform instance resource is one of the most used ones, and the release date is so soon. |
this makes me realize that since mixed bag, there.. also, @karencfv, re.
this was a behavior that surprised me as well - we actually accept duplicate entries in |
Hm, yeah, that'd wouldn't be great. Especially if an user forgot to upgrade their provider and is managing a large amount of instances 🤔
Ha! TIL. I guess if duplicates have been accepted all along then not a huge issue here 🤷♀️ . It'd probably be nice to have some sort of validation to make sure duplicates are not accepted but, that's probably for another time as it's not really up to this PR to fix. |
} | ||
} | ||
|
||
// if and when `Update` can update other fields, set them |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i believe this has been renamed?
// if and when `Update` can update other fields, set them | |
// if and when `Reconfigure` can update other fields, set them |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The internal mechanics of this all look great to me (and I've got a good sense for what I'd need to do to remake #6321 on the back of the update API you've added here). Thanks for putting all the time in on this one!
.await | ||
.map_err(ActionError::action_failed)?; | ||
|
||
// If there was a boot disk, clear it. If there was not a boot disk, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be too clever to try to look in the params
to see if a boot disk was specified? Saves us from having to do the extra database calls to unset something we never bothered to set.
I'm surprised we allow this, and additionally very surprised this works - I would have thought that the activations for multiply attached volumes would stomp all over each other (see https://rfd.shared.oxide.computer/rfd/0177#_activation for more on this). Was this tested with a real instance? |
not at the time, but at least through the control plane on my workstation it seems to work:
i'm quite certain if the duplicate entries were to create new disks it would fail for reasons like you indicate (if nothing else, duplicate names should prevent rows from being created, let alone getting to any real Crucibles). but for attachment, it looks to me like we just end up thinking the already-successful attach is unremarkable? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the huge effort to get this feature rolled out! The user facing side of the API looks great to me!
I'd like to test this before this gets merged if that's ok, I'll do it first thing tomorrow morning. |
Yep, sending a POST body with multiple of the same attach statement doesn't make its way to propolis, I'd be willing to bet the code you highlighted is what's responsible. So: surprising that Nexus accepts that POST body but we do the right thing. 🚢 |
Greg and Eliza were both right in comments on #6585, but since these are both fully internal I didn't want to add another CI round trip there :)
Greg and Eliza were both right in comments on #6585, but since these are both fully internal I didn't want to add another CI round trip there :)
This commit extends the `instance-reconfigure` API endpoint added in #6585 to also allow setting instance auto-restart policies (as added in #6503). I've also added the actual auto-restart policy to the external API instance view, along with the boolean `auto_restart_enabled` added in #6503. This way, it's possible to change just the boot disk by providing the current auto-restart policy in an instance POST.
This commit extends the `instance-reconfigure` API endpoint added in #6585 to also allow setting instance auto-restart policies (as added in #6503). I've also added the actual auto-restart policy to the external API instance view, along with the boolean `auto_restart_enabled` added in #6503. This way, it's possible to change just the boot disk by providing the current auto-restart policy in an instance POST.
in #6585 i'd unintentionally removed the fragment from `bhyve_api`'s source entry in Cargo.lock. it's not _wrong_, since it's still specifying the desired `11371b0...`, but without the fragment at the end of the source Cargo parses this as a [non-precise reference](https://github.com/rust-lang/cargo/blob/d9c14e664e994ea5cf28e49000231a4b2734d8e7/src/cargo/core/source_id.rs#L158-L165) and updates the git repo tracking this dependency every time it resolves dependdencies. so, put the fragment back and be very clear to Cargo that once it has `11371b0...` it does not need to do more git operations to try getting a newer version of the commit. (this makes a bit more sense if you imagine source urls like `git+https://github.com/foo.git?rev=main#1234`: in the case where `rev` _is_ a commit it's arguably a bug to fetch from remotes when you already have the commit..) Cargo generally handles this automatically, compare with [other version bumps](2c79661). i suspect i got the lockfile in this state by resolving a merge conflict incorrectly.
This commit extends the `instance-reconfigure` API endpoint added in #6585 to also allow setting instance auto-restart policies (as added in #6503). I've also added the actual auto-restart policy to the external API instance view, along with the boolean `auto_restart_enabled` added in #6503. This way, it's possible to change just the boot disk by providing the current auto-restart policy in an instance POST.
the Omicron side of adding explicit boot order selection to instances (counterpart to propolis#756).
first, this extends
params::InstanceCreate
to take a newboot_disk: Option<params::InstanceDiskAttachment>
.additionally, this adds a
PUT /v1/instances/{instance}
to update instances. the only property that can be updated at the moment isboot_disk
, pick a new boot disk or unset it entirely. this also partially subsumes #6321.finally, this updates Omicron to reference a recent enough Propolis that #756 is included.
a surprising discovery along the way: you can specify a disk to be attached multiple times in
disks
today, when creating an instance, and we're fine with it! this carries through with the newboot_disk
field: if you specify the disk asboot_disk
and indisks
, it is technically listing the disk for attachment twice but this will succeed.