add external API to resize a stopped instance #6321
Conversation
let updated = diesel::update(dsl::instance)
    .filter(dsl::time_deleted.is_null())
    .filter(dsl::id.eq(instance_id))
    .filter(dsl::state.eq_any(ALLOWED_RESIZE_STATES))
I was going to ask if we should also check for the `propolis_id` being null here, but I suppose we can rely on the check constraint that the instance can only be in `NoVmm` if it's null. Maybe worth commenting here?
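Something like the following could make that reliance explicit (a sketch layered on the query quoted above; the comment wording, and the assumption that the CHECK constraint ties `propolis_id` to the `NoVmm` state, should be verified against the schema):

```rust
let updated = diesel::update(dsl::instance)
    .filter(dsl::time_deleted.is_null())
    .filter(dsl::id.eq(instance_id))
    // No explicit `propolis_id IS NULL` filter here: the table's CHECK
    // constraint guarantees that an instance can only be in `NoVmm` (one
    // of `ALLOWED_RESIZE_STATES`) when it has no active Propolis, so
    // filtering on state is sufficient.
    .filter(dsl::state.eq_any(ALLOWED_RESIZE_STATES))
```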
        external_state
    )))
} else {
    Err(Error::internal_error("failed to resize instance"))
IIUC this case only happens if the instance's `time_deleted` is not null. Maybe this should also be a "not found" error, rather than 500ing? I think we generally treat deleted objects the same as nonexistent objects, right?
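A minimal sketch of that change, assuming the usual `Error::not_found_by_id` helper and the `instance_id` already in scope:

```rust
// Treat a soft-deleted instance the same as a nonexistent one instead of
// returning a 500. (`not_found_by_id` is assumed here to be the standard
// helper for building an ObjectNotFound error from an ID.)
Err(Error::not_found_by_id(ResourceType::Instance, &instance_id))
```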
    .unwrap()
    .parsed_body::<Instance>()
    .unwrap();
This should also check that, if we do get an instance back, it has the expected shape.
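For instance, a sketch of the shape check (field names follow the external `Instance` view; `response`, `new_ncpus`, and `new_memory` are placeholders for whatever the test set up):

```rust
// Assert the resize actually took effect, not just that the request
// succeeded and the body deserialized.
let instance = response.parsed_body::<Instance>().unwrap();
assert_eq!(instance.ncpus, new_ncpus);
assert_eq!(instance.memory, new_memory);
```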
        )
        .await?;

        self.db_datastore.instance_fetch_with_vmm(opctx, &authz_instance).await
This refetch is kind of sketchy (the instance's shape can change again between the resize query and the fetch). We may not actually need it: since the query is conditional on the instance being halted (i.e. being in `NoVmm`, or possibly in `Failed` once the failed-instance rework lands), there should be no VMM to influence the instance's state, which means we can construct the correct external instance state without having to consult the VMM table to get the active VMM's state. But that's itself kind of sketchy: the rule everywhere else is that you use an `InstanceAndActiveVmm` to get an external-API `Instance` to return, and I think this would be the first exception to that rule.
Another, simpler option would be to have this API return 204 No Content on success. (The caller knows what the updated instance's CPU and memory sizes are because it just set them, and nothing else about the instance is different.) OTOH several other instance routines (stop, reboot, etc.) return the posterior state of the instance, so maybe it's better to be consistent.
The 204 No Content approach makes less sense in view of the discussion below (where we've reached a consensus that this API should behave like other update APIs, which return the posterior states of the things being updated). I think it's probably reasonable to construct an `InstanceAndActiveVmm` in the relevant datastore routine and return it here (though this should have some explanatory comments, as @hawkw mentioned above).
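A sketch of how the datastore routine might return that without a refetch, assuming `updated_instance` is the row returned by the conditional update and that an `InstanceAndActiveVmm` can be built from an instance row paired with no VMM:

```rust
// The resize query only succeeds while the instance is halted (NoVmm),
// so there is no active VMM by construction; pair the updated row with
// `None` instead of refetching. (Assumes a From<(Instance, Option<Vmm>)>
// conversion exists for InstanceAndActiveVmm.)
Ok(InstanceAndActiveVmm::from((updated_instance, None)))
```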
Looking forward to this! It'll be a nice addition to the tf provider as well :)
@@ -2378,6 +2378,62 @@
      }
    }
  },
  "/v1/instances/{instance}/resize": {
Is there a specific reason the resize action should be separated from the other updateable fields (e.g. `name`) into its own endpoint?
Other endpoints generally follow a pattern for "update" where all updateable fields of a resource are gathered into a single endpoint (e.g. `VpcUpdate` and `ProjectUpdate`) for consistency. That pattern helps avoid endpoint explosion, keeping the number of endpoints small and easier for our users to grok.
Would it be possible to change this endpoint to follow that pattern?
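For illustration, a combined update body in the `VpcUpdate`/`ProjectUpdate` style might look roughly like this. It is purely hypothetical; the field set, names, and optionality are not settled here:

```rust
use omicron_common::api::external::{ByteCount, InstanceCpuCount, Name};
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};

/// Hypothetical single "update instance" body, mirroring the
/// VpcUpdate/ProjectUpdate pattern. Omitted fields are left unchanged.
#[derive(Clone, Debug, Deserialize, Serialize, JsonSchema)]
pub struct InstanceUpdate {
    pub name: Option<Name>,
    pub description: Option<String>,
    /// Resize fields: only honored while the instance is halted.
    pub ncpus: Option<InstanceCpuCount>,
    pub memory: Option<ByteCount>,
}
```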
There’s precedent here if you think of resize as an instance action like start or stop:
omicron/nexus/tests/output/nexus_tags.txt, lines 64 to 65 in 9d8078b:

instance_start    POST    /v1/instances/{instance}/start
instance_stop     POST    /v1/instances/{instance}/stop
I think it is not necessarily surprising. We don’t have an instance update endpoint, so nobody can be confused about what goes in there vs. these dedicated ones.
I think we could have a single instance-update endpoint, but we'd have to decide what to do about the distinction between fields that you can theoretically update on a running instance (name, description) vs. the ones that require the instance to be halted (CPUs, memory, hostname, public keys, user data, etc.). My initial inclination would be to say that all updates (even the "safe" ones) require the instance to be halted, and if it turns out that that's a huge headache for users, we can pivot to allowing updates on running instances if they only touch safe-when-running fields. (This is more complicated to do on the backend, but I don't think it's intractable.) WDYT?
Perhaps those two endpoints should be rolled up into a single update endpoint like the others? 🤔
It may be confusing for users to have some resources with endpoints that take an "update object" and others that have several endpoints to update different parts of a resource. It'd be nice to have some sort of consistency when possible. What do you think?
Sorry @gjcolombo! Looks like we were typing at the same time and my answer above looks a bit weird now 😄
> I think we could have a single instance-update endpoint, but we'd have to decide what to do about the distinction between fields that you can theoretically update on a running instance (name, description) vs. the ones that require the instance to be halted (CPUs, memory, hostname, public keys, user data, etc.). My initial inclination would be to say that all updates (even the "safe" ones) require the instance to be halted, and if it turns out that that's a huge headache for users, we can pivot to allowing updates on running instances if they only touch safe-when-running fields. (This is more complicated to do on the backend, but I don't think it's intractable.) WDYT?
Yeah! I think this is a good compromise. I don't see users changing the name or description of an instance often anyway
I could see that working for the fields Greg lists, but not for start and stop and reboot. The problem with the latter is that they take time to complete asynchronously, with intermediate states (`starting`, `stopping`) before the final state, so the PUT can’t return you the resource in a state matching what you sent it (`running`, for example), which is how PUTs are supposed to work.
Regarding requiring the instance to be stopped for all updates, whether the field really needs it or not, I agree that’s fine and easier to understand as a user.
> I could see that working for the fields Greg lists, but not for start and stop and reboot. The problem with the latter is that they take time to complete asynchronously, with intermediate states (starting, stopping) before the final state, so the PUT can’t return you the resource in a state matching what you sent it (running, for example), which is how PUTs are supposed to work.
That makes sense, I'm OK with keeping start/stop separate then
… the Omicron side of adding explicit boot order selection to instances (counterpart to [propolis#756](oxidecomputer/propolis#756)). first, this extends `params::InstanceCreate` to take a new `boot_disk: Option<params::InstanceDiskAttachment>`. additionally, this adds a `PUT /v1/instances/{instance}` to update instances. the only property that can be updated at the moment is `boot_disk`: pick a new boot disk or unset it entirely. this also partially subsumes #6321. finally, this updates Omicron to reference a recent enough Propolis that #756 is included. a surprising discovery along the way: you can specify a disk to be attached multiple times in `disks` today, when creating an instance, and we're fine with it! this carries through with the new `boot_disk` field: if you specify the disk as `boot_disk` and in `disks`, it is technically listed for attachment twice, but this will succeed.
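For orientation, the update body described above is roughly sketched below. This is a hedged stand-in only: the struct is deliberately given a different name, `NameOrId` is used as a placeholder for the disk reference, and the real definitions live in #6774.

```rust
use omicron_common::api::external::NameOrId;
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};

/// Rough stand-in for the instance-update body described above (named
/// differently on purpose; see #6774 for the actual `params` types, which
/// use `InstanceDiskAttachment` on create).
#[derive(Clone, Debug, Deserialize, Serialize, JsonSchema)]
pub struct InstanceUpdateSketch {
    /// New boot disk for the instance, or `None` to unset it entirely.
    pub boot_disk: Option<NameOrId>,
}
```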
gonna close this in favor of the definitely-a-derived-work #6774. a lot of the conversation here helped influence how the instance update endpoint behaves, but otherwise i think i've carried over or resolved open items here in the process.
Add an API to change the CPU/memory sizes for an existing instance (provided it's halted--no CPU hot-add or hot-remove for now, sorry).
Putting this in draft for now since I still want to manually test it (and I'd also like to refine the integration tests a bit).
Fixes #3769.
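For reference, the request body for `POST /v1/instances/{instance}/resize` is essentially a pair of sizes. A hypothetical sketch of it (the real field names and types are in this PR's `params` changes, not here):

```rust
use omicron_common::api::external::{ByteCount, InstanceCpuCount};
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};

/// Hypothetical resize request body; only honored while the instance is
/// halted, matching the behavior described above.
#[derive(Clone, Debug, Deserialize, Serialize, JsonSchema)]
pub struct InstanceResize {
    /// New vCPU count for the instance.
    pub ncpus: InstanceCpuCount,
    /// New memory size for the instance.
    pub memory: ByteCount,
}
```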