add external API to resize a stopped instance #6321
Conversation
let updated = diesel::update(dsl::instance)
    .filter(dsl::time_deleted.is_null())
    .filter(dsl::id.eq(instance_id))
    .filter(dsl::state.eq_any(ALLOWED_RESIZE_STATES))
I was going to ask if we should also check for the `propolis_id` being null here, but I suppose we can rely on the check constraint that the instance can only be in `NoVmm` if it's null. Maybe worth commenting here?
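Something like the following could make that reliance explicit (a sketch layered on the query quoted above; the comment wording, and the assumption that the CHECK constraint ties `propolis_id` to the `NoVmm` state, should be verified against the schema):

```rust
let updated = diesel::update(dsl::instance)
    .filter(dsl::time_deleted.is_null())
    .filter(dsl::id.eq(instance_id))
    // No explicit `propolis_id IS NULL` filter here: the table's CHECK
    // constraint guarantees that an instance can only be in `NoVmm` (one
    // of `ALLOWED_RESIZE_STATES`) when it has no active Propolis, so
    // filtering on state is sufficient.
    .filter(dsl::state.eq_any(ALLOWED_RESIZE_STATES))
```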
        external_state
    )))
} else {
    Err(Error::internal_error("failed to resize instance"))
IIUC this case only happens if the instance's `time_deleted` is not null. Maybe this should also be a "not found" error, rather than 500ing? I think we generally treat deleted objects the same as nonexistent objects, right?
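A minimal sketch of that change, assuming the usual `Error::not_found_by_id` helper and the `instance_id` already in scope:

```rust
// Treat a soft-deleted instance the same as a nonexistent one instead of
// returning a 500. (`not_found_by_id` is assumed here to be the standard
// helper for building an ObjectNotFound error from an ID.)
Err(Error::not_found_by_id(ResourceType::Instance, &instance_id))
```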
    .unwrap()
    .parsed_body::<Instance>()
    .unwrap();
This should also check that, if we do get an instance back, it has the expected shape.
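For instance, a sketch of the shape check (field names follow the external `Instance` view; `response`, `new_ncpus`, and `new_memory` are placeholders for whatever the test set up):

```rust
// Assert the resize actually took effect, not just that the request
// succeeded and the body deserialized.
let instance = response.parsed_body::<Instance>().unwrap();
assert_eq!(instance.ncpus, new_ncpus);
assert_eq!(instance.memory, new_memory);
```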
        )
        .await?;

        self.db_datastore.instance_fetch_with_vmm(opctx, &authz_instance).await
This refetch is kind of sketchy (the instance's shape can change again between the resize query and the fetch). We may not actually need it: since the query is conditional on the instance being halted (i.e. being in `NoVmm`, or possibly in `Failed` once the failed-instance rework lands), there should be no VMM to influence the instance's state, which means we can construct the correct external instance state without having to consult the VMM table to get the active VMM's state. But that's itself kind of sketchy: the rule everywhere else is that you use an `InstanceAndActiveVmm` to get an external-API `Instance` to return, and I think this would be the first exception to that rule.
Another, simpler option would be to have this API return 204 No Content on success. (The caller knows what the updated instance's CPU and memory sizes are because it just set them, and nothing else about the instance is different.) OTOH several other instance routines (stop, reboot, etc.) return the posterior state of the instance, so maybe it's better to be consistent.
The 204 No Content approach makes less sense in view of the discussion below (where we've reached a consensus that this API should behave like other update APIs, which return the posterior states of the things being updated). I think it's probably reasonable to construct an `InstanceAndActiveVmm` in the relevant datastore routine and return it here (though this should have some explanatory comments, as @hawkw mentioned above).
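A sketch of how the datastore routine might return that without a refetch, assuming `updated_instance` is the row returned by the conditional update and that an `InstanceAndActiveVmm` can be built from an instance row paired with no VMM:

```rust
// The resize query only succeeds while the instance is halted (NoVmm),
// so there is no active VMM by construction; pair the updated row with
// `None` instead of refetching. (Assumes a From<(Instance, Option<Vmm>)>
// conversion exists for InstanceAndActiveVmm.)
Ok(InstanceAndActiveVmm::from((updated_instance, None)))
```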
Looking forward to this! It'll be a nice addition to the tf provider as well :)
@@ -2378,6 +2378,62 @@
      }
    }
  },
  "/v1/instances/{instance}/resize": {
Is there a specific reason the resize action should be separated from the other updateable fields (e.g. `name`) into its own endpoint?
Other endpoints generally follow a pattern for "update" where all updateable fields of a resource are gathered into a single endpoint (e.g. `VpcUpdate` and `ProjectUpdate`) for consistency. That pattern helps avoid endpoint explosion, keeping the number of endpoints small and easier for our users to grok.
Would it be possible to change this endpoint to follow that pattern?
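For illustration, a combined update body in the `VpcUpdate`/`ProjectUpdate` style might look roughly like this. It is purely hypothetical; the field set, names, and optionality are not settled here:

```rust
use omicron_common::api::external::{ByteCount, InstanceCpuCount, Name};
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};

/// Hypothetical single "update instance" body, mirroring the
/// VpcUpdate/ProjectUpdate pattern. Omitted fields are left unchanged.
#[derive(Clone, Debug, Deserialize, Serialize, JsonSchema)]
pub struct InstanceUpdate {
    pub name: Option<Name>,
    pub description: Option<String>,
    /// Resize fields: only honored while the instance is halted.
    pub ncpus: Option<InstanceCpuCount>,
    pub memory: Option<ByteCount>,
}
```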
There’s precedent here if you think of resize as an instance action like start or stop:
omicron/nexus/tests/output/nexus_tags.txt, lines 64 to 65 in 9d8078b:

instance_start    POST    /v1/instances/{instance}/start
instance_stop     POST    /v1/instances/{instance}/stop
I think it is not necessarily surprising. We don’t have an instance update endpoint, so nobody can be confused about what goes in there vs. these dedicated ones.
I think we could have a single instance-update endpoint, but we'd have to decide what to do about the distinction between fields that you can theoretically update on a running instance (name, description) vs. the ones that require the instance to be halted (CPUs, memory, hostname, public keys, user data, etc.). My initial inclination would be to say that all updates (even the "safe" ones) require the instance to be halted, and if it turns out that that's a huge headache for users, we can pivot to allowing updates on running instances if they only touch safe-when-running fields. (This is more complicated to do on the backend, but I don't think it's intractable.) WDYT?
Perhaps those two endpoints should be rolled up into a single update endpoint like the others? 🤔
It may be confusing for users to have some resources with endpoints that take an "update object" and others that have several endpoints to update different parts of a resource. It'd be nice to have some sort of consistency when possible. What do you think?
Sorry @gjcolombo! Looks like we were typing at the same time and my answer above looks a bit weird now 😄
> I think we could have a single instance-update endpoint, but we'd have to decide what to do about the distinction between fields that you can theoretically update on a running instance (name, description) vs. the ones that require the instance to be halted (CPUs, memory, hostname, public keys, user data, etc.). My initial inclination would be to say that all updates (even the "safe" ones) require the instance to be halted, and if it turns out that that's a huge headache for users, we can pivot to allowing updates on running instances if they only touch safe-when-running fields. (This is more complicated to do on the backend, but I don't think it's intractable.) WDYT?
Yeah! I think this is a good compromise. I don't see users changing the name or description of an instance often anyway
I could see that working for the fields Greg lists, but not for start and stop and reboot. The problem with the latter is that they take time to complete asynchronously, with intermediate states (`starting`, `stopping`) before the final state, so the PUT can’t return you the resource in a state matching what you sent it (`running`, for example), which is how PUTs are supposed to work.
Regarding requiring the instance to be stopped for all updates, whether the field really needs it or not, I agree that’s fine and easier to understand as a user.
> I could see that working for the fields Greg lists, but not for start and stop and reboot. The problem with the latter is that they take time to complete asynchronously, with intermediate states (starting, stopping) before the final state, so the PUT can’t return you the resource in a state matching what you sent it (running, for example), which is how PUTs are supposed to work.
That makes sense, I'm OK with keeping start/stop separate then
… the Omicron side of adding explicit boot order selection to instances (counterpart to [propolis#756](oxidecomputer/propolis#756)). first, this extends `params::InstanceCreate` to take a new `boot_disk: Option<params::InstanceDiskAttachment>`. additionally, this adds a `PUT /v1/instances/{instance}` to update instances. the only property that can be updated at the moment is `boot_disk`: pick a new boot disk or unset it entirely. this also partially subsumes #6321. finally, this updates Omicron to reference a recent enough Propolis that #756 is included. a surprising discovery along the way: you can specify a disk to be attached multiple times in `disks` today, when creating an instance, and we're fine with it! this carries through with the new `boot_disk` field: if you specify the disk as `boot_disk` and in `disks`, it is technically listed for attachment twice, but this will succeed.
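For orientation, the update body described above is roughly sketched below. This is a hedged stand-in only: the struct is deliberately given a different name, `NameOrId` is used as a placeholder for the disk reference, and the real definitions live in #6774.

```rust
use omicron_common::api::external::NameOrId;
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};

/// Rough stand-in for the instance-update body described above (named
/// differently on purpose; see #6774 for the actual `params` types, which
/// use `InstanceDiskAttachment` on create).
#[derive(Clone, Debug, Deserialize, Serialize, JsonSchema)]
pub struct InstanceUpdateSketch {
    /// New boot disk for the instance, or `None` to unset it entirely.
    pub boot_disk: Option<NameOrId>,
}
```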
gonna close this in favor of the definitely-a-derived-work #6774. a lot of the conversation here helped influence how the instance update endpoint behaves, but otherwise i think i've carried over or resolved open items here in the process.
Add an API to change the CPU/memory sizes for an existing instance (provided it's halted--no CPU hot-add or hot-remove for now, sorry).
Putting this in draft for now since I still want to manually test it (and I'd also like to refine the integration tests a bit).
Fixes #3769.
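For reference, the request body for `POST /v1/instances/{instance}/resize` is essentially a pair of sizes. A hypothetical sketch of it (the real field names and types are in this PR's `params` changes, not here):

```rust
use omicron_common::api::external::{ByteCount, InstanceCpuCount};
use schemars::JsonSchema;
use serde::{Deserialize, Serialize};

/// Hypothetical resize request body; only honored while the instance is
/// halted, matching the behavior described above.
#[derive(Clone, Debug, Deserialize, Serialize, JsonSchema)]
pub struct InstanceResize {
    /// New vCPU count for the instance.
    pub ncpus: InstanceCpuCount,
    /// New memory size for the instance.
    pub memory: ByteCount,
}
```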