
[sled-agent] PUT zones doesn't create datasets #7006

Open
wants to merge 29 commits into main

Conversation

@smklein (Collaborator) commented Nov 7, 2024

Calls to PUT zones now check that datasets exist -- as they should already have been initialized by prior calls -- rather than forcefully creating them.

This PR checks that datasets exist in the "configuration", rather than querying inventory.

Fixes #6991
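As a sketch of the up-front check (names here are illustrative; the real sled-agent types and error enum differ), a zone request can be rejected immediately when any dataset it needs is absent from the ledgered configuration:

```rust
use std::collections::BTreeSet;

// Illustrative error type; the real sled-agent error enum differs.
#[derive(Debug, PartialEq)]
enum Error {
    MissingDatasets { datasets: BTreeSet<String> },
}

// Reject a zone request up front if any dataset it needs is not
// already present in the dataset configuration.
fn check_datasets_exist(
    requested: &BTreeSet<String>,
    configured: &BTreeSet<String>,
) -> Result<(), Error> {
    let missing: BTreeSet<String> =
        requested.difference(configured).cloned().collect();
    if missing.is_empty() {
        Ok(())
    } else {
        Err(Error::MissingDatasets { datasets: missing })
    }
}

fn main() {
    let configured: BTreeSet<String> =
        ["oxp_a/crypt/zone".to_string()].into_iter().collect();

    // A zone whose dataset is configured passes the check...
    let ok: BTreeSet<String> =
        ["oxp_a/crypt/zone".to_string()].into_iter().collect();
    assert!(check_datasets_exist(&ok, &configured).is_ok());

    // ...while a missing dataset rejects the whole request.
    let bad: BTreeSet<String> =
        ["oxp_b/crypt/zone".to_string()].into_iter().collect();
    assert!(check_datasets_exist(&bad, &configured).is_err());
}
```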

.storage
.upsert_filesystem(dataset_id, dataset_name)
.await?;
}
Contributor

#6991 notes

(Or, at least, replacing it with a check that these datasets exist)

I think we should definitely do that; this should immediately reject the request if it knows up front that not all the zones can be started due to missing datasets.

Collaborator Author

Yeah, I'm on board with that, mostly just wanted to get a quick PR up to see if this would pass RSS and bring the system up, as it should.

Happy to make that change before getting this out of draft.

Collaborator Author

Done in 2b73938

@smklein smklein marked this pull request as ready for review November 8, 2024 22:04
Base automatically changed from rss-datasets to main November 12, 2024 18:23
@@ -5039,6 +5101,18 @@ mod illumos_tests {
.await
.expect("Failed to ensure disks");
assert!(!result.has_error(), "{:?}", result);
let result = harness
Collaborator Author

This is necessary now because we want to read the dataset ledger for our service tests, even if it's empty.

@@ -914,7 +915,11 @@ impl StorageManager {
let result = self.datasets_ensure_internal(&log, &config).await;

let ledger_data = ledger.data_mut();
if *ledger_data == config {
if had_old_ledger && *ledger_data == config {
Collaborator Author

This change is only tangentially related to this PR, but it's necessary for some tests.

Because we're now reading the ledger of datasets when we provision zones, tests that had never created a dataset ledger were failing.

To mitigate: when we create the "empty dataset ledger", we still want to write it to our simulated M.2s.

However, as an optimization, the old version of this code skipped the re-write to the M.2 if nothing had changed. Unfortunately, if there was no prior ledger, we initialized the ledger data with a default value, which meant it wasn't actually possible to write the default value of the ledger to storage.

(This hadn't been an issue before, because our code typically has datasets it cares about inserting when this method gets called.)

Anyway, I'm now skipping the ledger re-write if and only if (1) there was an old ledger, and (2) it hasn't changed.
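The failure mode above can be sketched with a toy ledger (types and names are illustrative, not the real omicron definitions):

```rust
// Toy model of the ledger-commit bug.
#[derive(Default, Clone, PartialEq, Debug)]
struct DatasetsConfig {
    datasets: Vec<String>,
}

// Old logic: commit only when contents differ. With no prior ledger,
// `ledger_data` starts as Default::default(), which compares equal to
// an empty requested config, so a brand-new empty ledger was never
// written to disk.
fn should_commit_old(ledger_data: &DatasetsConfig, config: &DatasetsConfig) -> bool {
    ledger_data != config
}

// New logic: also commit whenever there was no old ledger.
fn should_commit_new(
    had_old_ledger: bool,
    ledger_data: &DatasetsConfig,
    config: &DatasetsConfig,
) -> bool {
    !had_old_ledger || ledger_data != config
}

fn main() {
    let empty = DatasetsConfig::default();
    // First-ever, empty ledger: old logic skips the write...
    assert!(!should_commit_old(&empty, &empty));
    // ...while the new logic commits it.
    assert!(should_commit_new(false, &empty, &empty));
    // An unchanged, pre-existing ledger is still not re-written.
    assert!(!should_commit_new(true, &empty, &empty));
}
```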

Contributor

I think this is probably obscure enough it warrants a (short) comment, and might be worth inverting the boolean to make it easier to understand? Something like

// Commit the requested ledger if we don't have one on disk or the
// one on disk is now out of date.
if !had_old_ledger || *ledger_data != config {
    *ledger_data = config;
    ledger.commit().await?;
}

Ok(result)


@@ -1181,7 +1187,7 @@ impl StorageManager {
self.omicron_physical_disks_ensure_internal(&log, &config).await?;

let ledger_data = ledger.data_mut();
if *ledger_data == config {
if had_old_ledger && *ledger_data == config {
Contributor

Same note about commenting and maybe inverting the boolean


// These datasets are configured to be part of the control plane.
let datasets_config = self.inner.storage.datasets_config_list().await?;
let existing_datasets: HashSet<_> = datasets_config
Contributor

Tiny nit - maybe BTreeSet here so if a client sends the same request multiple times, the error will order the missing datasets the same way each time? (Might require line 3490 to be a BTreeSet too, which seems fine.)

Member

Ah, I commented on the same thing but on line 3490, haha. :)

}

if let Some(pool) = zone.filesystem_pool.as_ref() {
requested_datasets.insert(DatasetName::new(
Contributor

Should this be a helper on zone? I'm wondering if in the future we'll store more than just the zpool in zone, at which point it would be nice to know clients of zone aren't constructing more detailed dataset info based on just the zpool.
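One possible shape for such a helper (hypothetical types and names, not the real omicron definitions):

```rust
// Hypothetical sketch of moving the zpool -> dataset derivation onto
// the zone config itself; all names here are illustrative.
#[derive(Debug, Clone, PartialEq)]
struct DatasetName {
    pool: String,
    kind: String,
}

struct ZoneConfig {
    filesystem_pool: Option<String>,
}

impl ZoneConfig {
    // Callers no longer construct dataset info from the zpool
    // themselves, so if zones later carry more than just the pool,
    // the mapping only has to change in this one place.
    fn filesystem_dataset(&self) -> Option<DatasetName> {
        self.filesystem_pool.as_ref().map(|pool| DatasetName {
            pool: pool.clone(),
            kind: "transient_zone_root".to_string(),
        })
    }
}

fn main() {
    let zone = ZoneConfig { filesystem_pool: Some("oxp_a".to_string()) };
    let ds = zone.filesystem_dataset().unwrap();
    assert_eq!(ds.pool, "oxp_a");

    let no_pool = ZoneConfig { filesystem_pool: None };
    assert!(no_pool.filesystem_dataset().is_none());
}
```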

Storage(#[from] sled_storage::error::Error),

#[error("Missing datasets: {datasets:?}")]
MissingDatasets { datasets: HashSet<DatasetName> },
Member

Since the contents of the set will be printed out (e.g. in logs), perhaps it should be a BTreeSet rather than a HashSet, for consistent ordering? If I see a logged error like "missing datasets: {a, b, c}", it could be desirable to grep for other occurrences of the same missing-datasets error, which is difficult when the iteration order isn't stable. Also, if the names are always alphabetical, it's easier to write grep invocations or other regexes that match any missing-datasets error containing at least some prefix/suffix of datasets.

Probably not a _huge_ deal, but I figured I'd mention it.
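The determinism argument is easy to demonstrate: BTreeSet iterates in sorted (Ord) order, so formatting the same set always yields the same string, whereas HashSet's iteration order depends on a per-process random hash seed. The dataset names below are made up for illustration:

```rust
use std::collections::BTreeSet;

fn main() {
    // Inserted out of order on purpose.
    let datasets: BTreeSet<&str> = [
        "oxp_c/crypt/zone",
        "oxp_a/crypt/zone",
        "oxp_b/crypt/zone",
    ]
    .into_iter()
    .collect();

    // Debug output is alphabetical and stable across runs, so the
    // same missing-datasets error always logs the same string.
    let msg = format!("Missing datasets: {datasets:?}");
    assert_eq!(
        msg,
        "Missing datasets: {\"oxp_a/crypt/zone\", \"oxp_b/crypt/zone\", \"oxp_c/crypt/zone\"}"
    );
    println!("{msg}");
}
```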



Successfully merging this pull request may close these issues.

PUT /omicron-zone -- stop provisioning datasets!