report better error messages after Diesel queries #907

davepacheco · 2022-04-13T04:27:43Z

This change is a little gross, but it greatly improves error messages that come from Diesel. When we get an unexpected error, it used to look like this:

{"internal_message": "Unknown diesel error: DeserializationError(Error(Build(Error { expected: 16, found: 20 })))"}

Now it looks like this:

{"internal_message": "Unknown diesel error looking up Instance ById(9e41893f-74bf-4db9-855c-91acbefd71c8): invalid bytes length: expected 16, found 20"}

It's filled in (1) that we're doing a lookup, (2) the kind of resource, (3) how we're doing the lookup (by this specific id), and (4) it's stringified the Diesel error rather than showing the debug output.

I got here because I was looking more closely at a problem @bnaecker mentioned in chat the other day. Say you apply this change to schema.rs, which looks innocuous:

diff --git a/nexus/src/db/schema.rs b/nexus/src/db/schema.rs
index 92395ada..7fccc158 100644
--- a/nexus/src/db/schema.rs
+++ b/nexus/src/db/schema.rs
@@ -72,8 +72,8 @@ table! {
         time_state_updated -> Timestamptz,
         state_generation -> Int8,
         active_server_id -> Uuid,
-        active_propolis_id -> Uuid,
         active_propolis_ip -> Nullable<Inet>,
+        active_propolis_id -> Uuid,
         target_propolis_id -> Nullable<Uuid>,
         migration_id -> Nullable<Uuid>,
         ncpus -> Int8,

A whole bunch of integration tests seem to hang -- at least, a dozen or so were still running after several minutes. If you run a test individually, it fails due to oxidecomputer/steno#26:

$ cargo test -p omicron-nexus --test=test_all -- --nocapture integration_tests::instances::test_attach_one_disk_to_instance
   Compiling omicron-nexus v0.1.0 (/home/dap/omicron/nexus)
   Compiling nexus-test-utils v0.1.0 (/home/dap/omicron/nexus/test-utils)
    Finished test [unoptimized + debuginfo] target(s) in 1m 08s
     Running tests/test_all.rs (target/debug/deps/test_all-9208224595adfa34)

running 1 test
log file: "/dangerzone/omicron_tmp/test_all-9208224595adfa34-test_attach_one_disk_to_instance.16911.0.log"
note: configured to log to "/dangerzone/omicron_tmp/test_all-9208224595adfa34-test_attach_one_disk_to_instance.16911.0.log"
thread 'integration_tests::instances::test_attach_one_disk_to_instance' panicked at 'called `Result::unwrap()` on an `Err` value: action failed', /home/dap/.cargo/git/checkouts/steno-adfbaf4765a4f9e1/578498b/src/saga_exec.rs:1014:27
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

and then hangs there. The only clue Ben and I found was in the log:

[2022-04-13T04:19:38.259670603Z] DEBUG: 23f21ed5-c962-4874-8e94-f034b4f0d019/ServerContext/16911 on ivanova: recording saga event (node_id=24, saga_id=47114f13-f67d-4f48-940a-f381b8d59334)
    event_type: Failed(ActionFailed { source_error: Object({"InternalError": Object({"internal_message": String("Unknown diesel error looking up Instance ById(9e41893f-74bf-4db9-855c-91acbefd71c8): invalid bytes length: expected 16, found 20")})}) })

Note that that's the error message after the changes in this PR. Before those changes, it looks like this:

[2022-04-13T04:23:18.602154763Z] DEBUG: 6cb8bbfc-9198-4011-988d-ae22c99333e0/ServerContext/16955 on ivanova: recording saga event (node_id=24, saga_id=bca37bef-3f1b-4a86-89d3-aff3b44de4c7)
    event_type: Failed(ActionFailed { source_error: Object({"InternalError": Object({"internal_message": String("Unknown diesel error: DeserializationError(Error(Build(Error { expected: 16, found: 20 })))")})}) })

As I mentioned, this change is kind of ugly. It's using the "NotFound" error that we would generate if the row wasn't found to figure out what we were trying to do. Ugh.

I'm thinking of adding something to the ApiResourceError trait that returns a (ResourceType, LookupType) tuple instead. I think this is easy to do, since each authz type knows its resource type and stores its lookup_type. This way we don't have to infer from the ObjectNotFound (and write code for the impossible case that it's not an ObjectNotFound error.

Another thing I considered: can the lookup API itself produce a better message here? Maybe? But it'd be kind of weird because it uses public_error_from_diesel_pool(), which has all this existing logic for picking apart what happened. Then we'd be taking the result of that, picking that apart in a similar way, and if it's an InternalError, we'd be augmenting it with more context. (A useful helper here might be some facility for taking a public Error and if it's an InternalError, augmenting the message with more context. Similar to anyhow's context() function. Something like: do_something_that_might_fail.with_internal_context(|| format!("looking up thing with id {}", id)). This would modify the error only if it was an Error::InternalError. Maybe this is worth looking into?)

bnaecker · 2022-04-13T16:54:50Z

nexus/src/db/error.rs

+            "Unknown diesel error creating {:?} called {:?}: {:#}",
+            resource_type, object_name, error


Did these lines end up being useful? I think the change is probably good regardless, so I'm just curious if you ran into another opaque error message that this helped resolve.

No, I didn't wind up using these.

bnaecker · 2022-04-13T16:58:07Z

nexus/src/db/error.rs

+        error => {
+            let context = match make_not_found_error() {
+                PublicError::ObjectNotFound { type_name, lookup_type } => {
+                    format!("looking up {:?} {:?}", type_name, lookup_type)


Maybe a nit, but is it true that this error is only hit when "looking up" a resource? I think your suggestion of a trait that provides an additional context method is really attractive, since it would allow us to provide this string or some other information from the caller.

I guess it depends on what you mean by "lookup". I believe it's always used in a context where we expect a specific object to exist and it doesn't. Sometimes that's because we're trying to update it or delete it, not fetch it. I could make the language more general ("accessing object") to avoid confusion.

bnaecker

Looks like a qualitative improvement to me, thanks for digging into it! I agree that some trait for providing context would be nice. I'm also curious if we could just attach entire stack traces or something similar to the internal error messages. Those will only show up in logs, so the extra information would probably not be overkill. It might be too noisy though, I'm not sure.

In any case, +1 from me on this change.

davepacheco · 2022-04-13T18:36:53Z

I wound up making a pretty small change that eliminates the grossness. There's still room for improvement in terms of the context carried around by errors!

Crucible: update rust crate base64 to 0.21.3 (#913) update rust crate slog-async to 2.8 (#915) update rust crate async-recursion to 1.0.5 (#912) Move active jobs into a separate data structure and optimize `ackable_work` (#908) Check repair IDs correctly (#910) update actions/checkout action to v4 (#903) update rust crate tokio to 1.32 (#890) Remove single-item Vecs (#898) Move "extent under repair" into a helper function (#907) offset_mod shouldn't be randomized (#905) Only rehash if a write may have failed (#899) Make negotiation state an enum (#901) Test update for fast write ack and gather errors on test failure (#897) Propolis: Update cpuid-gen util for better coverage Make storage backend config more flexible and consistent Use correct register sizes for PIIX3 PM device Update bitflags dependency fix softnpu port order (#517) Use hex formatting for unhandled MMIO/PIO/MSRs Update deps for new crucible and oximeter Update standalone-with-crucible docs (#514)

Crucible: update rust crate base64 to 0.21.3 (#913) update rust crate slog-async to 2.8 (#915) update rust crate async-recursion to 1.0.5 (#912) Move active jobs into a separate data structure and optimize `ackable_work` (#908) Check repair IDs correctly (#910) update actions/checkout action to v4 (#903) update rust crate tokio to 1.32 (#890) Remove single-item Vecs (#898) Move "extent under repair" into a helper function (#907) offset_mod shouldn't be randomized (#905) Only rehash if a write may have failed (#899) Make negotiation state an enum (#901) Test update for fast write ack and gather errors on test failure (#897) Propolis: Update cpuid-gen util for better coverage Make storage backend config more flexible and consistent Use correct register sizes for PIIX3 PM device Update bitflags dependency fix softnpu port order (#517) Use hex formatting for unhandled MMIO/PIO/MSRs Update deps for new crucible and oximeter Update standalone-with-crucible docs (#514) Co-authored-by: Alan Hanson <[email protected]>

report better error messages after Diesel queries

9302ffe

davepacheco requested a review from bnaecker April 13, 2022 04:28

fix style

12828af

davepacheco marked this pull request as draft April 13, 2022 04:49

david-crespo approved these changes Apr 13, 2022

View reviewed changes

bnaecker reviewed Apr 13, 2022

View reviewed changes

bnaecker approved these changes Apr 13, 2022

View reviewed changes

clean up implementation

bc738ba

davepacheco marked this pull request as ready for review April 13, 2022 18:36

davepacheco enabled auto-merge (squash) April 13, 2022 18:37

davepacheco merged commit 71a5326 into main Apr 13, 2022

davepacheco deleted the diesel-errors branch April 13, 2022 19:14

leftwo mentioned this pull request Sep 11, 2023

Update Crucible and Propolis to latest #4079

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

report better error messages after Diesel queries #907

report better error messages after Diesel queries #907

davepacheco commented Apr 13, 2022 •

edited

Loading

bnaecker Apr 13, 2022

davepacheco Apr 13, 2022

bnaecker Apr 13, 2022

davepacheco Apr 13, 2022

bnaecker left a comment

davepacheco commented Apr 13, 2022

		"Unknown diesel error creating {:?} called {:?}: {:#}",
		resource_type, object_name, error

report better error messages after Diesel queries #907

report better error messages after Diesel queries #907

Conversation

davepacheco commented Apr 13, 2022 • edited Loading

bnaecker Apr 13, 2022

Choose a reason for hiding this comment

davepacheco Apr 13, 2022

Choose a reason for hiding this comment

bnaecker Apr 13, 2022

Choose a reason for hiding this comment

davepacheco Apr 13, 2022

Choose a reason for hiding this comment

bnaecker left a comment

Choose a reason for hiding this comment

davepacheco commented Apr 13, 2022

davepacheco commented Apr 13, 2022 •

edited

Loading