Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

report better error messages after Diesel queries #907

Merged
merged 3 commits into from
Apr 13, 2022
Merged

Conversation

davepacheco
Copy link
Collaborator

@davepacheco davepacheco commented Apr 13, 2022

This change is a little gross, but it greatly improves error messages that come from Diesel. When we get an unexpected error, it used to look like this:

{"internal_message": "Unknown diesel error: DeserializationError(Error(Build(Error { expected: 16, found: 20 })))"}

Now it looks like this:

{"internal_message": "Unknown diesel error looking up Instance ById(9e41893f-74bf-4db9-855c-91acbefd71c8): invalid bytes length: expected 16, found 20"}

It's filled in (1) that we're doing a lookup, (2) the kind of resource, (3) how we're doing the lookup (by this specific id), and (4) it's stringified the Diesel error rather than showing the debug output.


I got here because I was looking more closely at a problem @bnaecker mentioned in chat the other day. Say you apply this change to schema.rs, which looks innocuous:

diff --git a/nexus/src/db/schema.rs b/nexus/src/db/schema.rs
index 92395ada..7fccc158 100644
--- a/nexus/src/db/schema.rs
+++ b/nexus/src/db/schema.rs
@@ -72,8 +72,8 @@ table! {
         time_state_updated -> Timestamptz,
         state_generation -> Int8,
         active_server_id -> Uuid,
-        active_propolis_id -> Uuid,
         active_propolis_ip -> Nullable<Inet>,
+        active_propolis_id -> Uuid,
         target_propolis_id -> Nullable<Uuid>,
         migration_id -> Nullable<Uuid>,
         ncpus -> Int8,

A whole bunch of integration tests seem to hang -- at least, a dozen or so were still running after several minutes. If you run a test individually, it fails due to oxidecomputer/steno#26:

$ cargo test -p omicron-nexus --test=test_all -- --nocapture integration_tests::instances::test_attach_one_disk_to_instance
   Compiling omicron-nexus v0.1.0 (/home/dap/omicron/nexus)
   Compiling nexus-test-utils v0.1.0 (/home/dap/omicron/nexus/test-utils)
    Finished test [unoptimized + debuginfo] target(s) in 1m 08s
     Running tests/test_all.rs (target/debug/deps/test_all-9208224595adfa34)

running 1 test
log file: "/dangerzone/omicron_tmp/test_all-9208224595adfa34-test_attach_one_disk_to_instance.16911.0.log"
note: configured to log to "/dangerzone/omicron_tmp/test_all-9208224595adfa34-test_attach_one_disk_to_instance.16911.0.log"
thread 'integration_tests::instances::test_attach_one_disk_to_instance' panicked at 'called `Result::unwrap()` on an `Err` value: action failed', /home/dap/.cargo/git/checkouts/steno-adfbaf4765a4f9e1/578498b/src/saga_exec.rs:1014:27
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

and then hangs there. The only clue Ben and I found was in the log:

[2022-04-13T04:19:38.259670603Z] DEBUG: 23f21ed5-c962-4874-8e94-f034b4f0d019/ServerContext/16911 on ivanova: recording saga event (node_id=24, saga_id=47114f13-f67d-4f48-940a-f381b8d59334)
    event_type: Failed(ActionFailed { source_error: Object({"InternalError": Object({"internal_message": String("Unknown diesel error looking up Instance ById(9e41893f-74bf-4db9-855c-91acbefd71c8): invalid bytes length: expected 16, found 20")})}) })

Note that that's the error message after the changes in this PR. Before those changes, it looks like this:

[2022-04-13T04:23:18.602154763Z] DEBUG: 6cb8bbfc-9198-4011-988d-ae22c99333e0/ServerContext/16955 on ivanova: recording saga event (node_id=24, saga_id=bca37bef-3f1b-4a86-89d3-aff3b44de4c7)
    event_type: Failed(ActionFailed { source_error: Object({"InternalError": Object({"internal_message": String("Unknown diesel error: DeserializationError(Error(Build(Error { expected: 16, found: 20 })))")})}) })

As I mentioned, this change is kind of ugly. It's using the "NotFound" error that we would generate if the row wasn't found to figure out what we were trying to do. Ugh.

I'm thinking of adding something to the ApiResourceError trait that returns a (ResourceType, LookupType) tuple instead. I think this is easy to do, since each authz type knows its resource type and stores its lookup_type. This way we don't have to infer from the ObjectNotFound (and write code for the impossible case that it's not an ObjectNotFound error.

Another thing I considered: can the lookup API itself produce a better message here? Maybe? But it'd be kind of weird because it uses public_error_from_diesel_pool(), which has all this existing logic for picking apart what happened. Then we'd be taking the result of that, picking that apart in a similar way, and if it's an InternalError, we'd be augmenting it with more context. (A useful helper here might be some facility for taking a public Error and if it's an InternalError, augmenting the message with more context. Similar to anyhow's context() function. Something like: do_something_that_might_fail.with_internal_context(|| format!("looking up thing with id {}", id)). This would modify the error only if it was an Error::InternalError. Maybe this is worth looking into?)

@davepacheco davepacheco marked this pull request as draft April 13, 2022 04:49
Comment on lines +223 to +224
"Unknown diesel error creating {:?} called {:?}: {:#}",
resource_type, object_name, error
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did these lines end up being useful? I think the change is probably good regardless, so I'm just curious if you ran into another opaque error message that this helped resolve.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I didn't wind up using these.

error => {
let context = match make_not_found_error() {
PublicError::ObjectNotFound { type_name, lookup_type } => {
format!("looking up {:?} {:?}", type_name, lookup_type)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a nit, but is it true that this error is only hit when "looking up" a resource? I think your suggestion of a trait that provides an additional context method is really attractive, since it would allow us to provide this string or some other information from the caller.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it depends on what you mean by "lookup". I believe it's always used in a context where we expect a specific object to exist and it doesn't. Sometimes that's because we're trying to update it or delete it, not fetch it. I could make the language more general ("accessing object") to avoid confusion.

Copy link
Collaborator

@bnaecker bnaecker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a qualitative improvement to me, thanks for digging into it! I agree that some trait for providing context would be nice. I'm also curious if we could just attach entire stack traces or something similar to the internal error messages. Those will only show up in logs, so the extra information would probably not be overkill. It might be too noisy though, I'm not sure.

In any case, +1 from me on this change.

@davepacheco
Copy link
Collaborator Author

I wound up making a pretty small change that eliminates the grossness. There's still room for improvement in terms of the context carried around by errors!

@davepacheco davepacheco marked this pull request as ready for review April 13, 2022 18:36
@davepacheco davepacheco enabled auto-merge (squash) April 13, 2022 18:37
@davepacheco davepacheco merged commit 71a5326 into main Apr 13, 2022
@davepacheco davepacheco deleted the diesel-errors branch April 13, 2022 19:14
leftwo pushed a commit that referenced this pull request Sep 11, 2023
Crucible:
update rust crate base64 to 0.21.3 (#913)
update rust crate slog-async to 2.8 (#915)
update rust crate async-recursion to 1.0.5 (#912)
Move active jobs into a separate data structure and optimize `ackable_work` (#908)
Check repair IDs correctly (#910)
update actions/checkout action to v4 (#903)
update rust crate tokio to 1.32 (#890)
Remove single-item Vecs (#898)
Move "extent under repair" into a helper function (#907)
offset_mod shouldn't be randomized (#905)
Only rehash if a write may have failed (#899)
Make negotiation state an enum (#901)
Test update for fast write ack and gather errors on test failure (#897)

Propolis:
Update cpuid-gen util for better coverage
Make storage backend config more flexible and consistent
Use correct register sizes for PIIX3 PM device
Update bitflags dependency
fix softnpu port order (#517)
Use hex formatting for unhandled MMIO/PIO/MSRs
Update deps for new crucible and oximeter
Update standalone-with-crucible docs (#514)
leftwo added a commit that referenced this pull request Sep 12, 2023
Crucible:
update rust crate base64 to 0.21.3 (#913)
update rust crate slog-async to 2.8 (#915)
update rust crate async-recursion to 1.0.5 (#912)
Move active jobs into a separate data structure and optimize
`ackable_work` (#908) Check repair IDs correctly (#910)
update actions/checkout action to v4 (#903)
update rust crate tokio to 1.32 (#890)
Remove single-item Vecs (#898)
Move "extent under repair" into a helper function (#907) offset_mod
shouldn't be randomized (#905)
Only rehash if a write may have failed (#899)
Make negotiation state an enum (#901)
Test update for fast write ack and gather errors on test failure (#897)

Propolis:
Update cpuid-gen util for better coverage
Make storage backend config more flexible and consistent Use correct
register sizes for PIIX3 PM device
Update bitflags dependency
fix softnpu port order (#517)
Use hex formatting for unhandled MMIO/PIO/MSRs
Update deps for new crucible and oximeter
Update standalone-with-crucible docs (#514)

Co-authored-by: Alan Hanson <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants