Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding cancelcontext to closing vmbus channels #734

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

erfrimod
Copy link

@erfrimod erfrimod commented Jan 28, 2025

shutdown_vf_devices(), the method where nic_shutdown() is called, doesn't have a CancelContext. So the wait for vmbus operations during unload_for_servicing() is unbounded. We have seen shutdown_mana() wait indefinitely.

shutdown_pci_devices() returns a ShutdownError. And something similar could be put in place for shutdown_vf_devices().

Note: There may be a hang in the future if the vmbus stays unresponsive. At best this change allows us to return an error from shutdown_vf_devices() so we know where in shutdown_mana() the delay is coming from.

DONE:

  • CancelContext added.
  • Timouts return a "deadline exceeded" error that gets traced. unload_for_servicing does not get stuck.

TBD:

  • This change causes a crash in hcl::ioctl::ProcessorRunner::run trying to unwrap a None DEFERRED_ACTION. I've added code for testing to ignore None values, but that is not a proper fix.
  • Completing shutdown_vf_devices without closing the vmbus channels leaves the VM is left in a bad state. There are a lot of vmbus disk errors. And the VM is unable to shutdown when 'Stop-VM' is run.

Testing with a 1 ms duration fails, but leaves VM in a bad state

The vm sees a lot of disk errors on the vmbus. And Stop-Vm gets stuck.

Output from ohcldiag:

Invoking VTL2 reload.
ERROR: ErrorCode: 32768
ERROR: Message: 'UhVm' failed to reload management VTL at stage: SaveManagementVtlState with error code 0xC0370802 (Management VTL failed to save itself.). (Virtual machine ID 97512202-0EC0-4DD7-8FDF-514461B84936). Management VTL image file: 'custom'. Old management VTL image version: '0.0.0.0'. New management VTL image version: '0.0.0.0'. Servicing type: Explicit. Correlation ID: '3F112AA2-9BAC-421D-BE80-676FC979BEE2'.
ERROR: Failed to reload VTL2. Error = 10
ERROR: Operation failed.

Output from kmsg log:

[77.332774] state_unit: INFO servicing_save_vtl2{ correlation_id=4326c7d8-5107-403f-bdd9-613335b93496}:save_units:state_change{ operation="save"}:  state change complete duration=11.698098ms
[77.334631] vmbus_server: INFO  revoking channel offer_id=OfferId(0)
[77.334697] vmbus_server::channels: INFO  rescinding channel from guest channel_id=0xa
[77.334809] vmbus_server: INFO  revoking channel offer_id=OfferId(e)
[77.338344] underhill_core::worker: ERROR servicing_save_vtl2{ correlation_id=4326c7d8-5107-403f-bdd9-613335b93496}:shutdown_mana:  Error closing vmbus channels: cancelled waiting for vmbus channel close<0a><0a>Caused by:<0a>    deadline exceeded<0a><0a>Stack backtrace:<0a>
[77.338636] vmbus_server::channels: INFO  rescinding channel from guest channel_id=0xf
[77.338775] vmbus_server: INFO  revoking channel offer_id=OfferId(f)
[77.338825] vmbus_server::channels: INFO  rescinding channel from guest channel_id=0x10
[77.338908] vmbus_server: INFO  revoking channel offer_id=OfferId(10)
[77.338953] vmbus_server::channels: INFO  rescinding channel from guest channel_id=0x11
[77.339187] mana_driver::resources: ERROR  leaking resources
[77.342529] underhill_core::emuplat::netvsp: ERROR  Failed to disconnect endpoint err=channel closed
[77.356568] underhill_core::dispatch: ERROR  error while handling servicing error=cancelled waiting for vmbus channel close: deadline exceeded

Output on the VM:

ubuntu2004-server login: [   65.607844] blk_update_request: I/O error, dev sda, sector 36771256 op 0x0:(READ) flags 0x3000 phys_seg 1 prio class 0
[   69.667717] blk_update_request: I/O error, dev sda, sector 3338456 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
[   69.768075] Buffer I/O error on device dm-0, logical block 23579

Testing with a 1 second duration succeeds normally

Output from ohcldiag:

Invoking VTL2 reload.
Reloading VTL2 completed without any errors.

Output from kmsg log:

[103.120834] state_unit: INFO servicing_save_vtl2{ correlation_id=fa2b372b-d10c-41fa-bdc2-ec74692653ac}:save_units:state_change{ operation="save"}:  state change complete duration=25.74455ms
[103.125589] vmbus_server: INFO  revoking channel offer_id=OfferId(0)
[103.125777] vmbus_server::channels: INFO  rescinding channel from guest channel_id=0xa
[103.126055] vmbus_server: INFO  revoking channel offer_id=OfferId(e)
[103.126211] vmbus_server::channels: INFO  rescinding channel from guest channel_id=0xf
[103.126445] vmbus_server: INFO  revoking channel offer_id=OfferId(f)
[103.126584] vmbus_server::channels: INFO  rescinding channel from guest channel_id=0x10
[103.126831] vmbus_server: INFO  revoking channel offer_id=OfferId(10)
[103.126980] vmbus_server::channels: INFO  rescinding channel from guest channel_id=0x11
[103.139154] underhill_core::emuplat::netvsp: INFO  Network endpoint disconnected

actions.copy_to_slots(&mut slots, self.hcl);
} else {
actions.run_actions(self.hcl);
}
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added check for None because in testing I was hitting a crash here:

[78.547532] state_unit: INFO servicing_save_vtl2{ correlation_id=b4a14652-340a-4fee-9707-307701aaabf8}:save_units:state_change{ operation="save"}:  state change complete duration=32.541219ms
[78.552453] mana_driver::resources: ERROR  leaking resources
[78.553500] vmbus_server: INFO  revoking channel offer_id=OfferId(0)
[78.553635] vmbus_server::channels: INFO  rescinding channel from guest channel_id=0xa
[78.553847] vmbus_server: INFO  revoking channel offer_id=OfferId(e)
[78.553945] vmbus_server::channels: INFO  rescinding channel from guest channel_id=0xf
[78.554098] vmbus_server: INFO  revoking channel offer_id=OfferId(f)
[78.554194] vmbus_server::channels: INFO  rescinding channel from guest channel_id=0x10
[78.554349] vmbus_server: INFO  revoking channel offer_id=OfferId(10)
[78.554442] vmbus_server::channels: INFO  rescinding channel from guest channel_id=0x11
[78.585466] underhill_core::dispatch: ERROR  error while handling servicing error=cancelled waiting for vmbus channel close: deadline exceeded
[78.589991] vmbus_relay_intercept_device: INFO  matching channel offered offer=OfferInfo { offer: OfferChannel { interface_id: 0e0b6031-5213-4934-818b-38d90ced39db, instance_id: b6650ff7-33bc-4840-8048-e0676786f393, rsvd: [0, 0, 0, 0], flags: OfferFlags { enumerate_device_interface: true, confidential_ring_buffer: false, confidential_external_memory: false, named_pipe_mode: true, tlnpi_provider: false }, mmio_megabytes: 0, user_defined: UserDefinedData([4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), subchannel_index: 0, mmio_megabytes_optional: 0, channel_id: ChannelId(7), monitor_id: ff, monitor_allocated: 0, is_dedicated: 1, connection_id: 2007 }, request_se
[78.590006] vmbus_relay_intercept_device: nd: SenderCore(ManuallyDrop { value: Queue { remote: OnceLock(<uninit>), local: Mutex { data: LocalQueue { messages: ErasedVecDeque { cap: 280, head: 0, len: 0 }, ports: [], waker: Some(Waker { data: 0x57170180c50, vtable: 0x7fe86c02e1b8 }), remote: false, receiver_gone: false, remove_closed: false, new_handler: 0x7fe86b87d330 } } } }), response_recv: ReceiverCore { queue: ReceiverQueue(Queue { remote: OnceLock(<uninit>), local: Mutex { data: LocalQueue { messages: ErasedVecDeque { cap: 10, head: 4, len: 0 }, ports: [], waker: Some(Waker { data: 0x571700c2bb0, vtable: 0x7fe86c00d5a8 }), remote: false, receiver_gone: false, remove_closed: false, new_handler: 0x7fe86b87d330 } } }), ports: PortHandlerList([]), terminated: false } }
[78.594794] [U] task 'o
[78.594805] [U] called `Option::unwrap()` on a `None` value

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for finding this. Should be fixed by #739.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants