
GET: Allow VPCI_DEVICE_CONTROL to interleave with most host requests #964

Draft · wants to merge 1 commit into main
Conversation

@jackschefer-msft commented Mar 5, 2025

This change addresses a circular wait involving VPCI_DEVICE_CONTROL messages. Specifically, in certain scenarios, strictly serial processing of GET host requests blocks the guest OS in VTL0 from making forward progress.

For example, a VPCI_DEVICE_CONTROL message revoking a VPCI device may be unable to complete because the VTL0 OS is concurrently accessing other PCI devices that are themselves proxied over the GET (e.g., VgaProxyPciReadRequest).
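
The shape of the fix, as a minimal hypothetical sketch (not the actual openvmm code; `ProcessLoopSketch`, `run_once`, and `serve_serial_requests` are illustrative names): long-running "async" host requests are drained from a secondary queue that is polled concurrently with the ordinary serial request path, so neither can starve the other.

```rust
use std::collections::VecDeque;
use std::future::Future;
use std::pin::Pin;

use futures::future;
use futures::pin_mut;

type AsyncHostRequest = Pin<Box<dyn Future<Output = ()> + Send>>;

struct ProcessLoopSketch {
    async_host_requests: VecDeque<AsyncHostRequest>,
}

impl ProcessLoopSketch {
    async fn run_once(&mut self) {
        // Drain the secondary queue one request at a time: within this
        // queue, requests remain strictly serialized.
        let run_next_async = async {
            while let Some(request) = self.async_host_requests.front_mut() {
                request.as_mut().await;
                self.async_host_requests.pop_front();
            }
            // Stay pending once the queue is empty so the serial path
            // below decides when this iteration of the loop completes.
            future::pending::<()>().await
        };
        pin_mut!(run_next_async);
        // Race the secondary queue against the ordinary serial request
        // path: a slow VPCI_DEVICE_CONTROL no longer blocks proxied PCI
        // reads, and vice versa.
        future::select(run_next_async, Box::pin(serve_serial_requests())).await;
    }
}

/// Placeholder for the normal, strictly serialized GET request path.
async fn serve_serial_requests() {}
```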

@jackschefer-msft requested a review from a team as a code owner March 5, 2025 21:09
@jackschefer-msft marked this pull request as draft March 5, 2025 21:10
@@ -479,9 +484,9 @@ pub(crate) struct ProcessLoop<T: RingMem> {
     #[inspect(skip)]
     write_recv: mesh::Receiver<WriteRequest>,
     #[inspect(skip)]
-    igvm_attest_requests: VecDeque<Pin<Box<dyn Future<Output = Result<(), FatalError>> + Send>>>,
+    async_host_requests: VecDeque<Pin<Box<dyn Future<Output = Result<(), FatalError>> + Send>>>,
Contributor

shouldn't we have a queue per message kind? I.e., with this model, would an outstanding IGVM_ATTEST request block a VPCI_DEVICE_CONTROL message?

Author

As I understood the current GET protocol here, these "async" or "delegate" host requests are asynchronous with respect to the normal/serial host requests, but they are not asynchronous with respect to one another.

That is, it would be illegal for OpenHCL to have two of these requests outstanding at a given time, and so we should have them queue together.

In my opinion, it would be desirable for a future revision of the GET protocol to relax that serialization.

Maybe "async" is a poor name for this "secondary queue of serialized requests"?

Contributor

I haven't looked too deeply at how the current host code is structured, but if there is indeed a shared delegate for all async requests, then yes, this would be an accurate model of the current protocol revision.

Contributor

One more question (asked without looking too deeply at the current code): do we guarantee that there is only ever one async request sent from OpenHCL at a time? If not, we may want to tweak this code to buffer new async requests to the host when one has already been sent (as opposed to sending out multiple requests and then waiting for multiple responses).

Author

That was my goal, but I am not convinced I have guaranteed it. I think it would be super useful to write a test that confirms the behavior, but I'm still trying to figure out what test code is available for the GET client. Please let me know if you have any pointers/examples!

Contributor

Requests sent to the GED are serialized, but there is no such guarantee for the responses if they are handled asynchronously. I added a separate queue so that OpenHCL can check for the asynchronous IGVM attest response separately from the main queue (which handles each request and response sequentially), so that we neither block it nor get blocked by it.

Now, if we put another request in the same queue as IGVM attest, OpenHCL will handle both responses sequentially; i.e., the responses for the two request types will block each other. If the GED handles these requests asynchronously, we should put the new request in a new queue.
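
For illustration, here is a sketch of the "queue per message kind" alternative discussed above (hypothetical names; it assumes the GED could service the two kinds concurrently, which the current protocol revision does not guarantee):

```rust
use std::collections::VecDeque;
use std::future::Future;
use std::pin::Pin;

type HostRequest = Pin<Box<dyn Future<Output = ()> + Send>>;

#[derive(Default)]
struct PerKindQueues {
    igvm_attest: VecDeque<HostRequest>,
    vpci_device_control: VecDeque<HostRequest>,
}

impl PerKindQueues {
    /// Drive both queues concurrently. Within a kind, requests remain
    /// strictly serialized; across kinds, responses never block each other.
    async fn run(&mut self) {
        futures::join!(
            drain(&mut self.igvm_attest),
            drain(&mut self.vpci_device_control),
        );
    }
}

/// Complete the queued requests of one message kind, in FIFO order.
async fn drain(queue: &mut VecDeque<HostRequest>) {
    while let Some(request) = queue.front_mut() {
        request.as_mut().await;
        queue.pop_front();
    }
}
```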

@jackschefer-msft (Author)

@microsoft-github-policy-service agree company="Microsoft"

        // handled by the `host_requests` queue.
        let run_next_igvm_attest = async {
            while let Some(request) = self.igvm_attest_requests.front_mut() {
                // DEVNOTE: There may only be one outstanding async host request at a time, which means
Contributor

If there can only be one, should `async_host_requests` be made an `Option` instead of a `VecDeque`?

Contributor

This ties back to the discussion in #964 (comment).

Theoretically, we could just buffer outgoing messages in a queue, and only release them once the host has responded to the last-sent message. That way, you'd still only ever have one message in flight at the protocol level, but OpenHCL could start using a "more reasonable" infallible* API for outgoing requests.

*Well, we might want to cap the queue at some reasonable value, to avoid unbounded growth.
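
A rough sketch of that buffering scheme (all names hypothetical): callers enqueue infallibly up to a cap, but only one message is ever in flight at the protocol level.

```rust
use std::collections::VecDeque;

struct OutgoingBuffer<M> {
    pending: VecDeque<M>,
    in_flight: bool,
    cap: usize,
}

impl<M> OutgoingBuffer<M> {
    fn new(cap: usize) -> Self {
        Self { pending: VecDeque::new(), in_flight: false, cap }
    }

    /// Enqueue a message, returning it back if the backlog cap would be
    /// exceeded (the cap avoids unbounded growth).
    fn enqueue(&mut self, msg: M) -> Result<(), M> {
        if self.pending.len() >= self.cap {
            return Err(msg);
        }
        self.pending.push_back(msg);
        Ok(())
    }

    /// Release the next message to send, but only if nothing is in flight.
    fn next_to_send(&mut self) -> Option<M> {
        if self.in_flight {
            return None;
        }
        let msg = self.pending.pop_front()?;
        self.in_flight = true;
        Some(msg)
    }

    /// Called once the host has responded to the last-sent message.
    fn on_response(&mut self) {
        self.in_flight = false;
    }
}
```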
