Datapath overhaul: zero-copy metadata with ingot, 'Compiled' UFTs #585

Open · wants to merge 114 commits into master
Conversation

@FelixMcFelix (Collaborator) commented Aug 21, 2024

This PR rewrites the core of OPTE's packet model to use zero-copy packet parsing/modification via the ingot library. This enables a few changes which get us just shy of the 3Gbps mark.

  • [2.36 -> 2.7] The use of ingot for modifying packets in both the slowpath (UFT miss) and existing fastpath (UFT hit).
    • Parsing is faster -- we no longer copy out all packet header bytes onto the stack, and we do not allocate a vector to decompose an mblk_t into individual links.
    • Packet field modifications are applied directly to the mblk_t as they happen, and field reads are made from the same source.
    • Non-encap layers are not copied back out.
  • [2.7 -> 2.75] Packet L4 hashes are cached as part of the UFT, speeding up multipath route selection over the underlay.
  • [2.75 -> 2.8] Incremental Internet checksum recalculation is only performed when applicable fields change on inner flow headers (e.g., NAT'd packets) -- a sketch of the incremental update follows this list.
    • VM-to-VM / intra-VPC traffic is the main use case here.
  • [2.8 -> 3.05] NetworkParsers now have the concept of inbound & outbound LightweightMeta formats. These support the key operations needed to execute all our UFT flows today (FlowId lookup, inner headers modification, encap push/pop, cksum update).
    • This also allows us to pre-serialize any bytes to be pushed in front of a packet, speeding up EmitSpec.
    • This is crucial for outbound traffic in particular, which has far smaller (in struct-size) metadata.
    • UFT misses or incompatible flows fall back to using the full metadata.
  • [3.05 -> 2.95] TCP state tracking uses a separate per-flow lock and does not require any lookup from a UFT.
    • I do not have numbers on how large the performance loss would be if we held the Port lock for the whole time.
  • (Not measured) Packet/UFT L4 Hashes are used as the Geneve source port, spreading inbound packets over NIC Rx queues based on the inner flow.
    • This is now possible because of #504 (Move underlay NICs back into H/W Classification) -- software classification would have limited us to the default inbound queue/group.
    • I feel bad for removing one more FF7 reference, but that is the way of these things. RIP port 7777.
    • Previously, Rx queue affinity was derived solely from (Src Sled, Dst Sled).
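
To make the incremental checksum item above concrete, here is a minimal, self-contained sketch of the RFC 1624-style update (illustrative only, not OPTE's actual code):

```rust
/// RFC 1624 (eqn. 3) incremental update: fold the difference between one
/// old and one new 16-bit header word into an existing one's-complement
/// checksum, rather than re-summing the whole pseudo-header and payload.
fn csum_incr_update(csum: u16, old_word: u16, new_word: u16) -> u16 {
    // HC' = ~(~HC + ~m + m')
    let mut sum = (!csum as u32) + (!old_word as u32) + (new_word as u32);
    while sum > 0xffff {
        // Fold carries back into the low 16 bits.
        sum = (sum & 0xffff) + (sum >> 16);
    }
    !(sum as u16)
}
```

Only the words a transform actually rewrites (e.g. a NAT'd address or port) cost anything; untouched inner-flow traffic skips the work entirely.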

Several other changes to how OPTE functions were needed to support the zero-copy model.

  • Each collected set of header transforms is Arc<>'d, such that we can apply them outside of the Port lock.
  • FlowTable<S>s now store Arc<FlowEntry<S>>, rather than FlowEntry<S> (see the sketch after this list).
    • This enables the UFT entry for any flow to store its matched TCP flow, update its hit count and timestamp, and then update the TCP state without reacquiring the Port lock.
    • This also drastically simplifies TCP state handling in fast path cases to not rely on post-transformation packets for lookup.
  • Opte::process returns an EmitSpec which is needed to finalise a packet before it can be used.
    • I'm not too happy about the ergonomics, but we have this problem because otherwise we'd need Packet to have some self-referential fields when supporting other key parts of XDE (e.g., parse -> use fields -> select port -> process).
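
To make the Arc'd flow-entry point above concrete, a rough sketch of the pattern (std-based stand-ins for illustration; the names and locking primitives are not OPTE's actual types):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

struct FlowEntry<S> {
    // Per-entry locks: the hit count (and, in the real thing, timestamp and
    // any matched TCP flow state) can be advanced without the table lock.
    hits: Mutex<u64>,
    state: Mutex<S>,
}

struct FlowTable<K, S> {
    // Stand-in for the structure guarded by the Port lock.
    map: Mutex<HashMap<K, Arc<FlowEntry<S>>>>,
}

impl<K: std::hash::Hash + Eq, S> FlowTable<K, S> {
    fn lookup(&self, key: &K) -> Option<Arc<FlowEntry<S>>> {
        // Clone the Arc while the table lock is held, then drop the lock.
        self.map.lock().unwrap().get(key).cloned()
    }
}

fn fast_path<S>(table: &FlowTable<u64, S>, key: u64, advance: impl FnOnce(&mut S)) {
    if let Some(entry) = table.lookup(&key) {
        // The table (Port) lock is already released here; only the
        // per-entry locks are taken to bump stats and advance flow state.
        *entry.hits.lock().unwrap() += 1;
        advance(&mut *entry.state.lock().unwrap());
    }
}
```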

Closes #571, closes #481, closes #460.

Slightly alleviates #435.

Original testing notes.

This is not exactly a transformative increase, according to testing on glasgow, but it is an increase of around 15--20% zone-to-zone vs #504:

root@a:~# iperf -c 10.0.0.1 -P8
Connecting to host 10.0.0.1, port 5201
[  4] local 10.0.0.2 port 39797 connected to 10.0.0.1 port 5201
[  6] local 10.0.0.2 port 55568 connected to 10.0.0.1 port 5201
[  8] local 10.0.0.2 port 55351 connected to 10.0.0.1 port 5201
[ 10] local 10.0.0.2 port 49474 connected to 10.0.0.1 port 5201
[ 12] local 10.0.0.2 port 61952 connected to 10.0.0.1 port 5201
[ 14] local 10.0.0.2 port 47930 connected to 10.0.0.1 port 5201
[ 16] local 10.0.0.2 port 53057 connected to 10.0.0.1 port 5201
[ 18] local 10.0.0.2 port 63541 connected to 10.0.0.1 port 5201
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.00   sec  38.2 MBytes   320 Mbits/sec
[  6]   0.00-1.00   sec  38.3 MBytes   321 Mbits/sec
[  8]   0.00-1.00   sec  38.3 MBytes   321 Mbits/sec
[ 10]   0.00-1.00   sec  38.0 MBytes   319 Mbits/sec
[ 12]   0.00-1.00   sec  38.0 MBytes   319 Mbits/sec
[ 14]   0.00-1.00   sec  38.0 MBytes   318 Mbits/sec
[ 16]   0.00-1.00   sec  38.1 MBytes   319 Mbits/sec
[ 18]   0.00-1.00   sec  38.0 MBytes   319 Mbits/sec
[SUM]   0.00-1.00   sec   305 MBytes  2.56 Gbits/sec

...

- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   9.00-10.00  sec  43.0 MBytes   361 Mbits/sec
[  6]   9.00-10.00  sec  42.9 MBytes   359 Mbits/sec
[  8]   9.00-10.00  sec  42.8 MBytes   359 Mbits/sec
[ 10]   9.00-10.00  sec  42.7 MBytes   358 Mbits/sec
[ 12]   9.00-10.00  sec  42.9 MBytes   360 Mbits/sec
[ 14]   9.00-10.00  sec  42.9 MBytes   359 Mbits/sec
[ 16]   9.00-10.00  sec  43.0 MBytes   360 Mbits/sec
[ 18]   9.00-10.00  sec  42.8 MBytes   359 Mbits/sec
[SUM]   9.00-10.00  sec   343 MBytes  2.88 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-10.00  sec   426 MBytes   357 Mbits/sec                  sender
[  4]   0.00-10.00  sec   425 MBytes   357 Mbits/sec                  receiver
[  6]   0.00-10.00  sec   426 MBytes   357 Mbits/sec                  sender
[  6]   0.00-10.00  sec   426 MBytes   357 Mbits/sec                  receiver
[  8]   0.00-10.00  sec   426 MBytes   357 Mbits/sec                  sender
[  8]   0.00-10.00  sec   426 MBytes   357 Mbits/sec                  receiver
[ 10]   0.00-10.00  sec   425 MBytes   356 Mbits/sec                  sender
[ 10]   0.00-10.00  sec   425 MBytes   356 Mbits/sec                  receiver
[ 12]   0.00-10.00  sec   426 MBytes   357 Mbits/sec                  sender
[ 12]   0.00-10.00  sec   426 MBytes   357 Mbits/sec                  receiver
[ 14]   0.00-10.00  sec   425 MBytes   356 Mbits/sec                  sender
[ 14]   0.00-10.00  sec   425 MBytes   356 Mbits/sec                  receiver
[ 16]   0.00-10.00  sec   426 MBytes   357 Mbits/sec                  sender
[ 16]   0.00-10.00  sec   425 MBytes   357 Mbits/sec                  receiver
[ 18]   0.00-10.00  sec   425 MBytes   357 Mbits/sec                  sender
[ 18]   0.00-10.00  sec   425 MBytes   357 Mbits/sec                  receiver
[SUM]   0.00-10.00  sec  3.32 GBytes  2.85 Gbits/sec                  sender
[SUM]   0.00-10.00  sec  3.32 GBytes  2.85 Gbits/sec                  receiver

The main takeaway is that we have basically cut the time we're spending doing non-MAC things down to the bone, and, courtesy of lockstat, we are no longer the most contended lock holder.
[image: tx]

Zooming in a little on a representative call (percentages here are of CPU time across examined stacks):
[image: zoomed-in profile of a representative call]
For context, xde_mc_tx is listed as taking 39.92% on this path, and str_mdata_fastpath_put as 21.50%. Packet parsing (3.36%) and processing times (1.86%) are nice and low! So we're now spending less time on each packet than MAC and the device driver do.

@FelixMcFelix added this to the 11 milestone Aug 22, 2024
@FelixMcFelix (Collaborator, Author) commented:
Using an L4-hash-derived source port looks like it is driving Rx traffic onto separate cores, from a quick look in dtrace -- in a one-port scenario this puts us back at being the second most-contended lock during a -P 8 iperf run. (A single-threaded run remains fairly uncontended.)

-------------------------------------------------------------------------------
Count indv cuml rcnt     nsec Lock                   Caller                  
260502  12%  41% 0.00      637 0xfffffcfa34386be8     _ZN4opte6engine4port13Port$LT$N$GT$12thin_process17h316b1c1b8ce14471E+0x7c

      nsec ------ Time Distribution ------ count                             
       256 |@@@@@@@@@@                     90486     
       512 |@@@@@@@@@@@                    97481     
      1024 |@@@@                           37487     
      2048 |@@                             20513     
      4096 |@                              11065     
      8192 |                               2786      
     16384 |                               533       
     32768 |                               116       
     65536 |                               20        
    131072 |                               13        
    262144 |                               1         
    524288 |                               1         

This doesn't really affect speed, but I expect it should mean that traffic for different ports will at least be able to avoid being processed on the same CPU in many cases. E.g., when sled $A$ hosts ports $A_1, A_2$ and sled $B$ hosts ports $B_1, B_2$, all $A \leftrightarrow B$ combinations had the same outer 5-tuple $(A_{\mathit{IP6}},B_{\mathit{IP6}},\mathit{UDP},7777,6081)$ -- so identical Rx queue mapping. There's work to be done to get contention down further, but that's beyond this PR's scope.
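
A rough sketch of the entropy idea (illustrative names, not the actual change): fold the cached L4 hash of the inner flow into the outer Geneve/UDP source port, so the receiver's RSS hash differs per inner flow rather than per sled pair.

```rust
/// Fold a 32-bit inner-flow hash into a 16-bit value and keep it in the
/// dynamic port range, so each inner flow maps to a distinct outer
/// (src port, dst port = 6081) pair and, usually, a distinct Rx queue.
fn geneve_entropy_port(inner_flow_hash: u32) -> u16 {
    let folded = ((inner_flow_hash >> 16) ^ inner_flow_hash) as u16;
    0xc000 | folded // 49152..=65535
}
```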

@twinfees added the customer label (for any bug reports or feature requests tied to customer requests) Aug 27, 2024
@FelixMcFelix self-assigned this Sep 6, 2024
Packet Rx is apparently 180% more costly now on `glasgow`.
TODO: find where the missing 250 Mbps has gone.
Notes from rough turning-off-and-on of the Old Way:

* Thin process is slower than it was before. I suspect this is due to
  the larger amount of things which have been shoved into the full
  Packet<Parsed> type once again. We're at 2.8--2.9 rather than 2.9--3.
* Thin process has a bigger performance impact on the Rx pathway than
  Tx:
   - Rx-only: 2.8--2.9
   - Tx-only: 2.74
   - None:    2.7
   - Old:   <=2.5

There might be value in first-classing an extra parse state for the
cases where we know we don't need to do arbitrary full-on transforms.
@FelixMcFelix (Collaborator, Author) left a comment:
I think I'm happy with this, barring some open questions I've left in self-review. Some 145 tests are working/passing/rewritten.

As far as reviewability goes, we could cut compiled UFTs into a follow-up PR if need be. I don't believe that this will bring the size of the diff down substantially (-1–1.5k lines?), given the nature of a stack rewrite like this.

Comment on lines 1213 to 1221
pub fn process<'a, M>(
    &self,
    dir: Direction,
    pkt: &mut Packet<Parsed>,
    mut ameta: ActionMeta,
) -> result::Result<ProcessResult, ProcessError> {
    let flow_before = *pkt.flow();
    let epoch = self.epoch.load(SeqCst);
    let mut data = self.data.lock();
    // TODO: might want to pass in a &mut to an enum
    // which can advance to (and hold) light->full-fat metadata.
    // My gutfeel is that there's a perf cost here -- this struct
    // is pretty fat, but expressing the transform on a &mut also sucks.
    mut pkt: Packet<LiteParsed<MsgBlkIterMut<'a>, M>>,
) -> result::Result<ProcessResult, ProcessError>
@FelixMcFelix (Collaborator, Author):
The new CompiledUft changes (and unification of process_in/process_out) are here. I'm not too happy about pkt being passed by value, given the size of even the lite metadata formats. If there are any ideas, I'd be more than happy to see what we can do here.

) -> Result<LiteInPkt<MsgBlkIterMut<'_>, NP>, ParseError> {
    let pkt = Packet::new(pkt.iter_mut());
    pkt.parse_inbound(parser)
}
@FelixMcFelix (Collaborator, Author):
Removed as of cdf1d59.

lib/opte/src/ddi/mblk.rs (outdated; resolved)
lib/opte/src/ddi/time.rs (outdated; resolved)
xde/x86_64-unknown-unknown.json (resolved)
Necessary to safely handle cases where, e.g., viona has pulled up part
of the packet for headers, but anything after this cutoff is guest
memory (thus, unsafe to construct a `&[u8]` or `&mut [u8]` over).

This also ensures that any time we count the bytes in a MsgBlk b_cont
chain, we do so exclusively using rptr and wptr (rather than
constructing a slice).

One piece left TODO is making sure that body transforms on such
packets are properly handled.
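
For illustration, a minimal sketch of the rptr/wptr-only byte accounting described above (the struct is a simplified stand-in for the illumos mblk_t, not OPTE's MsgBlk):

```rust
/// Simplified stand-in for the illumos `mblk_t` fields used here.
#[repr(C)]
struct MblkLike {
    b_cont: *mut MblkLike,
    b_rptr: *mut u8,
    b_wptr: *mut u8,
}

/// Count readable bytes across a b_cont chain using pointer arithmetic
/// only, never forming a `&[u8]` over memory that may belong to the guest.
///
/// # Safety
/// `head` must be null or point to a validly linked chain whose
/// rptr/wptr pairs each lie within one allocation.
unsafe fn chain_len(mut head: *const MblkLike) -> usize {
    let mut len = 0usize;
    while !head.is_null() {
        // wptr - rptr is this segment's readable byte count.
        len += (*head).b_wptr.offset_from((*head).b_rptr) as usize;
        head = (*head).b_cont;
    }
    len
}
```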
Seems to more reliably push us up to >=3.0Gbps, primarily by eliding the
fat `memcpy`s needed to move some of the metadata structs out (>128B).
lib/opte/src/ddi/mblk.rs (outdated; resolved)
/// * Return [`WrapError::Chain`] if `mp->b_next` or `mp->b_prev` are set.
pub unsafe fn wrap_mblk(ptr: *mut mblk_t) -> Result<Self, WrapError> {
    let inner = NonNull::new(ptr).ok_or(WrapError::NullPtr)?;
    let inner_ref = inner.as_ref();
A contributor commented:
If we're going to be turning the NonNull<mblk_t> into a reference (here, and elsewhere), we should probably verify that it meets the alignment requirements (not that there's any expectation that mblk pointers would fail them)

It seems like code uses raw pointers in some places, and references in others. Switching to raw pointers all the time might avoid some potential for UB relating to this, obviating the need for alignment checks on construction. I'm not sure which is the right approach for this abstraction.
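
A sketch of the kind of guard being suggested (illustrative, not the code under review):

```rust
use core::ptr::NonNull;

/// Promote the pointer to a reference only if it is aligned for `T`;
/// a misaligned reference is UB even if it is never dereferenced.
///
/// # Safety
/// The caller must still guarantee the pointee is valid and live for `'a`.
unsafe fn aligned_as_ref<'a, T>(ptr: NonNull<T>) -> Option<&'a T> {
    if (ptr.as_ptr() as usize) % core::mem::align_of::<T>() == 0 {
        Some(ptr.as_ref())
    } else {
        None
    }
}
```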

@FelixMcFelix (Collaborator, Author):
I've ended up going for raw dereferences across the board as of 33137dd, for the sake of consistency.

lib/opte/src/ddi/mblk.rs (outdated; resolved)

/// Drops all empty mblks from the start of this chain where possible
/// (i.e., any empty mblk is followed by another mblk).
pub fn drop_empty_segments(&mut self) {
A contributor commented:
We should probably be wary about any of the associated metadata when we're dropping mblks from the b_cont chain. If the first mblk is empty, but bears flags regarding checksums or LSO, it would be a bother to lose that info. This applies to basically all operations which are manipulating or copying (including pullup) mblk packets.

@FelixMcFelix (Collaborator, Author):
We can copy db_struioun across during these operations, but I'm not sure what we should be doing with db_cksum{start, end, stuff} and whether there are any flags we should be neutering (HCK_PARTIALCKSUM?). I've caused a few panics in LSO testing by including that one.

@FelixMcFelix (Collaborator, Author):
This seems to run okay as of 33137dd.

}

#[derive(Debug)]
pub struct MsgBlkNode(mblk_t);
A contributor commented:
Would be nice to have some documentation here about when/why MsgBlkNode should be used in lieu of MsgBlk. In particular, why one isn't implemented in terms of the other.

@FelixMcFelix (Collaborator, Author):
I've put in some commentary here as of 27ecc8d, but we could push more methods down if required.

lib/opte/src/ddi/mblk.rs (outdated; resolved)
Labels: customer (for any bug reports or feature requests tied to customer requests), perf
3 participants