-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Datapath overhaul: zero-copy metadata with ingot, 'Compiled' UFTs #585
base: master
Are you sure you want to change the base?
Conversation
Using an L4-hash-derived source port looks like it is driving Rx traffic onto separate cores from a quick look in dtrace -- in a one port scenario this puts us back at being the second most-contended lock during a
This doesn't really affect speed, but I expect this should mean that different port traffic will at least be able to avoid processing on the same CPU in many cases. E.g., when sled |
We don't actually lose any real-terms perf, go us.
Packet Rx is apparently 180% more costly now on `glasgow`.
TODO: find where the missing 250 Mbps has gone.
Notes from rough turning-off-and-on of the Old Way: * Thin process is slower than it was before. I suspect this is due to the larger amount of things which have been shoved into the full Packet<Parsed> type once again. We're at 2.8--2.9 rather than 2.9--3. * Thin process has a bigger performance impact on the Rx pathway than Tx: - Rx-only: 2.8--2.9 - Tx-only: 2.74 - None: 2.7 - Old: <=2.5 There might be value in first-classing an extra parse state for the cases that we know we don't need to do arbitrary full-on transforms.
Will see if I can cleanup PktBodyWalker further.
Pretty helpful for showing off operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'm happy with this, barring some open questions I've left in self-review. Some 145 tests are working/passing/rewritten.
As far as reviewability goes, we could cut compiled UFTs into a follow-up PR if need be. I don't believe that this will bring the size of the diff down substantially (-1–1.5k lines?), given the nature of a stack rewrite like this.
lib/opte/src/engine/port.rs
Outdated
pub fn process<'a, M>( | ||
&self, | ||
dir: Direction, | ||
pkt: &mut Packet<Parsed>, | ||
mut ameta: ActionMeta, | ||
) -> result::Result<ProcessResult, ProcessError> { | ||
let flow_before = *pkt.flow(); | ||
let epoch = self.epoch.load(SeqCst); | ||
let mut data = self.data.lock(); | ||
// TODO: might want to pass in a &mut to an enum | ||
// which can advance to (and hold) light->full-fat metadata. | ||
// My gutfeel is that there's a perf cost here -- this struct | ||
// is pretty fat, but expressing the transform on a &mut also sucks. | ||
mut pkt: Packet<LiteParsed<MsgBlkIterMut<'a>, M>>, | ||
) -> result::Result<ProcessResult, ProcessError> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new CompiledUft
changes (and unification of process_in
/process_out
) are here. I'm not too happy about pkt
being passed by value, given the size of even the lite metadata formats. If there are any ideas I'd be more than happy to see whta we can do here.
) -> Result<LiteInPkt<MsgBlkIterMut<'_>, NP>, ParseError> { | ||
let pkt = Packet::new(pkt.iter_mut()); | ||
pkt.parse_inbound(parser) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed as of cdf1d59.
Necessary to safely handle cases where, e.g., viona has pulled up part of the packet for headers, but anything after this cutoff is guest memory (thus, unsafe to construct a `&[u8]` or `&mut [u8]` over). This also ensures that any time we count the bytes in a MsgBlk b_cont chain, we do so exclusively using rptr and wptr (rather than constructing a slice). One piece left TODO is making sure that body transforms on such packets are properly handled.
Seems to more reliably push us up to >=3.0Gbps, primarily be eliding the fat `memcpy`s needed to move some of the metadata structs out (>128B).
lib/opte/src/ddi/mblk.rs
Outdated
/// * Return [`WrapError::Chain`] is `mp->b_next` or `mp->b_prev` are set. | ||
pub unsafe fn wrap_mblk(ptr: *mut mblk_t) -> Result<Self, WrapError> { | ||
let inner = NonNull::new(ptr).ok_or(WrapError::NullPtr)?; | ||
let inner_ref = inner.as_ref(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're going to be turning the NonNull<mblk_t>
into a reference (here, and elsewhere), we should probably verify that it meets the alignment requirements (not that there's any expectation that mblk pointers would fail them)
It seems like code uses raw pointers in some places, and references in others. Switching to raw pointers all the time might avoid some potential for UB relating to this, obviating the need for alignment checks on construction. I'm not sure which is the right approach for this abstraction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've ended up going for raw dereferences across the board as of 33137dd, for the sake of consistency.
|
||
/// Drops all empty mblks from the start of this chain where possible | ||
/// (i.e., any empty mblk is followed by another mblk). | ||
pub fn drop_empty_segments(&mut self) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably be wary about any of the associated metadata when we're dropping mblks from the b_cont
chain. If the first mblk is empty, but bears flags regarding checksums or LSO, it would be a bother to lose that info. This applies to basically all operations which are manipulating or copying (including pullup) mblk packets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can copy db_struioun
across during these operations. But I'm maybe unsure of what we should be doing with db_cksum{start, end, stuff}
and whether there are any flags we should be neutering (HCK_PARTIALCKSUM
?). I've caused a few panics in LSO testing by including that one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to run okay as of 33137dd.
lib/opte/src/ddi/mblk.rs
Outdated
} | ||
|
||
#[derive(Debug)] | ||
pub struct MsgBlkNode(mblk_t); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be nice to have some documentation here about when/why MsgBlkNode
should be used in lieu of MsgBlk
. In particular, why one isn't implemented in terms of the other.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've put in some commentary here as of 27ecc8d, but we could push more methods down if required.
This PR rewrites the core of OPTE's packet model to use zero-copy packet parsing/modification via the
ingot
library. This enables a few changes which get us just shy of the 3Gbps mark.mblk_t
into individual links.mblk_t
as they happen, and field reads are made from the same source.NetworkParser
s now have the concept of inbound & outboundLightweightMeta
formats. These support the key operations needed to execute all our UFT flows today (FlowId
lookup, inner headers modification, encap push/pop, cksum update).EmitSpec
.struct
-size) metadata.Port
lock for the whole time.7777
.(Src Sled, Dst Sled)
.There are several other changes here made to how OPTE functions which are needed to support the zero-copy model.
Arc<>
'd, such that we can apply them outside of thePort
lock.FlowTable<S>
s now storeArc<FlowEntry<S>>
, rather thanFlowEntry<S>
.Port
lock.Opte::process
returns anEmitSpec
which is needed to finalise a packet before it can be used.Packet
to have some self-referential fields when supporting other key parts of XDE (e.g., parse -> use fields -> select port -> process).Closes #571, closes #481, closes #460.
Slightly alleviates #435.
Original testing notes.
This is not exactly a transformative increase, according to testing on
glasgow
. But it is an increase by around 15--20% zone-to-zone vs #504:The only thing is that we have basically cut the time we're spending doing non-MAC things down to the bone, and we are no longer the most contended lock-haver, courtesy of lockstat.
Zooming in a little on a representative call (percentages here of CPU time across examined stacks):
for context,
xde_mc_tx
is listed as taking 39.92% on this path, andstr_mdata_fastpath_put
as 21.50%. Packet parsing (3.36%) and processing times (1.86%) are nice and low! So we're now spending less time on each packet than MAC and the device driver do.