% -*- TeX-engine: default; -*-
\documentclass[sigconf]{acmart}
\usepackage[font=footnotesize]{subcaption}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{url}
\usepackage{booktabs}
\usepackage{minted}
\usepackage{xcolor}
\usepackage{enumitem}
\usepackage{algorithm}
\usepackage[noend]{algpseudocode}
\usemintedstyle{friendly}
\setcopyright{ccbysa}
\acmYear{2018}
\copyrightyear{2018}
\acmConference[CoNEXT '18]{CoNEXT '18: International Conference on emerging
Networking EXperiments and Technologies}{December 4--7, 2018}{Heraklion,
Greece}
\acmBooktitle{CoNEXT '18: International Conference on emerging Networking
EXperiments and Technologies, December 4--7, 2018, Heraklion, Greece}
\acmDOI{10.1145/3281411.3281443}
\acmISBN{978-1-4503-6080-7/18/12}
%\widowpenalty=100
%\clubpenalty=100
\brokenpenalty=100
\begin{document}
\title{The eXpress Data Path: Fast Programmable Packet Processing in the Operating System Kernel}
\author{Toke Høiland-Jørgensen}
\affiliation{%
\institution{Karlstad University}}
\email{[email protected]}
\author{Jesper Dangaard Brouer}
\affiliation{%
\institution{Red Hat}}
\email{[email protected]}
\author{Daniel Borkmann}
\affiliation{%
\institution{Cilium.io}}
\email{[email protected]}
\author{John Fastabend}
\affiliation{%
\institution{Cilium.io}}
\email{[email protected]}
\author{Tom Herbert}
\affiliation{%
\institution{Quantonium Inc.}}
\email{[email protected]}
\author{David Ahern}
\affiliation{%
\institution{Cumulus Networks}}
\email{[email protected]}
\author{David Miller}
\affiliation{%
\institution{Red Hat}}
\email{[email protected]}
\renewcommand{\shortauthors}{T. Høiland-Jørgensen et al.}
\renewcommand{\shorttitle}{The eXpress Data Path}
\captionsetup{font+=small}
\hyphenation{exe-cution}
\begin{abstract}
% Max abstract length is 200 words
Programmable packet processing is increasingly implemented using kernel bypass
techniques, where a userspace application takes complete control of the
networking hardware to avoid expensive context switches between kernel and
userspace. However, as the operating system is bypassed, so are its
application isolation and security mechanisms; and well-tested configuration,
deployment and management tools cease to function.
To overcome this limitation, we present the design of a novel approach to
programmable packet processing, called the eXpress Data Path (XDP). In XDP,
the operating system kernel itself provides a safe execution environment for
custom packet processing applications, executed in device driver context. XDP
is part of the mainline Linux kernel and provides a fully integrated solution
working in concert with the kernel's networking stack. Applications are
written in higher level languages such as C and compiled into custom byte code
which the kernel statically analyses for safety, and translates into native
instructions.
We show that XDP achieves single-core packet processing performance as high as
24 million packets per second, and illustrate the flexibility of the
programming model through three example use cases: layer-3 routing, inline
DDoS protection and layer-4 load balancing.
\end{abstract}
\begin{CCSXML}
<ccs2012>
<concept>
<concept_id>10003033.10003034.10003038</concept_id>
<concept_desc>Networks~Programming interfaces</concept_desc>
<concept_significance>500</concept_significance>
</concept>
<concept>
<concept_id>10003033.10003099.10003102</concept_id>
<concept_desc>Networks~Programmable networks</concept_desc>
<concept_significance>500</concept_significance>
</concept>
<concept>
<concept_id>10011007.10010940.10010941.10010949</concept_id>
<concept_desc>Software and its engineering~Operating systems</concept_desc>
<concept_significance>300</concept_significance>
</concept>
</ccs2012>
\end{CCSXML}
\ccsdesc[500]{Networks~Programming interfaces}
\ccsdesc[500]{Networks~Programmable networks}
\ccsdesc[300]{Software and its engineering~Operating systems}
\keywords{XDP, BPF, Programmable Networking, DPDK}
\maketitle
\section{Introduction}%
\label{sec:introduction}
High-performance packet processing in software requires very tight bounds on the
time spent processing each packet. Network stacks in general purpose operating
systems are typically optimised for flexibility, which means they perform too
many operations per packet to be able to keep up with these high packet rates.
This has led to the increased popularity of special-purpose toolkits for
software packet processing, such as the Data Plane Development Kit
(DPDK)~\cite{dpdk}. Such toolkits generally bypass the operating system
completely, instead passing control of the network hardware directly to the
network application and dedicating one, or several, CPU cores exclusively to
packet processing.
The kernel bypass approach can significantly improve performance, but has the
drawback that it is more difficult to integrate with the existing system, and
applications have to re-implement functionality otherwise provided by the
operating system network stack, such as routing tables and higher level
protocols. In the worst case, this leads to a scenario where packet processing
applications operate in a completely separate environment, where familiar
tooling and deployment mechanisms supplied by the operating system cannot be
used because of the need for direct hardware access. This results in increased
system complexity and blurs security boundaries otherwise enforced by the
operating system kernel. The latter is in particular problematic as
infrastructure moves towards container-based workloads coupled with
orchestration systems such as Docker or Kubernetes, where the kernel plays a
dominant role in resource abstraction and isolation.
As an alternative to the kernel bypass design, we present a system that adds
programmability directly in the operating system networking stack in a
cooperative way. This makes it possible to perform high-speed packet processing
that integrates seamlessly with existing systems, while selectively leveraging
functionality in the operating system. This framework, called the eXpress Data
Path (XDP), works by defining a limited execution environment in the form of a
virtual machine running \emph{eBPF} code, an extended version of the original BSD
Packet Filter (BPF)~\cite{mccanne_bsd_1993} byte code format. This environment
executes custom programs directly in kernel context, before the kernel itself
touches the packet data, which enables custom processing (including redirection)
at the earliest possible point after a packet is received from the hardware. The
kernel ensures the safety of the custom programs by statically verifying them at
load time; and programs are dynamically compiled into native machine
instructions to ensure high performance.
XDP has been gradually integrated into the Linux kernel over several releases,
but no complete architectural description of the system as a whole has been
presented before. In this work we present a high-level design description of XDP
and its capabilities, and how it integrates with the rest of the Linux kernel.
Our performance evaluation shows raw packet processing performance of up to 24
million packets per second per CPU core. While this does not quite match the
highest achievable performance in a DPDK-based application on the same hardware,
we argue that the XDP system makes up for this by offering several compelling
advantages over DPDK and other kernel bypass solutions. Specifically, XDP:
\begin{itemize}
\item Integrates cooperatively with the regular networking stack, retaining full
control of the hardware in the kernel. This retains the kernel security
boundary, and requires no changes to network configuration and management
tools. In addition, any network adapter with a Linux driver can be supported
by XDP; no special hardware features are needed, and existing drivers only
need to be modified to add the XDP execution hooks.
\item Makes it possible to selectively utilise kernel network stack features
such as the routing table and TCP stack, keeping the same configuration
interface while accelerating critical performance paths.
\item Guarantees stability of both the eBPF instruction set and the programming
interface (API) exposed along with it.
\item Does not require expensive packet re-injection from user space into kernel
space when interacting with workloads based on the normal socket layer.
\item Is transparent to applications running on the host, enabling new
deployment scenarios such as inline protection against denial of service
attacks on servers.
\item Can be dynamically re-programmed without any service interruption, which
means that features can be added on the fly or removed completely when they
are not needed without interruption of network traffic, and that processing
can react dynamically to conditions in other parts of the system.
\item Does not require dedicating full CPU cores to packet processing, which
means lower traffic levels translate directly into lower CPU usage. This has
important efficiency and power saving implications.
\end{itemize}
In the rest of this paper we present the design of XDP and our performance
analysis. This is structured as follows: Section~\ref{sec:related-work} first
outlines related work. Section~\ref{sec:design} then presents the design of the
XDP system and Section~\ref{sec:perf-eval} presents our evaluation of its raw
packet processing performance. Section~\ref{sec:usecases} supplements this with
examples of real-world use cases that can be implemented with XDP. Finally,
Section~\ref{sec:future} discusses future directions of XDP, and
Section~\ref{sec:conclusion} concludes.
\section{Related work}%
\label{sec:related-work}
XDP is certainly not the first system enabling programmable packet processing.
Rather, this field has gained momentum over the last several years, and
continues to do so. Several frameworks have been presented to enable this kind
of programmability, and they have enabled many novel applications. Examples of
such applications include those performing single functions, such as
switching~\cite{rizzo2012vale}, routing~\cite{han2010packetshader}, name-based
forwarding~\cite{kirchner2016augustus}, classification~\cite{santiago2012wire},
caching~\cite{mansilha2015hierarchical} or traffic
generation~\cite{emmerich2015moongen}. They also include more general solutions
which are highly customisable and can operate on packets from a variety of
sources~\cite{han2012megapipe,marian2012netslices,linguaglossa2017high,morris1999click,dobrescu2009routebricks,openvswitch}.
To achieve high packet processing performance on Commercial Off The Shelf (COTS)
hardware, it is necessary to remove any bottlenecks between the network
interface card (NIC) and the program performing the packet processing. Since one
of the main sources of performance bottlenecks is the interface between the
operating system kernel and the userspace applications running on top of it
(because of the high overhead of a system call and complexity of the underlying
feature-rich and generic stack), low-level packet processing frameworks have to
manage this overhead in one way or another. The existing frameworks, which have
enabled the applications mentioned above, take several approaches to ensuring
high performance; and XDP builds on techniques from several of them. In the
following we give a brief overview of the similarities and differences between
XDP and the most commonly used existing frameworks.
The Data Plane Development Kit (DPDK)~\cite{dpdk} is probably the most widely
used framework for high-speed packet processing. It started out as an
Intel-specific hardware support package, but has since seen a wide uptake under
the stewardship of the Linux Foundation. DPDK is a so-called \emph{kernel
bypass} framework, which moves the control of the networking hardware out of
the kernel into the networking application, completely removing the overhead of
the kernel-userspace boundary. Other examples of this approach include the
PF\_RING ZC module~\cite{pfringzc} and the hardware-specific Solarflare
OpenOnload~\cite{openonload}. Kernel bypass offers the highest performance of
the existing frameworks~\cite{gallenmuller_comparison_2015}; however, as
mentioned in the introduction, it has significant management, maintenance and
security drawbacks.
XDP takes an approach that is the opposite of kernel bypass: Instead of moving
control of the networking hardware \emph{out} of the kernel, the
performance-sensitive packet processing operations are moved \emph{into} the
kernel, and executed before the operating system networking stack begins its
processing. This retains the advantage of removing the kernel-userspace boundary
between networking hardware and packet processing code, while keeping the kernel
in control of the hardware, thus preserving the management interface and the
security guarantees offered by the operating system. The key innovation that
enables this is the use of a virtual execution environment that verifies that
loaded programs will not harm or crash the kernel.
Prior to the introduction of XDP, implementing packet processing functionality
as a kernel module has been a high-cost approach, since mistakes can crash the
whole system, and internal kernel APIs are subject to frequent change. For this
reason, it is not surprising that few systems have taken this approach. Of those
that have, the most prominent examples are the Open vSwitch~\cite{openvswitch}
virtual switch and the Click~\cite{morris1999click} and Contrail~\cite{contrail}
virtual router frameworks, which are all highly configurable systems with a wide
scope, allowing them to amortise the cost over a wide variety of uses. XDP
significantly lowers the cost for applications of moving processing into the
kernel, by providing a safe execution environment, and by being supported by the
kernel community, thus offering the same API stability guarantee as every other
interface the kernel exposes to userspace. In addition, XDP programs can
completely bypass the networking stack, which offers higher performance than a
traditional kernel module that needs to hook into the existing stack.
While XDP allows packet processing to move into the operating system for maximum
performance, it also allows the programs loaded into the kernel to selectively
redirect packets to a special user-space socket type, which bypasses the normal
networking stack, and can even operate in a zero-copy mode to further lower the
overhead. This operating mode is quite similar to the approach used by
frameworks such as Netmap~\cite{rizzo2012netmap} and
PF\_RING~\cite{deri2009modern}, which offer high packet processing performance
by lowering the overhead of transporting packet data from the network device to
a userspace application, without bypassing the kernel completely. The Packet I/O
engine that is part of PacketShader~\cite{han2010packetshader} is another
example of this approach, and it has some similarities with special-purpose
operating systems such as Arrakis~\cite{peter2016arrakis} and
ClickOS~\cite{martins2014clickos}.
Finally, programmable hardware devices are another way to achieve
high-performance packet processing. One example is the
NetFPGA~\cite{lockwood2007netfpga}, which exposes an API that makes it possible
to run arbitrary packet processing tasks on the FPGA-based dedicated hardware.
The P4 language~\cite{bosshart2014p4} seeks to extend this programmability to a
wider variety of packet processing hardware (including, incidentally, an XDP
backend~\cite{p4xdp}). In a sense, XDP can be thought of as a ``software
offload'', where performance-sensitive processing is offloaded to increase
performance, while applications otherwise interact with the regular networking
stack. In addition, XDP programs that don't need to access kernel helper
functions can be offloaded entirely to supported networking hardware (currently
supported with Netronome smart-NICs~\cite{xdp-offload}).
In summary, XDP represents an approach to high-performance packet processing
that, while it builds on previous approaches, offers a new tradeoff between
performance, integration into the system and general flexibility. The next
section explains in more detail how XDP achieves this.
\section{The design of XDP}
\label{sec:design}
\begin{figure}[t]
\centering
\includegraphics[width=\linewidth]{figures/kernel-diagram.pdf}
\caption{\label{fig:xdp-kernel} XDP's integration with the Linux network stack.
On packet arrival, before touching the packet data, the device driver executes
an eBPF program in the main XDP hook. This program can choose to drop packets;
to send them back out the same interface it was received on; to redirect them,
either to another interface (including vNICs of virtual machines) or to
userspace through special AF\_XDP sockets; or to allow them to proceed to the
regular networking stack, where a separate TC BPF hook can perform further
processing before packets are queued for transmission. The different eBPF
programs can communicate with each other and with userspace through the use of
BPF maps. To simplify the diagram, only the ingress path is shown.}
\end{figure}
\begin{figure*}[t]
\centering
\includegraphics[width=\linewidth]{figures/xdp-execution-diagram.pdf}
\caption{\label{fig:xdp-execution} Execution flow of a typical XDP program. When
a packet arrives, the program starts by parsing packet headers to extract the
information it will react on. It then reads or updates metadata from one of
several sources. Finally, a packet can be rewritten and a final verdict for
the packet is determined. The program can alternate between packet parsing,
metadata lookup and rewriting, all of which are optional. The final verdict is
given in the form of a program return code.}
\end{figure*}
The driving rationale behind the design of XDP has been to allow
high-performance packet processing that can integrate cooperatively with the
operating system kernel, while ensuring the safety and integrity of the rest of
the system. This deep integration with the kernel obviously imposes some design
constraints, and the components of XDP have been gradually introduced into the
Linux kernel over a number of releases, during which the design has evolved
through continuous feedback and testing from the community.
Unfortunately, recounting the process and lessons learned is not possible in the
scope of this paper. Instead, this section describes the complete system, by
explaining how the major components of XDP work, and how they fit together to
create the full system. This is illustrated by Figure~\ref{fig:xdp-kernel},
which shows a diagram of how XDP integrates into the Linux kernel, and
Figure~\ref{fig:xdp-execution}, which shows the execution flow of a typical XDP
program. There are four major components of the XDP system:
\begin{itemize}
\item \textbf{The XDP driver hook} is the main entry point for an XDP program,
and is executed when a packet is received from the hardware.
\item \textbf{The eBPF virtual machine} executes the byte code of the XDP
program, and just-in-time-compiles it for increased performance.
\item \textbf{BPF maps} are key/value stores that serve as the primary
communication channel to the rest of the system.
\item \textbf{The eBPF verifier} statically verifies programs before they are
loaded to make sure they do not crash or corrupt the running kernel.
\end{itemize}
\subsection{The XDP Driver Hook}
\label{sec:prog-model}
An XDP program is run by a hook in the network device driver each time a packet
arrives. The infrastructure to execute the program is contained in the kernel as
a library function, which means that the program is executed directly in the
device driver, without context switching to userspace. As shown in
Figure~\ref{fig:xdp-kernel}, the program is executed at the earliest possible
moment after a packet is received from the hardware, before the kernel allocates
its per-packet \emph{sk\_buff} data structure or performs any parsing of the
packet.
Figure~\ref{fig:xdp-execution} shows the various processing steps typically
performed by an XDP program. The program starts its execution with access to a
context object. This object contains pointers to the raw packet data, along with
metadata fields describing which interface and receive queue the packet was
received on.
The program typically begins by parsing packet data, and can pass control to a
different XDP program through tail calls, thus splitting processing into logical
sub-units (based on, say, IP header version).
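As an illustration, the following sketch shows how such a tail call could be
issued through the \texttt{bpf\_tail\_call()} helper and a program array map;
the map name, index constant and section names are illustrative, and header
parsing is elided:
\begin{minted}[fontsize=\footnotesize,tabsize=2]{c}
/* Program array holding sub-programs that can be tail-called into;
 * userspace populates the entries after loading the programs. */
struct bpf_map_def SEC("maps") jmp_table = {
  .type = BPF_MAP_TYPE_PROG_ARRAY,
  .key_size = sizeof(u32),
  .value_size = sizeof(u32),
  .max_entries = 8,
};

#define PROG_IPV4 1 /* hypothetical index agreed upon with the loader */

SEC("xdp_dispatch")
int xdp_dispatcher(struct xdp_md *ctx)
{
  /* ... parse the Ethernet header and determine the IP version ... */

  /* Hand the packet to the IPv4 sub-program; on success, execution
   * does not return to this program. */
  bpf_tail_call(ctx, &jmp_table, PROG_IPV4);

  /* Reached only if no program is installed at that index. */
  return XDP_PASS;
}
\end{minted}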
After parsing the packet data, the XDP program can use the context object to
read metadata fields associated with the packet, describing the interface and
receive queue the packet came from. The context object also gives access to a
special memory area, located adjacent in memory to the packet data. The XDP
program can use this memory to attach its own metadata to the packet, which will
be carried with it as it traverses the system.
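A minimal sketch of this mechanism, using the \texttt{bpf\_xdp\_adjust\_meta()}
helper, is shown below; the \texttt{meta\_info} structure and the \texttt{mark}
field are illustrative and defined by the program itself, not by any kernel API:
\begin{minted}[fontsize=\footnotesize,tabsize=2]{c}
/* Illustrative layout of the custom metadata; defined by the program. */
struct meta_info {
  u32 mark;
};

SEC("xdp_meta")
int xdp_attach_meta(struct xdp_md *ctx)
{
  struct meta_info *meta;
  void *data;

  /* Grow the metadata area in front of the packet data. */
  if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*meta)) < 0)
    return XDP_PASS; /* driver may not support metadata; pass unchanged */

  /* Pointers must be re-read from the context after the adjustment. */
  data = (void *)(long)ctx->data;
  meta = (void *)(long)ctx->data_meta;

  /* The verifier requires this bounds check before the write below. */
  if (meta + 1 > data)
    return XDP_PASS;

  meta->mark = 42; /* e.g., consumed later by a TC BPF program */
  return XDP_PASS;
}
\end{minted}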
In addition to the per-packet metadata, an XDP program can define and access its
own persistent data structures (through BPF maps, described in
Section~\ref{sec:bpf-maps} below), and it can access kernel facilities through
various helper functions. Maps allow the program to communicate with the rest of
the system, and the helpers allow it to selectively make use of existing kernel
functionality (such as the routing table), without having to go through the full
kernel networking stack. New helper functions are actively added by the kernel
development community in response to requests, thus continuously expanding the
functionality that XDP programs can make use of.
Finally, the program can write any part of the packet data, including expanding
or shrinking the packet buffer to add or remove headers. This allows it to
perform encapsulation or decapsulation, as well as, for instance, rewrite
address fields for forwarding. Various kernel helper functions are available to
assist with things like checksum calculation for a modified packet.
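As a sketch of how such a rewrite might look, the following fragment removes a
hypothetical fixed-size outer header using the \texttt{bpf\_xdp\_adjust\_head()}
helper; the \texttt{ENCAP\_LEN} constant is illustrative:
\begin{minted}[fontsize=\footnotesize,tabsize=2]{c}
#define ENCAP_LEN 20 /* illustrative size of the outer header to remove */

SEC("xdp_decap")
int xdp_decap_prog(struct xdp_md *ctx)
{
  void *data = (void *)(long)ctx->data;
  void *data_end = (void *)(long)ctx->data_end;

  /* Make sure the outer header is actually present before removing it. */
  if (data + ENCAP_LEN > data_end)
    return XDP_DROP;

  /* A positive delta moves the start of the packet forward, shrinking
   * it; a negative delta would grow it for encapsulation. */
  if (bpf_xdp_adjust_head(ctx, ENCAP_LEN) < 0)
    return XDP_DROP;

  /* The helper invalidates the old pointers; they must be re-read and
   * re-checked before any further packet access. */
  data = (void *)(long)ctx->data;
  data_end = (void *)(long)ctx->data_end;
  if (data + sizeof(struct ethhdr) > data_end)
    return XDP_DROP;

  return XDP_PASS;
}
\end{minted}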
These three steps (reading, metadata processing, and writing packet data)
correspond to the light grey boxes on the left side of
Figure~\ref{fig:xdp-execution}. Since XDP programs can contain arbitrary
instructions, the different steps can alternate and repeat in arbitrary ways.
However, to achieve high performance, it is often necessary to structure the
execution order as described here.
At the end of processing, the XDP program issues a final verdict for the packet.
This is done by setting one of the four available return codes, shown on the
right-hand side of Figure~\ref{fig:xdp-execution}. There are three simple return
codes (with no parameters), which can drop the packet, immediately re-transmit
it out the same network interface, or allow it to be processed by the kernel
networking stack. The fourth return code allows the XDP program to
\emph{redirect} the packet, offering additional control over its further
processing.
Unlike the other three return codes, the redirect packet verdict requires an
additional parameter that specifies the redirection target, which is set through
a helper function before the program exits. The redirect functionality can be
used (1) to transmit the raw packet out a different network interface (including
virtual interfaces connected to virtual machines), (2) to pass it to a different
CPU for further processing, or (3) to pass it directly to a special userspace
socket address family (\texttt{AF\_XDP}). These different packet paths are shown
as solid lines in Figure~\ref{fig:xdp-kernel}. The decoupling of the return code
and the target parameter makes redirection a flexible forwarding mechanism,
which can be extended with additional target types without requiring any special
support from either the XDP programs themselves, or the device drivers
implementing the XDP hooks. In addition, because the redirect parameter is
implemented as a map lookup (where the XDP program provides the lookup key),
redirect targets can be changed dynamically without modifying the program.
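A minimal sketch of map-based redirection is shown below, using a device map
whose entries (egress interface indexes) are installed from userspace through
the \texttt{bpf\_redirect\_map()} helper; the map name and the fixed lookup key
are illustrative:
\begin{minted}[fontsize=\footnotesize,tabsize=2]{c}
/* Device map: each entry holds an egress ifindex, filled in by userspace. */
struct bpf_map_def SEC("maps") tx_port = {
  .type = BPF_MAP_TYPE_DEVMAP,
  .key_size = sizeof(int),
  .value_size = sizeof(int),
  .max_entries = 64,
};

SEC("xdp_redirect")
int xdp_redirect_prog(struct xdp_md *ctx)
{
  /* A real program would derive the key from the packet contents. */
  int key = 0;

  /* Returns XDP_REDIRECT when an entry exists for the key, so the
   * target can be changed from userspace without reloading the program. */
  return bpf_redirect_map(&tx_port, key, 0);
}
\end{minted}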
\subsection{The eBPF Virtual Machine}
\label{sec:bpf-vm}
XDP programs run in the Extended BPF (eBPF) virtual machine. eBPF is an
evolution of the original BSD packet filter (BPF) \cite{mccanne_bsd_1993} which
has seen extensive use in various packet filtering applications over the last
decades. BPF uses a register-based virtual machine to describe filtering
actions. The original BPF virtual machine has two 32-bit registers and
understands 22 different instructions. eBPF extends the number of registers to
eleven, and increases register widths to 64 bits. The 64-bit registers map
one-to-one to hardware registers on the 64-bit architectures supported by the
kernel, enabling efficient just-in-time (JIT) compilation into native machine
code. Support for compiling (restricted) C code into eBPF is included in the
LLVM compiler infrastructure~\cite{llvm}.
eBPF also adds new instructions to the instruction set. These include
arithmetic and logic instructions for the larger register sizes, as well as a
\emph{call} instruction for function calls. eBPF adopts the same calling
convention as the C language on the architectures supported by
the kernel. Along with the register mapping mentioned above, this makes it
possible to map a BPF call instruction to a single native call instruction,
enabling function calls with close to zero additional overhead. This facility is
used by eBPF to support helper functions that eBPF programs can call to interact
with the kernel while processing, as well as for function calls within the same
eBPF program.
While the eBPF instruction set itself can express any general purpose
computation, the verifier (described in Section~\ref{sec:bpf-verifier} below)
places limitations on the programs loaded into the kernel to ensure that the
user-supplied programs cannot harm the running kernel. With this in place, it is
safe to execute the code directly in the kernel address space, which makes eBPF
useful for a wide variety of tasks in the Linux kernel, not just for XDP.
Because all eBPF programs can share the same set of maps, it is possible
for programs to react to arbitrary events in other parts of the kernel. For
instance, a separate eBPF program could monitor CPU load and instruct an XDP
program to drop packets if load increases above a certain threshold.
The eBPF virtual machine supports dynamically loading and re-loading programs,
and the kernel manages the life cycle of all programs. This makes it possible to
extend or limit the amount of processing performed for a given situation, by
adding or completely removing parts of the program that are not needed, and
re-loading it atomically as requirements change. The dynamic loading of programs
also makes it possible to express processing rules directly in program code,
which for some applications can increase performance by replacing lookups into
general purpose data structures with simple conditional jumps.
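For completeness, the following userspace sketch shows how a program can be
loaded and attached with the libbpf library as distributed with the kernel
sources at the time of writing; the function names have since evolved, and the
error handling and object file path are illustrative:
\begin{minted}[fontsize=\footnotesize,tabsize=2]{c}
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

/* Sketch: load an XDP program from an ELF object file and attach it to
 * the interface identified by ifindex. Attaching to an interface that
 * already runs an XDP program atomically replaces the old program, so
 * processing can be changed without interrupting traffic. */
int load_and_attach(const char *obj_path, int ifindex)
{
  struct bpf_object *obj;
  int prog_fd;

  if (bpf_prog_load(obj_path, BPF_PROG_TYPE_XDP, &obj, &prog_fd))
    return -1;

  /* Flags can request, e.g., native driver mode or hardware offload. */
  return bpf_set_link_xdp_fd(ifindex, prog_fd, 0);
}
\end{minted}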
\subsection{BPF Maps}
\label{sec:bpf-maps}
eBPF programs are executed in response to an event in the kernel (a packet
arrival, in the case of XDP). Each time they are executed they start in the same
initial state, and they do not have access to persistent memory storage in their
program context. Instead, the kernel exposes helper functions giving programs
access to \emph{BPF maps}.
BPF maps are key/value stores that are defined upon loading an eBPF program, and
can be referred to from within the eBPF code. Maps exist in both global and
per-CPU variants, and can be shared, both between different eBPF programs
running at various places in the kernel, as well as between eBPF and userspace.
The map types include generic hash maps, arrays and radix trees, as well as
specialised types containing pointers to eBPF programs (used for tail calls), or
redirect targets, or even pointers to other maps.
Maps serve several purposes: they are a persistent data store between
invocations of the same eBPF program; a global coordination tool, where eBPF
programs in one part of the kernel can update state that changes the behaviour
in another; and a communication mechanism between userspace programs and the
kernel eBPF programs, similar to the communication between control plane and
data plane in other programmable packet processing systems.
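As a brief illustration of the kernel side of such a shared data structure, the
sketch below maintains a per-source-address byte counter in a hash map that a
userspace control program could read through the \texttt{bpf()} system call;
the map and function names are illustrative:
\begin{minted}[fontsize=\footnotesize,tabsize=2]{c}
/* Hash map keyed on IPv4 source address, shared with userspace. */
struct bpf_map_def SEC("maps") flow_bytes = {
  .type = BPF_MAP_TYPE_HASH,
  .key_size = sizeof(u32),   /* source IP address */
  .value_size = sizeof(u64), /* byte counter */
  .max_entries = 16384,
};

static void account(u32 saddr, u64 bytes)
{
  u64 *val = bpf_map_lookup_elem(&flow_bytes, &saddr);

  if (val) {
    /* Entry exists: atomic add, since the map is shared across CPUs. */
    __sync_fetch_and_add(val, bytes);
  } else {
    /* First packet from this source: create the entry. */
    u64 init = bytes;
    bpf_map_update_elem(&flow_bytes, &saddr, &init, BPF_ANY);
  }
}
\end{minted}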
\subsection{The eBPF Verifier}
\label{sec:bpf-verifier}
Since eBPF code runs directly in the kernel address space, it can directly
access, and potentially corrupt, arbitrary kernel memory. To prevent this from
happening, the kernel enforces a single entry point for loading all eBPF
programs (through the \texttt{bpf()} system call). When loading an eBPF program
it is first analysed by the in-kernel eBPF verifier. The verifier performs a
static analysis of the program byte code to ensure that the program performs no
actions that are unsafe (such as accessing arbitrary memory), and that the
program will terminate. The latter is ensured by disallowing loops and limiting
the maximum program size. The verifier works by first building a directed
acyclic graph (DAG) of the control flow of the program. This DAG is then
verified as follows:
First, the verifier performs a depth-first search on the DAG to ensure it is in
fact acyclic, i.e., that it contains no loops, and also that it contains no
unsupported or unreachable instructions. Then, in a second pass, the verifier
walks all possible paths of the DAG. The purpose of this second pass is to
ensure that the program performs only safe memory accesses, and that any helper
functions are called with the right argument types. This is ensured by rejecting
programs that perform load or call instructions with invalid arguments. Argument
validity is determined by tracking the state of all registers and stack
variables through the execution of the program.
The purpose of this register state tracking mechanism is to ensure that the
program performs no out of bounds memory accesses \emph{without knowing in
advance what the valid bounds are}. The bounds cannot be known because
programs must process data packets which vary in size; and similarly, the
contents of maps are not known in advance, so it is not known whether a given
lookup will succeed. To deal with this, the verifier checks that the program
being loaded does its own bounds checking before dereferencing pointers to
packet data, and that map lookups are checked for \texttt{NULL} values before
being dereferenced. This approach leaves the program writer in control of how
checks are integrated into the processing logic, and what to do in the error
path.
To track data access, the verifier tracks data types, pointer offsets and
possible value ranges of all registers. At the beginning of the program,
\texttt{R1} contains a pointer to the \emph{context} metadata object,
\texttt{R10} is a stack pointer, and all other registers are marked as not
initialised. At each execution step, register states are updated based on the
operations performed by the program. When a new value is stored in a register,
that register inherits the state variables from the source of the value.
Arithmetic operations will affect the possible value ranges of scalar types, and
the offsets of pointer types. The widest possible range is stored in the state
variables, e.g., if a one-byte load is performed into a register, that
register's possible value range is set to 0--255. Branches in the instruction
graph will update the register state according to the logical operation
contained in the branch. For example, given a comparison such as ``\texttt{R1 >
10}'', the verifier will set the maximum value of \texttt{R1} to 10 in one
branch, and the minimum value to 11 in the other.
Using this range information stored in the state variables, it is possible for
the verifier to predict the ranges of memory that each load instruction can
potentially access, and ensure that only safe memory accesses are performed. For
packet data access this is done by looking for comparisons with the special
\texttt{data\_end} pointer that is available in the context object; for values
retrieved from a BPF map the data size in the map definition is used; and for
values stored on the stack, accesses are checked against the data ranges that
have previously been written to. Furthermore, restrictions are placed on pointer
arithmetic, and pointers cannot generally be converted to integer values. Any
eBPF program that performs operations that the verifier cannot prove are safe
is simply rejected at load time. In addition to this, the verifier also uses
the range information to enforce aligned memory accesses.
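In practice, these requirements translate into a small number of idiomatic
checks in the program source, sketched below for a packet access and a map
lookup; the counter map is assumed to be defined like the \texttt{rxcnt} map in
the example program in Section~\ref{sec:example-xdp-program}:
\begin{minted}[fontsize=\footnotesize,tabsize=2]{c}
SEC("xdp_check")
int xdp_check_prog(struct xdp_md *ctx)
{
  void *data = (void *)(long)ctx->data;
  void *data_end = (void *)(long)ctx->data_end;
  struct ethhdr *eth = data;
  u32 key = 0; /* arbitrary counter slot for this sketch */
  long *counter;

  /* Without this comparison against data_end, the load of eth->h_proto
   * below could not be proven safe and the program would be rejected. */
  if (eth + 1 > data_end)
    return XDP_DROP;
  if (eth->h_proto != htons(ETH_P_IP))
    return XDP_PASS; /* only count IPv4 packets in this sketch */

  /* A map lookup may fail, so the returned pointer must be checked
   * against NULL before it is dereferenced. */
  counter = bpf_map_lookup_elem(&rxcnt, &key);
  if (counter)
    *counter += 1;

  return XDP_PASS;
}
\end{minted}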
It should be noted that the purpose of the verifier is to protect the internals
of the kernel from being exposed to malicious or buggy eBPF programs, not to
ensure that the programs perform their designated function in the most efficient
way possible. That is, an XDP program can slow down the machine by performing
excessive processing (up to the maximum program size), and it can corrupt
network packets if written incorrectly. Loading programs requires administrative
(root) privileges for this reason, and it is up to the eBPF programmer to
prevent these types of bugs, and to the system administrator to decide which
programs to load on the system.
\subsection{Example XDP program}
\label{sec:example-xdp-program}
\begin{listing}[p]
\caption{\label{lst:ex-xdp-prog}Example XDP program. The program parses packet
headers, swaps source and destination MAC addresses for all UDP packets, and
sends them back out the same interface. A packet counter is kept per IP
protocol number. Adapted from xdp2\_kern.c, which is distributed with the
kernel source code.}
\begin{minted}[linenos,xleftmargin=9pt,numbersep=5pt,fontsize=\footnotesize,tabsize=2]{c}
/* map used to count packets; key is IP protocol, value is pkt count */
struct bpf_map_def SEC("maps") rxcnt = {
  .type = BPF_MAP_TYPE_PERCPU_ARRAY,
  .key_size = sizeof(u32),
  .value_size = sizeof(long),
  .max_entries = 256,
};

/* swaps MAC addresses using direct packet data access */
static void swap_src_dst_mac(void *data)
{
  unsigned short *p = data;
  unsigned short dst[3];
  dst[0] = p[0]; dst[1] = p[1]; dst[2] = p[2];
  p[0] = p[3]; p[1] = p[4]; p[2] = p[5];
  p[3] = dst[0]; p[4] = dst[1]; p[5] = dst[2];
}

static int parse_ipv4(void *data, u64 nh_off, void *data_end)
{
  struct iphdr *iph = data + nh_off;
  if (iph + 1 > data_end)
    return 0;
  return iph->protocol;
}

SEC("xdp1") /* marks main eBPF program entry point */
int xdp_prog1(struct xdp_md *ctx)
{
  void *data_end = (void *)(long)ctx->data_end;
  void *data = (void *)(long)ctx->data;
  struct ethhdr *eth = data; int rc = XDP_DROP;
  long *value; u16 h_proto; u64 nh_off; u32 ipproto;

  nh_off = sizeof(*eth);
  if (data + nh_off > data_end)
    return rc;

  h_proto = eth->h_proto;

  /* check VLAN tag; could be repeated to support double-tagged VLAN */
  if (h_proto == htons(ETH_P_8021Q) || h_proto == htons(ETH_P_8021AD)) {
    struct vlan_hdr *vhdr;

    vhdr = data + nh_off;
    nh_off += sizeof(struct vlan_hdr);
    if (data + nh_off > data_end)
      return rc;
    h_proto = vhdr->h_vlan_encapsulated_proto;
  }

  if (h_proto == htons(ETH_P_IP))
    ipproto = parse_ipv4(data, nh_off, data_end);
  else if (h_proto == htons(ETH_P_IPV6))
    ipproto = parse_ipv6(data, nh_off, data_end);
  else
    ipproto = 0;

  /* lookup map element for ip protocol, used for packet counter */
  value = bpf_map_lookup_elem(&rxcnt, &ipproto);
  if (value)
    *value += 1;

  /* swap MAC addrs for UDP packets, transmit out this interface */
  if (ipproto == IPPROTO_UDP) {
    swap_src_dst_mac(data);
    rc = XDP_TX;
  }
  return rc;
}
\end{minted}
\end{listing}
To showcase the features described above, Listing~\ref{lst:ex-xdp-prog} shows an
example of a simple XDP program. The program will parse packet headers, and
reflect all UDP packets by swapping the source and destination MAC addresses and
sending the packet back out the interface it came in on. While this is obviously
a very simple example, the program does feature most of the components of an XDP
program that is useful in the real world. Specifically:
\begin{itemize}
\item A BPF map is defined (lines 1--7) for keeping statistics of the number of
processed packets. The map is keyed on IP protocol number and each value is
simply a packet count (updated in lines 60--62). A userspace program can poll
this map to output statistics while the XDP program is running.
\item Pointers to the start and end of the packet data are read from the context
object (lines 30--31), to be used for direct packet data access.
\item Checking against the \texttt{data\_end} pointer ensures that no data is read out of
bounds (lines 22, 36 and 47). The verifier ensures correctness even across
pointer copies (as in lines 21--22).
\item The program must handle any packet parsing itself, including things such
as VLAN headers (lines 41--50).
\item Direct packet data access is used to modify the packet headers (lines
14--16).
\item The map lookup helper function exposed by the kernel is called on line 60.
This is the only real function call in the program; all other functions are
inlined on compilation, including helpers like \texttt{htons()}.
\item The final packet verdict is communicated by the program return code (line
69).
\end{itemize}
When the program is installed on an interface, it is first compiled into eBPF
byte code, then checked by the verifier. The notable things checked by the
verifier in this case are (a) the absence of loops, and the total size of the
program, (b) that all direct packet data accesses are preceded by appropriate
bounds checking, (c) that the sizes of parameters passed to the map lookup
function match the map definition, and (d) that the return value from the map
lookup is checked against \texttt{NULL} before it is accessed.
\subsection{Summary}
\label{sec:design-summary}
The XDP system consists of four major components: (1) The XDP device driver hook,
which is run directly after a packet is received from the hardware. (2) The eBPF
virtual machine, which is responsible for the actual program execution (and is
also used for executing programs in other parts of the kernel). (3) BPF maps,
which allow programs running in various parts of the kernel to communicate with
each other and with userspace. And (4) the eBPF verifier, which ensures programs
do not perform any operations that can harm the running kernel.
These four components combine to create a powerful environment for writing
custom packet processing applications that can accelerate packet processing in
essential paths, while integrating with the kernel and making full use of its
existing facilities. The performance achievable by these applications is the
subject of the next section.
\section{Performance evaluation}
\label{sec:perf-eval}
In this section we present our performance evaluation of XDP. As mentioned in
Section~\ref{sec:related-work}, there are quite a few existing systems for
high-performance packet processing, and benchmarking all of them is not feasible
in the scope of this paper. Instead, we note that DPDK is the existing solution
that achieves the highest performance~\cite{gallenmuller_comparison_2015}, and
compare against that as a baseline for the current state of the art in
high-speed software packet processing (using the \texttt{testpmd} example
application shipped with the 18.05 release of DPDK). We focus on the raw packet
processing performance, using synthetic benchmarks, and also compare against the
performance of the Linux kernel network stack, to show the performance
improvements offered by XDP in the same system. In the next section, we
supplement these raw performance benchmarks with some examples of real-world
applications implemented on top of XDP, to demonstrate their feasibility within
the programming model.
For all benchmarks, we use a machine equipped with a hexa-core Intel Xeon
E5-1650 v4 CPU running at 3.60\,GHz, which supports Intel's Data Direct I/O (DDIO)
technology allowing the networking hardware Direct Memory Access (DMA) system to
place packet data directly in the CPU cache. The test machine is equipped with
two Mellanox ConnectX-5 Ex VPI dual-port 100Gbps network adapters, which are
supported by the \emph{mlx5} driver. We use the TRex packet
generator~\cite{cisco18:_trex_traff_gener} to produce the test traffic. The test
machine runs a pre-release of version 4.18 of the Linux kernel. To help others
reproduce our results, we make available the full details of our setup, along
with links to source code and the raw test data, in an online
repository~\cite{test-data}.
In our evaluation, we focus on three metrics:
\begin{itemize}
\item Packet drop performance. To show the maximum packet processing
performance, we measure the performance of the simplest possible operation of
dropping the incoming packet. This effectively measures the overhead of the
system as a whole, and serves as an upper bound on the expected performance of
a real packet processing application.
\item CPU usage. As mentioned in the introduction, one of the benefits of XDP is
that it scales the CPU usage with the packet load, instead of dedicating CPU
cores exclusively to packet processing. We quantify this by measuring how CPU
usage scales with the offered network load.
\item Packet forwarding performance. A packet processing system that cannot
forward packets has limited utility. Since forwarding introduces an additional
complexity compared to the simple processing case (e.g., interacting with more
than one network adapter, rewriting link-layer headers, etc.), a separate
evaluation of forwarding performance is useful. We include both throughput and
latency in the forwarding evaluation.
\end{itemize}
We have verified that with full-sized (1500 bytes) packets, our system can
process packets at line-speed (100\,Gbps) on a single core that is half-idle.
This makes it clear that the challenge is processing many \emph{packets} per
second, as others have also noted~\cite{rizzo2012netmap}. For this reason, we
perform all tests using minimum-sized (64 bytes) packets, and measure the
maximum number of packets per second the system can process. To measure how
performance scales with the number of CPU cores, we repeat the tests with an
increasing number of cores dedicated to packet processing.\footnote{The
Hyperthreading feature of the CPU is disabled for our experiments, so whenever
we refer to the number of active CPU cores, this means the number of physical
cores.} For XDP and the Linux network stack (which do not offer an explicit
way to dedicate cores to packet processing) we achieve this by configuring the
hardware Receive Side Scaling (RSS) feature to steer traffic to the desired
number of cores for each test.
As we will see in the results below, our tests push the hardware to its very
limits. As such, tuning the performance of the system as a whole is important to
realise optimal performance. This includes the physical hardware configuration,
configuration of the network adapter features such as Ethernet flow control and
receive queue size, and configuration parameters of the Linux kernel, where we
for instance disable full preemption and the ``retpoline'' mitigation for the
recent Meltdown and Spectre vulnerabilities. The full details of these
configuration issues are omitted here due to space constraints, but are
available in the online repository.
The following subsections present the evaluation results for each of the metrics
outlined above, followed by a general discussion of the performance of XDP
compared to the other systems. As all our results are highly repeatable, we show
results from a single test run (with no error bars) to make the graphs more
readable.
\subsection{Packet Drop Performance}
\label{sec:basel-pack-proc}
\begin{figure}[t]
\centering
\includegraphics[width=\linewidth]{figures/drop-test.pdf}
\caption{\label{fig:drop-test} Packet drop performance. DPDK uses one core for
control tasks, so only 5 are available for packet processing.}
\end{figure}
Figure~\ref{fig:drop-test} shows the packet drop performance as a function of
the number of cores. The baseline performance of XDP for a single core is
24\,Mpps, while for DPDK it is 43.5\,Mpps. Both scale their performance linearly
until they approach the global performance limit of the PCI bus, which is
reached at 115\,Mpps after enabling PCI descriptor compression support in the
hardware (trading CPU cycles for PCI bus bandwidth).
The figure also shows the performance of the Linux networking stack in two
configurations: one where we use the ``raw'' table of the \emph{iptables}
firewall module to drop packets, which ensures the earliest possible drop in the
network stack processing; and another where we use the connection tracking
(\emph{conntrack}) module, which carries a high overhead, but is enabled by
default on many Linux distributions. These two modes illustrate the performance
span of the Linux networking stack, from 1.8\,Mpps of single-core performance
with conntrack, up to 4.8\,Mpps in raw mode. It also shows that in the absence of
hardware bottlenecks, Linux performance scales linearly with the number of
cores. And finally, it shows that with its 24\,Mpps on a single core, XDP offers
a five-fold improvement over the fastest processing mode of the regular
networking stack.
As part of this Linux raw mode test, we also measured the overhead of XDP by
installing an XDP program that does no operation other than updating packet
counters and passing the packet on to the stack. We measured a drop in
performance to 4.5\,Mpps on a single core, corresponding to 13.3\,ns of
processing overhead. This is not shown on the figure, as the difference is too
small to be legible.
\subsection{CPU Usage}
\label{sec:cpu-usage}
We measure the CPU usage of the different tested systems when running the packet
drop application on a single CPU core, by recording the percentage of CPU busy
time using the \texttt{mpstat} system utility. The results are shown in
Figure~\ref{fig:drop-cpu}. The test varies the offered packet load up to the
maximum that each system can handle on a single core.
Since DPDK by design dedicates a full core to packet processing, and uses busy
polling to process the packets, its CPU usage is always pegged at 100\%, which
is the green line at the top of the figure. In contrast, both XDP and Linux
smoothly scale CPU usage with the offered load, with a slightly larger relative
increase in CPU usage at a small offered load level.
The non-linearity of the graph in the bottom-left corner is due to the fixed
overhead of interrupt processing. At lower packet rates, the number of packets
processed during each interrupt is smaller, leading to higher CPU usage per
packet.
\begin{figure}[t]
\centering
\includegraphics[width=\linewidth]{figures/drop-cpu.pdf}
\caption{\label{fig:drop-cpu} CPU usage in the drop scenario. Each line stops at
the method's maximum processing capacity. The DPDK line continues at 100\% up
to the maximum performance shown in Figure~\ref{fig:drop-test}.}
\end{figure}
\subsection{Packet Forwarding Performance}
\label{sec:pack-forw-perf}
Figure~\ref{fig:redirect-test} shows packet forwarding performance. The
forwarding applications perform a simple Ethernet address rewrite, where the
source and destination address of the incoming packet are swapped before the
packet is forwarded. This is the minimum rewriting that is needed for packet
forwarding to function, so the results represent an upper bound on forwarding
performance. Since XDP can forward packets out the same NIC as well as out a
different NIC (using two different program return codes), we include both modes
in the graph. The DPDK example program only supports forwarding packets through
a different interface, so we only include this operating mode in the test.
Finally, the Linux networking stack does not support this minimal forwarding
mode, but requires a full bridging or routing lookup to forward packets; this
lookup is expensive, and since the other applications do not perform it, the
results are not directly comparable. For this reason, we omit the Linux
networking stack from these results, and instead include the Linux routing
performance in our routing use case presented in Section~\ref{sec:fwd-usecase}.
As Figure~\ref{fig:redirect-test} shows, we again see linear scaling with the
number of cores up to a global performance bottleneck. The absolute performance
is somewhat lower than for the packet drop case, which shows the overhead of
packet forwarding. We also see that the XDP performance improves significantly
when packets are sent out on the same interface that they were received on,
surpassing the DPDK forwarding performance at two cores and above. The
performance difference is primarily due to differences in memory handling:
packet buffers are allocated by the device driver and associated with the
receiving interface. And so, when the packet is forwarded out a different
interface, the memory buffer needs to be returned to the interface that it is
associated with.
Looking at forwarding latency, as seen in Table~\ref{tbl:fwd-latency}, the
relative performance of XDP and DPDK for different-NIC forwarding is reflected
for the high packet rate test (with DPDK showing slightly lower variance as
well). However, for low packet rates, the latency of XDP is dominated by the
interrupt processing time, which leads to much higher end-to-end latency than
DPDK achieves with constant polling.
\begin{table}[tbp]
\caption{\label{tbl:fwd-latency}Packet forwarding latency. Measurement machine
connected to two ports on the same NIC, measuring end-to-end latency for 50
seconds with high and low packet rates (100 pps and 1 Mpps).}
\centering
\begin{tabular}{lcccccc}
\toprule
& \multicolumn{2}{c}{Average} & \multicolumn{2}{c}{Maximum} & \multicolumn{2}{c}{$< 10 \mu s$} \\
& \small 100 pps & \small 1 Mpps & \small 100 pps & \small 1 Mpps & \small 100 pps & \small 1 Mpps \\
\midrule
XDP & $82 \mu s$ & $7 \mu s$ & $272 \mu s$ & $202 \mu s$ & $0 \%$ & $98.1 \%$ \\
DPDK & $2 \mu s$ & $3 \mu s$ & $161 \mu s$ & $189 \mu s$ & $99.5 \%$ & $99.0 \%$ \\
\bottomrule
\end{tabular}
\end{table}
\begin{figure}[t]
\centering
\includegraphics[width=\linewidth]{figures/redirect-test.pdf}
\caption{\label{fig:redirect-test} Packet forwarding throughput. Sending and
receiving on the same interface takes up more bandwidth on the same PCI port,
which means we hit the PCI bus limit at 70\,Mpps.}
\end{figure}
\subsection{Discussion}
\label{sec:perf-discussion}
As we have seen in the previous subsections, XDP achieves significantly higher
performance than the regular Linux networking stack. Even so, for most use cases
XDP does not quite match the performance of DPDK. We believe this is primarily
because DPDK has incorporated more performance optimisations at the lowest level
of the code. To illustrate this, consider the packet drop example: XDP achieves
24\,Mpps on a single core, which corresponds to 41.6 nanoseconds per packet,
while DPDK achieves 43.5\,Mpps, or 22.9 nanoseconds per packet. The difference
of 18.7 nanoseconds corresponds to 67 clock cycles on the 3.6\,GHz processor in
our test machine. Thus, it is clear that every micro-optimisation counts; for
example, we measure an overhead of 1.3 nanoseconds for a single function call on
our test system. The mlx5 driver performs 10 function calls when processing a
single packet, corresponding to 13 of the 18.7 nanoseconds of performance
difference between XDP and DPDK.
Some of this overhead is inevitable on a general-purpose operating system such
as Linux, as device drivers and subsystems are structured in a way that makes it
possible to support a wide variety of systems and configurations. However, we
believe that some optimisations are viable. For instance, we have performed an
experiment that removed DMA-related function calls that were not needed on our
specific hardware from the driver, removing four of the 10 per-packet function
calls. This improved the packet drop performance to 29\,Mpps. Extrapolating
this, removing all function calls would increase performance to 37.6\,Mpps.
While this is not possible in practice, it is possible to remove some of them,
and combining this with other performance optimisations, we believe it is
reasonable to expect the performance gap between DPDK and XDP to lessen over
time. We see similar effects with other drivers, such as the \emph{i40e} driver
for 40\,Gbps Intel cards, which achieves full performance up to the NIC hardware
performance limit with both XDP and DPDK.\footnote{While DPDK uses the drivers
in the operating system to assume control of the hardware, it contains its own
drivers that are used for the actual packet processing.}
%
% See: benchmarks/bench02_xdp_drop.org, section "Intel 40G driver i40e" of why
% i40e was uninteresting, as HW is limiting factor.
Given the above points, we believe it is feasible for XDP to further decrease
the performance delta to DPDK. However, given the benefits of XDP in terms of
flexibility and integration with the rest of the system, XDP is already a
compelling choice for many use cases; we show some examples of this in the next
section.
\section{Real-world use cases}
\label{sec:usecases}
To show how the various aspects of XDP can be used to implement useful
real-world applications, this section describes three example use cases. These
use cases have all seen deployment in one form or another, although we use