netkvm: receive/transmit performance vastly different #1026
Comments
I suggest running iperf in the guest without WSL1 as a first benchmark. Best regards,
Another important comment: please run the test with one stream.
Wait, there are Windows binaries for iperf? Haha. I'll try that. iperf is actually there to make this more reproducible. What started this was my copies over Samba being slow from Host to Guest.
OK, I've had to flip the server/client (run the iperf client on the host, server in the guest), but the results are the same. I used this binary without WSL: https://iperf.fr/iperf-download.php
MTU=1500
$ iperf3 --time 30 -c GUEST
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-30.00 sec 2.43 GBytes 695 Mbits/sec 0 sender
[ 5] 0.00-30.00 sec 2.43 GBytes 695 Mbits/sec receiver
$ iperf3 --time 30 --parallel 4 -c GUEST
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-30.00 sec 2.26 GBytes 647 Mbits/sec 0 sender
[ 5] 0.00-30.00 sec 2.26 GBytes 646 Mbits/sec receiver
[ 7] 0.00-30.00 sec 2.18 GBytes 624 Mbits/sec 0 sender
[ 7] 0.00-30.00 sec 2.18 GBytes 623 Mbits/sec receiver
[ 9] 0.00-30.00 sec 2.45 GBytes 700 Mbits/sec 0 sender
[ 9] 0.00-30.00 sec 2.44 GBytes 699 Mbits/sec receiver
[ 11] 0.00-30.00 sec 2.51 GBytes 719 Mbits/sec 0 sender
[ 11] 0.00-30.00 sec 2.51 GBytes 718 Mbits/sec receiver
[SUM] 0.00-30.00 sec 9.40 GBytes 2.69 Gbits/sec 0 sender
[SUM] 0.00-30.00 sec 9.38 GBytes 2.69 Gbits/sec receiver
$ iperf3 --time 30 --parallel 8 -c GUEST
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-30.00 sec 1.10 GBytes 315 Mbits/sec 0 sender
[ 5] 0.00-30.00 sec 1.10 GBytes 315 Mbits/sec receiver
[ 7] 0.00-30.00 sec 1.80 GBytes 517 Mbits/sec 0 sender
[ 7] 0.00-30.00 sec 1.80 GBytes 516 Mbits/sec receiver
[ 9] 0.00-30.00 sec 2.07 GBytes 594 Mbits/sec 0 sender
[ 9] 0.00-30.00 sec 2.07 GBytes 592 Mbits/sec receiver
[ 11] 0.00-30.00 sec 2.06 GBytes 591 Mbits/sec 0 sender
[ 11] 0.00-30.00 sec 2.06 GBytes 590 Mbits/sec receiver
[ 13] 0.00-30.00 sec 1.08 GBytes 310 Mbits/sec 0 sender
[ 13] 0.00-30.00 sec 1.08 GBytes 309 Mbits/sec receiver
[ 15] 0.00-30.00 sec 1.08 GBytes 309 Mbits/sec 1 sender
[ 15] 0.00-30.00 sec 1.07 GBytes 308 Mbits/sec receiver
[ 17] 0.00-30.00 sec 1.10 GBytes 314 Mbits/sec 1 sender
[ 17] 0.00-30.00 sec 1.09 GBytes 313 Mbits/sec receiver
[ 19] 0.00-30.00 sec 2.13 GBytes 610 Mbits/sec 0 sender
[ 19] 0.00-30.00 sec 2.13 GBytes 609 Mbits/sec receiver
[SUM] 0.00-30.00 sec 12.4 GBytes 3.56 Gbits/sec 2 sender
[SUM] 0.00-30.00 sec 12.4 GBytes 3.55 Gbits/sec receiver
I had a kernel panic when first running parallelism=8. Second run was OK. There's still nonlinear scaling.
MTU=9000
$ iperf3 --time 30 -c GUEST
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-30.00 sec 17.2 GBytes 4.94 Gbits/sec 0 sender
[ 5] 0.00-30.00 sec 17.2 GBytes 4.94 Gbits/sec receiver
$ iperf3 --time 30 --parallel 4 -c GUEST
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-30.00 sec 10.1 GBytes 2.88 Gbits/sec 1 sender
[ 5] 0.00-30.00 sec 10.1 GBytes 2.88 Gbits/sec receiver
[ 7] 0.00-30.00 sec 10.8 GBytes 3.10 Gbits/sec 0 sender
[ 7] 0.00-30.00 sec 10.8 GBytes 3.10 Gbits/sec receiver
[ 9] 0.00-30.00 sec 10.2 GBytes 2.92 Gbits/sec 2 sender
[ 9] 0.00-30.00 sec 10.2 GBytes 2.92 Gbits/sec receiver
[ 11] 0.00-30.00 sec 10.5 GBytes 3.00 Gbits/sec 2 sender
[ 11] 0.00-30.00 sec 10.5 GBytes 3.00 Gbits/sec receiver
[SUM] 0.00-30.00 sec 41.6 GBytes 11.9 Gbits/sec 5 sender
[SUM] 0.00-30.00 sec 41.6 GBytes 11.9 Gbits/sec receiver
$ iperf3 --time 30 --parallel 8 -c GUEST
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-30.00 sec 4.81 GBytes 1.38 Gbits/sec 1 sender
[ 5] 0.00-30.00 sec 4.81 GBytes 1.38 Gbits/sec receiver
[ 7] 0.00-30.00 sec 5.47 GBytes 1.57 Gbits/sec 1 sender
[ 7] 0.00-30.00 sec 5.47 GBytes 1.57 Gbits/sec receiver
[ 9] 0.00-30.00 sec 5.40 GBytes 1.55 Gbits/sec 1 sender
[ 9] 0.00-30.00 sec 5.40 GBytes 1.55 Gbits/sec receiver
[ 11] 0.00-30.00 sec 5.11 GBytes 1.46 Gbits/sec 1 sender
[ 11] 0.00-30.00 sec 5.11 GBytes 1.46 Gbits/sec receiver
[ 13] 0.00-30.00 sec 5.33 GBytes 1.53 Gbits/sec 0 sender
[ 13] 0.00-30.00 sec 5.33 GBytes 1.53 Gbits/sec receiver
[ 15] 0.00-30.00 sec 5.07 GBytes 1.45 Gbits/sec 2 sender
[ 15] 0.00-30.00 sec 5.07 GBytes 1.45 Gbits/sec receiver
[ 17] 0.00-30.00 sec 5.40 GBytes 1.55 Gbits/sec 0 sender
[ 17] 0.00-30.00 sec 5.40 GBytes 1.54 Gbits/sec receiver
[ 19] 0.00-30.00 sec 4.81 GBytes 1.38 Gbits/sec 2 sender
[ 19] 0.00-30.00 sec 4.81 GBytes 1.38 Gbits/sec receiver
[SUM] 0.00-30.00 sec 41.4 GBytes 11.9 Gbits/sec 8 sender
[SUM] 0.00-30.00 sec 41.4 GBytes 11.9 Gbits/sec receiver
The CPU load with parallel iperf3 scales the same way - parallelism 4 = 4 CPUs at 100%, parallelism 8 = 8 CPUs at 100%, etc.
Can you please share the crash dump?
Unfortunately I didn't manage to get it :( not even a stack trace.
large-receive-offload being off is suspicious. Also, RX checksumming looks off. @ybendito - ideas?
Confirmed that RSC is not working. Wireshark inside the guest shows 1500/9000-sized packets depending on the host MTU configuration.
on host:
Is that what you meant?
tcp-segmentation-offload should be "on" for the tap device.
Yeah, I've tried setting that using ethtool but it isn't turning on. Where do you suggest I look next?
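For reference, the check and the attempted toggle look roughly like this (tap0 is an assumed device name, not necessarily the one in use here):
$ ethtool -k tap0 | grep tcp-segmentation-offload
$ sudo ethtool -K tap0 tso on
$ ethtool -k tap0 | grep tcp-segmentation-offload   # may still report off if the guest negotiated it off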
Do you see anything in "dmesg"?
Nothing printed there, either.
Yep. I think something's wrong with TSO on this kernel/configuration. But in your case, even with TSO off, you're still going above 10 Gbit/s; I'm barely hitting 1 Gbit/s. I'm not sure where to look to figure out why TSO isn't turning on, though.
I've been digging through the kernel code and the qemu code - I can confirm that tap devices can turn on TSO, just not the ones currently in use by the VMs/created by libvirt/qemu. @ybendito, could you share your domain and network libvirt XML please? It looks like libvirt and qemu both have a role to play here in setting up the tap device correctly. My bridge is created manually, but I tried a different domain with a network created by libvirt, and TSO is still off there.
@lowjoel My results are from plain command line qemu, no libvirt, just -tap,vhost=on,id=..,script= in the command line. Fedora 28, qemu ~6.1, kernel 5.12
Could you paste that here and I'll try it as a minimal reproducer, please? Including how the tap is created? Just in case I'm missing something.
@lowjoel Enjoy )
@ybendito and how was the tap created? On my fresh Ubuntu install, using both
@lowjoel qemu creates the tap (it runs as admin). When created, qemu runs the script defined in script=/home/yurib/br0-ifup; the script is:
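A minimal sketch of such an ifup script (assuming it only brings the new tap up and attaches it to virbr0; illustrative, not the exact script):
#!/bin/sh
# qemu passes the freshly created tap device name as $1
ip link set "$1" up
ip link set "$1" master virbr0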
virbr0 is the libvirt bridge (so the device is behind local NAT)
Bingo. It wasn't the host side; it was the guest side. Initially:
WFP is the Windows Filtering Platform. I guess it's the firewall. Disabled the firewall:
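For anyone checking this themselves, the guest-side RSC state (including its operational state and, typically, the failure reasons - WFP showed up as the culprit here) can be read with the standard cmdlet:
PS> Get-NetAdapterRsc | Format-List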
Wireshark now shows ~62k sized packets.
It's not 1:1 with sending (about 30% less), but I'll take it. Incidentally, after disabling the firewall:
I didn't expect that the guest could affect the host in this way. Can I help update the wiki/docs as a form of expressing my thanks? 😄 I don't have permissions, though. I will also reach out to the firewall vendor to ask.
The guest is the one that requests enabling/disabling these options on the host tap. If the driver started with RSC enabled, it can dynamically turn it on/off (qemu configures the tap accordingly), and if the guest turned RSC on, we can turn it off/on in the tap. But if the guest started the device with RSC disabled, this means the OS is not ready to receive coalesced packets (packet size > MTU); in this case you can't turn it on in the tap. Fortunately.
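One way to observe this from the host (tap0 assumed) is to watch the tap's offload bits flip while toggling RSC in the guest's adapter properties:
$ watch -n1 "ethtool -k tap0 | grep -E 'tcp-segmentation-offload|large-receive'"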
Incidentally, for those who are seeing this:
It means that your libvirt/qemu command line is not enabling any of the offloads. Try the following under the interface definition (for qemu-kvm):
<driver name="vhost" txmode="iothread" ioeventfd="on" event_idx="on" queues="4" rx_queue_size="1024" tx_queue_size="1024">
<host csum="on" gso="on" tso4="on" tso6="on" ecn="on" ufo="on" mrg_rxbuf="on"/>
<guest csum="on" tso4="on" tso6="on" ecn="on" ufo="on"/>
</driver>
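After applying this (virsh edit plus a full guest power cycle), the host-side tap created by libvirt should report the offloads on; a quick check (vnet0 is the typical libvirt tap name, assumed here):
$ ethtool -k vnet0 | grep -E 'tcp-segmentation-offload|scatter-gather'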
As promised @YanVugenfirer @ybendito I've updated the wiki with the knowledge from this thread: https://github.com/lowjoel/kvm-guest-drivers-windows-wiki/compare/netkvm-rsc-docs Please feel free to integrate the updated docs into the wiki. And also feel free to close this issue since the problem is not with the netkvm driver. Thank you all once again for helping me!
@lowjoel Thanks for the Wiki update! Just for the statistics - can you tell us why you are testing performance and how you are using the Virtio drivers?
No problem. I have a workstation/server all-in-one setup at home. I use a Windows guest since I'm mostly familiar with it, but on the server side at $DAYJOB I'm more familiar with the Linux stack. The server's just a file server, and I have shares across the host/guest which is why I ran into this specific problem. I was testing performance because that specific share had my photos on it and transferring them for editing/publishing was unbearably slow 😅
Thanks!
Hi guys, I'm trying to reproduce this situation on our internal host, with the host in the server role and the guest in the client role. (BTW: did I get the server and client roles mixed up? Either way, I want to share this with you so we can discuss it further.) Packages:
Steps to reproduce:
# iperf -s
# ethtool -k tap0 | head -n20
Features for tap0:
rx-checksumming: off [fixed]
tx-checksumming: on
tx-checksum-ipv4: off [fixed]
tx-checksum-ip-generic: on
tx-checksum-ipv6: off [fixed]
tx-checksum-fcoe-crc: off [fixed]
tx-checksum-sctp: off [fixed]
scatter-gather: on
tx-scatter-gather: on
tx-scatter-gather-fraglist: on
tcp-segmentation-offload: on
tx-tcp-segmentation: on
tx-tcp-ecn-segmentation: off [requested on]
tx-tcp-mangleid-segmentation: on
tx-tcp6-segmentation: on
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: off [fixed]
# ethtool -k virbr0 | head -n20
Features for virbr0:
rx-checksumming: off [fixed]
tx-checksumming: on
tx-checksum-ipv4: off [fixed]
tx-checksum-ip-generic: on
tx-checksum-ipv6: off [fixed]
tx-checksum-fcoe-crc: off [fixed]
tx-checksum-sctp: off [fixed]
scatter-gather: on
tx-scatter-gather: on
tx-scatter-gather-fraglist: on
tcp-segmentation-offload: on
tx-tcp-segmentation: on
tx-tcp-ecn-segmentation: on
tx-tcp-mangleid-segmentation: on
tx-tcp6-segmentation: on
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: off [fixed]
C:\> netsh advfirewall set allprofiles state on
C:\> C:\Users\administrator\Desktop\iperf-2.1.6-win.exe -c %ip% -t 10
C:\> C:\Users\administrator\Desktop\iperf-2.1.6-win.exe -c %ip% -t 10 --reverse
I wrote a BAT script to show the detailed steps.
@echo off
setlocal enabledelayedexpansion
:: Configuration
set ip=192.168.122.1
set logdir=iperf_logs
set num_tests=100
set iperf_tool=C:\Users\administrator\Desktop\iperf-2.1.6-win.exe
:: Create log folder
if not exist %logdir% mkdir %logdir%
:: Run tests
for /L %%i in (1,1,%num_tests%) do (
echo Running test %%i...
:: Enable firewall and test iperf --reverse
echo Enabling firewall and testing iperf --reverse >> %logdir%\test_%%i.txt
netsh advfirewall set allprofiles state on
%iperf_tool% -c %ip% -t 10 --reverse >> %logdir%\test_%%i.txt
powershell -c "Get-NetAdapterRsc | Format-List | Out-File -FilePath %logdir%\test_%%i.txt -Append"
:: Enable firewall and test iperf
echo Enabling firewall and testing iperf >> %logdir%\test_%%i.txt
%iperf_tool% -c %ip% -t 10 >> %logdir%\test_%%i.txt
powershell -c "Get-NetAdapterRsc | Format-List | Out-File -FilePath %logdir%\test_%%i.txt -Append"
:: Disable firewall and test iperf --reverse
echo Disabling firewall and testing iperf --reverse >> %logdir%\test_%%i.txt
netsh advfirewall set allprofiles state off
%iperf_tool% -c %ip% -t 10 --reverse >> %logdir%\test_%%i.txt
powershell -c "Get-NetAdapterRsc | Format-List | Out-File -FilePath %logdir%\test_%%i.txt -Append"
:: Disable firewall and test iperf
echo Disabling firewall and testing iperf >> %logdir%\test_%%i.txt
%iperf_tool% -c %ip% -t 10 >> %logdir%\test_%%i.txt
powershell -c "Get-NetAdapterRsc | Format-List | Out-File -FilePath %logdir%\test_%%i.txt -Append"
:: Add separator
echo ============================== >> %logdir%\test_%%i.txt
)
echo Testing completed. All logs have been saved in the %logdir% folder.
pause
QEMU cmdline:
# cat /home/wji/firewall.sh
/usr/libexec/qemu-kvm \
-name 'avocado-vt-vm1' \
-sandbox on,elevateprivileges=deny,obsolete=deny,resourcecontrol=deny \
-blockdev '{"node-name": "file_ovmf_code", "driver": "file", "filename": "/usr/share/OVMF/OVMF_CODE.secboot.fd", "auto-read-only": true, "discard": "unmap"}' \
-blockdev '{"node-name": "drive_ovmf_code", "driver": "raw", "read-only": true, "file": "file_ovmf_code"}' \
-blockdev '{"node-name": "file_ovmf_vars", "driver": "file", "filename": "/root/avocado/data/avocado-vt/avocado-vt-vm1_win10-64-virtio-scsi-ovmf_qcow2_filesystem_VARS.raw", "auto-read-only": true, "discard": "unmap"}' \
-blockdev '{"node-name": "drive_ovmf_vars", "driver": "raw", "read-only": false, "file": "file_ovmf_vars"}' \
-machine q35,pflash0=drive_ovmf_code,pflash1=drive_ovmf_vars,memory-backend=mem-machine_mem \
-device '{"id": "pcie-root-port-0", "driver": "pcie-root-port", "multifunction": true, "bus": "pcie.0", "addr": "0x1", "chassis": 1}' \
-device '{"id": "pcie-pci-bridge-0", "driver": "pcie-pci-bridge", "addr": "0x0", "bus": "pcie-root-port-0"}' \
-nodefaults \
-device '{"driver": "VGA", "bus": "pcie.0", "addr": "0x2"}' \
-m 14336 \
-object '{"size": 15032385536, "id": "mem-machine_mem", "qom-type": "memory-backend-ram"}' \
-smp 16,maxcpus=32,cores=16,threads=1,dies=1,sockets=2 \
-cpu 'EPYC-Milan',x2apic=on,tsc-deadline=on,hypervisor=on,tsc-adjust=on,vaes=on,vpclmulqdq=on,spec-ctrl=on,stibp=on,arch-capabilities=on,ssbd=on,cmp-legacy=on,overflow-recov=on,succor=on,stibp-always-on=on,virt-ssbd=on,amd-psfd=on,lbrv=on,tsc-scale=on,vmcb-clean=on,flushbyasid=on,pause-filter=on,pfthreshold=on,v-vmsave-vmload=on,vgif=on,no-nested-data-bp=on,lfence-always-serializing=on,null-sel-clr-base=on,rdctl-no=on,skip-l1dfl-vmentry=on,mds-no=on,pschange-mc-no=on,gds-no=on,rfds-no=on,erms=off,fsrm=off,hv_stimer,hv_synic,hv_vpindex,hv_relaxed,hv_spinlocks=0x1fff,hv_vapic,hv_time,hv_frequencies,hv_runtime,hv_tlbflush,hv_reenlightenment,hv_stimer_direct,hv_ipi,hv-xmm-input,hv_tlbflush_ext,kvm_pv_unhalt=on \
-device '{"ioport": 1285, "driver": "pvpanic", "id": "idmIt3Xu"}' \
-device '{"id": "pcie-root-port-1", "port": 1, "driver": "pcie-root-port", "addr": "0x1.0x1", "bus": "pcie.0", "chassis": 2}' \
-device '{"driver": "qemu-xhci", "id": "usb1", "bus": "pcie-root-port-1", "addr": "0x0"}' \
-device '{"driver": "usb-tablet", "id": "usb-tablet1", "bus": "usb1.0", "port": "1"}' \
-device '{"id": "pcie-root-port-2", "port": 2, "driver": "pcie-root-port", "addr": "0x1.0x2", "bus": "pcie.0", "chassis": 3}' \
-device '{"id": "virtio_scsi_pci0", "driver": "virtio-scsi-pci", "bus": "pcie-root-port-2", "addr": "0x0"}' \
-blockdev '{"node-name": "file_image1", "driver": "file", "auto-read-only": true, "discard": "unmap", "aio": "threads", "filename": "/home/kvm_autotest_root/images/win10-64-virtio-scsi-ovmf.qcow2", "cache": {"direct": true, "no-flush": false}}' \
-blockdev '{"node-name": "drive_image1", "driver": "qcow2", "read-only": false, "cache": {"direct": true, "no-flush": false}, "file": "file_image1"}' \
-device '{"driver": "scsi-hd", "id": "image1", "drive": "drive_image1", "write-cache": "on"}' \
-device '{"id": "pcie-root-port-3", "port": 3, "driver": "pcie-root-port", "addr": "0x1.0x3", "bus": "pcie.0", "chassis": 4}' \
-device virtio-net-pci,tx=bh,ioeventfd=on,event_idx=on,csum=on,gso=on,host_tso4=on,host_tso6=on,host_ecn=on,host_ufo=on,mrg_rxbuf=on,guest_csum=on,guest_tso4=on,guest_tso6=on,guest_ecn=on,guest_ufo=on,mq=on,vectors=18,rx_queue_size=1024,tx_queue_size=256,netdev=idCJE9Sq,mac=9a:19:f2:c3:bd:02,bus=pcie.0,addr=0x9,id=ideIaQ30 \
-netdev '{"id": "idCJE9Sq", "type": "tap", "vhost": true, "queues": 8}' \
-blockdev '{"node-name": "drive_cd1", "driver": "file", "read-only": true, "discard": "unmap", "aio": "threads", "filename": "/home/kvm_autotest_root/iso/windows/winutils.iso", "cache": {"direct": true, "no-flush": false}}' \
-device '{"driver": "scsi-cd", "id": "cd1", "drive": "drive_cd1", "write-cache": "on"}' \
-vnc 0.0.0.0:16 \
-rtc base=localtime,clock=host,driftfix=slew \
-boot menu=off,order=cdn,once=c,strict=off \
-chardev socket,id=char_vtpm_tpm0,path=/tmp/guest-swtpm16.sock \
-tpmdev emulator,chardev=char_vtpm_tpm0,id=emulator_vtpm_tpm0 \
-device tpm-crb,id=tpm-crb_vtpm_tpm0,tpmdev=emulator_vtpm_tpm0 \
-enable-kvm \
-device '{"id": "pcie_extra_root_port_0", "driver": "pcie-root-port", "multifunction": true, "bus": "pcie.0", "addr": "0x3", "chassis": 5}' I run it for the 100 times. Let's check the results. |
Describe the bug
iperf3 can send ~10 Gbit/s from the guest to the host on a single connection:
But less than 10% of that performance when receiving from the host:
Copying a file over the same bridge, but between 2 Windows VMs, gives me ~1.6 Gbit/s and doesn't exhibit the same issue.
To Reproduce
Steps to reproduce the behaviour:
My Windows iperf3 is on WSL1, so there's no Hyper-V layer in between (but I get to run iperf3). See #1026 (comment) for iperf3 using Cygwin (no WSL).
I have got a few workarounds: use --parallel for iperf, or use SMB multichannel. Guest CPU is heavily loaded during the iperf run, in proportion to the parallelism (parallelism 4 = 4 loaded CPUs, parallelism 8 = 8 loaded CPUs). Notice how adding more parallelism has diminishing returns.
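If you go the SMB multichannel route, you can confirm it's actually engaged on the Windows side with the standard cmdlet:
PS> Get-SmbMultichannelConnection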
Expected behavior
Send and receive performance should be similar. Maybe not 1:1, but getting <10% of the send performance shows something else is wrong here.
Screenshots
Host:
(8 queues, 16-core machine).
VM:
Additional context
There is a bridge interface on the host, and a tap interface for the Windows guest.
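For context, a minimal sketch of that topology (br0/eth0/tap0 are illustrative names, not the actual interfaces here):
$ ip link add br0 type bridge
$ ip link set br0 up
$ ip link set eth0 master br0   # uplink NIC joins the bridge
$ ip link set tap0 master br0   # the guest's tap joins the same bridge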
I saw this doc: https://github.com/virtio-win/kvm-guest-drivers-windows/wiki/netkvm-RSC-(receive-segment-coalescing)-feature; notice that
tcp-segmentation-offload: off
for the vnet device. Not sure if that's related.