netkvm: receive/transmit performance vastly different #1026

Open
lowjoel opened this issue Jan 15, 2024 · 30 comments

@lowjoel

lowjoel commented Jan 15, 2024

Describe the bug

iperf3 can send ~10gbit/s to the host from the guest on a single connection:

$ iperf3 --time 30 -c HOST
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-30.00  sec  34.3 GBytes  9.83 Gbits/sec  775435825             sender
[  5]   0.00-30.05  sec  34.3 GBytes  9.81 Gbits/sec                  receiver

But it reaches less than 10% of that throughput when receiving from the host:

$ iperf3 --time 30 --reverse -c HOST
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-30.06  sec  2.36 GBytes   675 Mbits/sec    0             sender
[  5]   0.00-30.00  sec  2.36 GBytes   674 Mbits/sec                  receiver
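
To put a number on the gap, here is a minimal Python sketch that pulls the receiver-side bitrate out of iperf3 summary lines like the ones above and compares the two directions. The helper name `receiver_bps` and the regular expression are my own illustration and only cover the plain-text summary format shown here (Kbits/Mbits/Gbits per second).

```python
import re

# Matches an iperf3 summary line such as:
# [  5]   0.00-30.05  sec  34.3 GBytes  9.81 Gbits/sec      receiver
SUMMARY = re.compile(
    r"\[\s*(?P<id>\d+|SUM)\]\s+[\d.]+-[\d.]+\s+sec\s+"
    r"[\d.]+\s+\w+\s+(?P<rate>[\d.]+)\s+(?P<unit>[KMG])bits/sec"
    r".*?(?P<role>sender|receiver)\s*$"
)

UNIT = {"K": 1e3, "M": 1e6, "G": 1e9}


def receiver_bps(output: str) -> float:
    """Return the bitrate (bits/s) of the last receiver summary line."""
    rate = None
    for line in output.splitlines():
        m = SUMMARY.search(line)
        if m and m.group("role") == "receiver":
            rate = float(m.group("rate")) * UNIT[m.group("unit")]
    if rate is None:
        raise ValueError("no receiver summary line found")
    return rate


# Receiver lines copied from the two runs above.
tx = receiver_bps("[  5]   0.00-30.05  sec  34.3 GBytes  9.81 Gbits/sec                  receiver")
rx = receiver_bps("[  5]   0.00-30.00  sec  2.36 GBytes   674 Mbits/sec                  receiver")
print(f"receive is {rx / tx:.1%} of transmit")
```

With the numbers above this comes out at roughly 7%, which is where the "less than 10%" figure comes from.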

Copying a file over the bridge between 2 Windows VMs gives me ~1.6Gbit/s, so that path doesn't have the same issue.

To Reproduce
Steps to reproduce the behaviour:

My Windows iperf3 is on WSL1, so there's no Hyper-V layer in between (but I get to run iperf3). See #1026 (comment) for iperf3 using cygwin (no WSL)

I have got a few workarounds:

  1. Increase bridge and tap MTU to 9000.
$ iperf3 --time 30 --reverse -c HOST
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-30.04  sec  8.13 GBytes  2.32 Gbits/sec    0             sender
[  5]   0.00-30.00  sec  8.12 GBytes  2.33 Gbits/sec                  receiver
  2. Use --parallel for iperf, or use SMB multichannel. The guest CPU is heavily loaded during the iperf run, in proportion to the parallelism (parallelism 4 = 4 loaded CPUs, parallelism 8 = 8 loaded CPUs). Notice how adding more parallelism has diminishing returns.
$ iperf3 --time 30 --reverse --parallel 4 -c HOST
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-30.04  sec  1.66 GBytes   474 Mbits/sec  101             sender
[  5]   0.00-30.00  sec  1.66 GBytes   474 Mbits/sec                  receiver
[  7]   0.00-30.04  sec   907 MBytes   253 Mbits/sec  145             sender
[  7]   0.00-30.00  sec   903 MBytes   252 Mbits/sec                  receiver
[  9]   0.00-30.04  sec  1.37 GBytes   392 Mbits/sec   68             sender
[  9]   0.00-30.00  sec  1.37 GBytes   392 Mbits/sec                  receiver
[ 11]   0.00-30.04  sec  1.40 GBytes   400 Mbits/sec   45             sender
[ 11]   0.00-30.00  sec  1.40 GBytes   399 Mbits/sec                  receiver
[SUM]   0.00-30.04  sec  5.31 GBytes  1.52 Gbits/sec  359             sender
[SUM]   0.00-30.00  sec  5.30 GBytes  1.52 Gbits/sec                  receiver
$ iperf3 --time 30 --reverse --parallel 8 -c HOST
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-30.04  sec  1.51 GBytes   431 Mbits/sec  1449             sender
[  5]   0.00-30.00  sec  1.50 GBytes   431 Mbits/sec                  receiver
[  7]   0.00-30.04  sec   913 MBytes   255 Mbits/sec  982             sender
[  7]   0.00-30.00  sec   909 MBytes   254 Mbits/sec                  receiver
[  9]   0.00-30.04  sec   906 MBytes   253 Mbits/sec  1271             sender
[  9]   0.00-30.00  sec   902 MBytes   252 Mbits/sec                  receiver
[ 11]   0.00-30.04  sec   884 MBytes   247 Mbits/sec  1174             sender
[ 11]   0.00-30.00  sec   880 MBytes   246 Mbits/sec                  receiver
[ 13]   0.00-30.04  sec   974 MBytes   272 Mbits/sec  1462             sender
[ 13]   0.00-30.00  sec   971 MBytes   272 Mbits/sec                  receiver
[ 15]   0.00-30.04  sec   890 MBytes   249 Mbits/sec  1324             sender
[ 15]   0.00-30.00  sec   887 MBytes   248 Mbits/sec                  receiver
[ 17]   0.00-30.04  sec   914 MBytes   255 Mbits/sec  1316             sender
[ 17]   0.00-30.00  sec   910 MBytes   255 Mbits/sec                  receiver
[ 19]   0.00-30.04  sec  1.68 GBytes   481 Mbits/sec  1281             sender
[ 19]   0.00-30.00  sec  1.68 GBytes   480 Mbits/sec                  receiver
[SUM]   0.00-30.04  sec  8.54 GBytes  2.44 Gbits/sec  10259             sender
[SUM]   0.00-30.00  sec  8.51 GBytes  2.44 Gbits/sec                  receiver
  3. Both:
$ iperf3 --time 30 --reverse --parallel 4 -c HOST
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-30.04  sec  5.71 GBytes  1.63 Gbits/sec    0             sender
[  5]   0.00-30.00  sec  5.70 GBytes  1.63 Gbits/sec                  receiver
[  7]   0.00-30.04  sec  5.72 GBytes  1.64 Gbits/sec    0             sender
[  7]   0.00-30.00  sec  5.71 GBytes  1.64 Gbits/sec                  receiver
[  9]   0.00-30.04  sec  6.07 GBytes  1.73 Gbits/sec    0             sender
[  9]   0.00-30.00  sec  6.06 GBytes  1.74 Gbits/sec                  receiver
[ 11]   0.00-30.04  sec  6.07 GBytes  1.74 Gbits/sec    0             sender
[ 11]   0.00-30.00  sec  6.07 GBytes  1.74 Gbits/sec                  receiver
[SUM]   0.00-30.04  sec  23.6 GBytes  6.74 Gbits/sec    0             sender
[SUM]   0.00-30.00  sec  23.5 GBytes  6.74 Gbits/sec                  receiver
$ iperf3 --time 30 --reverse --parallel 8 -c HOST
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-30.02  sec  4.07 GBytes  1.16 Gbits/sec  357             sender
[  5]   0.00-30.00  sec  4.06 GBytes  1.16 Gbits/sec                  receiver
[  7]   0.00-30.02  sec  3.87 GBytes  1.11 Gbits/sec  464             sender
[  7]   0.00-30.00  sec  3.86 GBytes  1.11 Gbits/sec                  receiver
[  9]   0.00-30.02  sec  3.88 GBytes  1.11 Gbits/sec  568             sender
[  9]   0.00-30.00  sec  3.87 GBytes  1.11 Gbits/sec                  receiver
[ 11]   0.00-30.02  sec  4.33 GBytes  1.24 Gbits/sec  578             sender
[ 11]   0.00-30.00  sec  4.33 GBytes  1.24 Gbits/sec                  receiver
[ 13]   0.00-30.02  sec  4.64 GBytes  1.33 Gbits/sec    0             sender
[ 13]   0.00-30.00  sec  4.64 GBytes  1.33 Gbits/sec                  receiver
[ 15]   0.00-30.02  sec  4.65 GBytes  1.33 Gbits/sec    0             sender
[ 15]   0.00-30.00  sec  4.65 GBytes  1.33 Gbits/sec                  receiver
[ 17]   0.00-30.02  sec  4.20 GBytes  1.20 Gbits/sec  426             sender
[ 17]   0.00-30.00  sec  4.20 GBytes  1.20 Gbits/sec                  receiver
[ 19]   0.00-30.02  sec  4.17 GBytes  1.19 Gbits/sec  588             sender
[ 19]   0.00-30.00  sec  4.16 GBytes  1.19 Gbits/sec                  receiver
[SUM]   0.00-30.02  sec  33.8 GBytes  9.67 Gbits/sec  2981             sender
[SUM]   0.00-30.00  sec  33.8 GBytes  9.67 Gbits/sec                  receiver
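
The diminishing returns are easier to see as a ratio. A small Python sketch, assuming the receiver-side totals from the runs above are typed in by hand (the `results` table and the `scaling_efficiency` helper are just for illustration):

```python
# Aggregate receiver bitrates in Gbit/s, copied from the runs above,
# keyed by MTU and then by number of parallel streams.
results = {
    1500: {1: 0.674, 4: 1.52, 8: 2.44},
    9000: {1: 2.33, 4: 6.74, 8: 9.67},
}


def scaling_efficiency(by_streams: dict[int, float]) -> dict[int, float]:
    """Throughput relative to perfect linear scaling of the 1-stream run."""
    base = by_streams[1]
    return {n: rate / (n * base) for n, rate in by_streams.items()}


for mtu, by_streams in results.items():
    for n, eff in scaling_efficiency(by_streams).items():
        print(f"MTU {mtu}, {n} stream(s): {eff:.0%} of linear scaling")
```

A value of 100% would mean each extra stream adds as much throughput as the first one did; anything well below that means the extra loaded CPUs are buying progressively less.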

Expected behavior
Send and receive performance should be similar. Maybe not 1:1, but receiving at less than 10% of the send rate suggests something else is wrong here.

Screenshots
If applicable, add screenshots to help explain your problem.

Host:

  • Distro: Ubuntu 22.04
  • Kernel version: Linux 6.5 (HWE)
  • QEMU version: qemu 6.2
  • QEMU command line
/usr/bin/qemu-system-x86_64 ... \
-accel kvm \
-cpu host,migratable=off,topoext=on,svm=on,invtsc=on,x2apic=on,hv-time=on,hv-relaxed=on,hv-vapic=on,hv-spinlocks=0x1fff,hv-vpindex=on,hv-runtime=on,hv-synic=on,hv-stimer=on,hv-stimer-direct=on,hv-reset=on,hv-vendor-id=1234567890ab,hv-frequencies=on,hv-reenlightenment=on,hv-tlbflush=on,hv-ipi=on,kvm=off,host-cache-info=on,l3-cache=off ... \
-netdev tap,fds=58:62:63:64:65:66:69:70,id=hostnet0,vhost=on,vhostfds=71:72:73:74:75:76:77:78 \
-device virtio-net-pci,tx=bh,ioeventfd=on,event_idx=on,csum=on,gso=on,host_tso4=on,host_tso6=on,host_ecn=on,host_ufo=on,mrg_rxbuf=on,guest_csum=on,guest_tso4=on,guest_tso6=on,guest_ecn=on,guest_ufo=on,mq=on,vectors=18,rx_queue_size=1024,tx_queue_size=1024,netdev=hostnet0,id=net0,mac=<snip>,bus=pci.1,addr=0x0.0x7 ...

(8 queues, 16 core machine).

  • libvirt version 8.0.0
  • libvirt XML file
    <interface type='bridge'>
      <mac address='SNIP'/>
      <source bridge='SNIP'/>
      <model type='virtio'/>
      <driver name='vhost' txmode='iothread' ioeventfd='on' event_idx='on' queues='8' rx_queue_size='1024' tx_queue_size='1024'>
        <host csum='on' gso='on' tso4='on' tso6='on' ecn='on' ufo='on' mrg_rxbuf='on'/>
        <guest csum='on' tso4='on' tso6='on' ecn='on' ufo='on'/>
      </driver>
      <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x7'/>
    </interface>

VM:

  • Windows version: Windows 10 22H2
  • Driver with the problem: netkvm
  • Driver version: 100.93.104.24000

Additional context

There is a bridge interface on the host, and a tap interface for the Windows guest.

I saw this doc: https://github.com/virtio-win/kvm-guest-drivers-windows/wiki/netkvm-RSC-(receive-segment-coalescing)-feature:

$ ethtool -k BRIDGE
Features for BRIDGE:
rx-checksumming: off [fixed]
tx-checksumming: on
        tx-checksum-ipv4: off [fixed]
        tx-checksum-ip-generic: on
        tx-checksum-ipv6: off [fixed]
        tx-checksum-fcoe-crc: off [fixed]
        tx-checksum-sctp: off [fixed]
scatter-gather: on
        tx-scatter-gather: on
        tx-scatter-gather-fraglist: on
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: on
        tx-tcp-mangleid-segmentation: on
        tx-tcp6-segmentation: on
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: off [fixed]
tx-vlan-offload: on
ntuple-filters: off [fixed]
receive-hashing: off [fixed]
highdma: on
rx-vlan-filter: off [fixed]
vlan-challenged: off [fixed]
tx-lockless: on [fixed]
netns-local: on [fixed]
tx-gso-robust: off [requested on]
tx-fcoe-segmentation: off [requested on]
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: on
tx-ipxip6-segmentation: on
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: on
tx-tunnel-remcsum-segmentation: on
tx-sctp-segmentation: off [requested on]
tx-esp-segmentation: on
tx-udp-segmentation: off [requested on]
tx-gso-list: off [requested on]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: on
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off [fixed]
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: off [fixed]
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]
rx-gro-hw: off [fixed]
tls-hw-record: off [fixed]
rx-gro-list: off
macsec-hw-offload: off [fixed]
rx-udp-gro-forwarding: off
hsr-tag-ins-offload: off [fixed]
hsr-tag-rm-offload: off [fixed]
hsr-fwd-offload: off [fixed]
hsr-dup-offload: off [fixed]
$  ethtool -k vnet49
Features for vnet49:
rx-checksumming: off [fixed]
tx-checksumming: on
        tx-checksum-ipv4: off [fixed]
        tx-checksum-ip-generic: on
        tx-checksum-ipv6: off [fixed]
        tx-checksum-fcoe-crc: off [fixed]
        tx-checksum-sctp: off [fixed]
scatter-gather: on
        tx-scatter-gather: on
        tx-scatter-gather-fraglist: on
tcp-segmentation-offload: off
        tx-tcp-segmentation: off
        tx-tcp-ecn-segmentation: off
        tx-tcp-mangleid-segmentation: off
        tx-tcp6-segmentation: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: off [fixed]
tx-vlan-offload: on
ntuple-filters: off [fixed]
receive-hashing: off [fixed]
highdma: off [fixed]
rx-vlan-filter: off [fixed]
vlan-challenged: off [fixed]
tx-lockless: on [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: off [fixed]
tx-gre-csum-segmentation: off [fixed]
tx-ipxip4-segmentation: off [fixed]
tx-ipxip6-segmentation: off [fixed]
tx-udp_tnl-segmentation: off [fixed]
tx-udp_tnl-csum-segmentation: off [fixed]
tx-gso-partial: off [fixed]
tx-tunnel-remcsum-segmentation: off [fixed]
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
tx-udp-segmentation: off
tx-gso-list: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: on
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off [fixed]
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: off [fixed]
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]
rx-gro-hw: off [fixed]
tls-hw-record: off [fixed]
rx-gro-list: off
macsec-hw-offload: off [fixed]
rx-udp-gro-forwarding: off
hsr-tag-ins-offload: off [fixed]
hsr-tag-rm-offload: off [fixed]
hsr-fwd-offload: off [fixed]
hsr-dup-offload: off [fixed]

Notice that tcp-segmentation-offload is off for the vnet device. Not sure if that's related.
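
Spotting that by eye in two long dumps is error-prone, so here is a minimal Python sketch that diffs two `ethtool -k` outputs. It assumes the outputs have been saved as text; `parse_features` and `differing` are hypothetical helper names, and the two samples are shortened excerpts of the dumps above.

```python
def parse_features(text: str) -> dict[str, str]:
    """Map each feature name in `ethtool -k` output to 'on' or 'off'."""
    features = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(": ")
        if not sep:
            continue  # skips blank lines and the "Features for X:" header
        features[name] = value.split()[0]  # drop "[fixed]" / "[requested on]"
    return features


def differing(a: dict[str, str], b: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Features present in both outputs whose on/off state differs."""
    return {k: (a[k], b[k]) for k in sorted(a.keys() & b.keys()) if a[k] != b[k]}


bridge = parse_features("""Features for BRIDGE:
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
generic-segmentation-offload: on
highdma: on
tx-gso-robust: off [requested on]
""")
tap = parse_features("""Features for vnet49:
tcp-segmentation-offload: off
        tx-tcp-segmentation: off
generic-segmentation-offload: on
highdma: off [fixed]
tx-gso-robust: off [fixed]
""")

for name, (on_bridge, on_tap) in differing(bridge, tap).items():
    print(f"{name}: bridge={on_bridge} tap={on_tap}")
```

The sketch deliberately ignores the `[fixed]` and `[requested on]` annotations and only compares the effective on/off state.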

@YanVugenfirer
Collaborator

I suggest first running iperf in the guest without WSL1 as a benchmark.
If there are issues, we will dig in. But if it is a WSL tap issue, we can at best give some advice on where to look.

Best regards,
Yan.

@YanVugenfirer
Collaborator

Another important comment: please run the test with one stream.
WSL1 definitely does not support multi-queue.

@lowjoel
Author

lowjoel commented Jan 15, 2024

Wait, there are Windows binaries for iperf? Haha. I'll try that.

iperf is actually to make it more reproducible. What started this was my copies over Samba being slow from Host to Guest.

@lowjoel
Author

lowjoel commented Jan 15, 2024

OK I've had to flip the server/client (run the iperf client on the host, server in the guest), but the results are the same. I used this binary without WSL: https://iperf.fr/iperf-download.php

MTU=1500

$  iperf3 --time 30 -c GUEST
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-30.00  sec  2.43 GBytes   695 Mbits/sec    0             sender
[  5]   0.00-30.00  sec  2.43 GBytes   695 Mbits/sec                  receiver
$ iperf3 --time 30 --parallel 4 -c GUEST
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-30.00  sec  2.26 GBytes   647 Mbits/sec    0             sender
[  5]   0.00-30.00  sec  2.26 GBytes   646 Mbits/sec                  receiver
[  7]   0.00-30.00  sec  2.18 GBytes   624 Mbits/sec    0             sender
[  7]   0.00-30.00  sec  2.18 GBytes   623 Mbits/sec                  receiver
[  9]   0.00-30.00  sec  2.45 GBytes   700 Mbits/sec    0             sender
[  9]   0.00-30.00  sec  2.44 GBytes   699 Mbits/sec                  receiver
[ 11]   0.00-30.00  sec  2.51 GBytes   719 Mbits/sec    0             sender
[ 11]   0.00-30.00  sec  2.51 GBytes   718 Mbits/sec                  receiver
[SUM]   0.00-30.00  sec  9.40 GBytes  2.69 Gbits/sec    0             sender
[SUM]   0.00-30.00  sec  9.38 GBytes  2.69 Gbits/sec                  receiver
$ iperf3 --time 30 --parallel 8 -c GUEST
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-30.00  sec  1.10 GBytes   315 Mbits/sec    0             sender
[  5]   0.00-30.00  sec  1.10 GBytes   315 Mbits/sec                  receiver
[  7]   0.00-30.00  sec  1.80 GBytes   517 Mbits/sec    0             sender
[  7]   0.00-30.00  sec  1.80 GBytes   516 Mbits/sec                  receiver
[  9]   0.00-30.00  sec  2.07 GBytes   594 Mbits/sec    0             sender
[  9]   0.00-30.00  sec  2.07 GBytes   592 Mbits/sec                  receiver
[ 11]   0.00-30.00  sec  2.06 GBytes   591 Mbits/sec    0             sender
[ 11]   0.00-30.00  sec  2.06 GBytes   590 Mbits/sec                  receiver
[ 13]   0.00-30.00  sec  1.08 GBytes   310 Mbits/sec    0             sender
[ 13]   0.00-30.00  sec  1.08 GBytes   309 Mbits/sec                  receiver
[ 15]   0.00-30.00  sec  1.08 GBytes   309 Mbits/sec    1             sender
[ 15]   0.00-30.00  sec  1.07 GBytes   308 Mbits/sec                  receiver
[ 17]   0.00-30.00  sec  1.10 GBytes   314 Mbits/sec    1             sender
[ 17]   0.00-30.00  sec  1.09 GBytes   313 Mbits/sec                  receiver
[ 19]   0.00-30.00  sec  2.13 GBytes   610 Mbits/sec    0             sender
[ 19]   0.00-30.00  sec  2.13 GBytes   609 Mbits/sec                  receiver
[SUM]   0.00-30.00  sec  12.4 GBytes  3.56 Gbits/sec    2             sender
[SUM]   0.00-30.00  sec  12.4 GBytes  3.55 Gbits/sec                  receiver

I had a kernel panic when first running parallelism=8. Second run was OK. There's still nonlinear scaling.

MTU=9000

$  iperf3 --time 30 -c GUEST
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-30.00  sec  17.2 GBytes  4.94 Gbits/sec    0             sender
[  5]   0.00-30.00  sec  17.2 GBytes  4.94 Gbits/sec                  receiver
$ iperf3 --time 30 --parallel 4 -c GUEST
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-30.00  sec  10.1 GBytes  2.88 Gbits/sec    1             sender
[  5]   0.00-30.00  sec  10.1 GBytes  2.88 Gbits/sec                  receiver
[  7]   0.00-30.00  sec  10.8 GBytes  3.10 Gbits/sec    0             sender
[  7]   0.00-30.00  sec  10.8 GBytes  3.10 Gbits/sec                  receiver
[  9]   0.00-30.00  sec  10.2 GBytes  2.92 Gbits/sec    2             sender
[  9]   0.00-30.00  sec  10.2 GBytes  2.92 Gbits/sec                  receiver
[ 11]   0.00-30.00  sec  10.5 GBytes  3.00 Gbits/sec    2             sender
[ 11]   0.00-30.00  sec  10.5 GBytes  3.00 Gbits/sec                  receiver
[SUM]   0.00-30.00  sec  41.6 GBytes  11.9 Gbits/sec    5             sender
[SUM]   0.00-30.00  sec  41.6 GBytes  11.9 Gbits/sec                  receiver
$ iperf3 --time 30 --parallel 8 -c GUEST
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-30.00  sec  4.81 GBytes  1.38 Gbits/sec    1             sender
[  5]   0.00-30.00  sec  4.81 GBytes  1.38 Gbits/sec                  receiver
[  7]   0.00-30.00  sec  5.47 GBytes  1.57 Gbits/sec    1             sender
[  7]   0.00-30.00  sec  5.47 GBytes  1.57 Gbits/sec                  receiver
[  9]   0.00-30.00  sec  5.40 GBytes  1.55 Gbits/sec    1             sender
[  9]   0.00-30.00  sec  5.40 GBytes  1.55 Gbits/sec                  receiver
[ 11]   0.00-30.00  sec  5.11 GBytes  1.46 Gbits/sec    1             sender
[ 11]   0.00-30.00  sec  5.11 GBytes  1.46 Gbits/sec                  receiver
[ 13]   0.00-30.00  sec  5.33 GBytes  1.53 Gbits/sec    0             sender
[ 13]   0.00-30.00  sec  5.33 GBytes  1.53 Gbits/sec                  receiver
[ 15]   0.00-30.00  sec  5.07 GBytes  1.45 Gbits/sec    2             sender
[ 15]   0.00-30.00  sec  5.07 GBytes  1.45 Gbits/sec                  receiver
[ 17]   0.00-30.00  sec  5.40 GBytes  1.55 Gbits/sec    0             sender
[ 17]   0.00-30.00  sec  5.40 GBytes  1.54 Gbits/sec                  receiver
[ 19]   0.00-30.00  sec  4.81 GBytes  1.38 Gbits/sec    2             sender
[ 19]   0.00-30.00  sec  4.81 GBytes  1.38 Gbits/sec                  receiver
[SUM]   0.00-30.00  sec  41.4 GBytes  11.9 Gbits/sec    8             sender
[SUM]   0.00-30.00  sec  41.4 GBytes  11.9 Gbits/sec                  receiver

The CPU load with parallel iperf3 is the same - parallelism 4 = 4 cpus at 100%, parallelism 8 = 8 cpus at 100% etc.

@YanVugenfirer
Collaborator

Can you please share the crash dump?

@lowjoel
Author

lowjoel commented Jan 15, 2024

Unfortunately I didn't manage to get it :( not even a stack trace.

@YanVugenfirer
Collaborator

large-receive-offload being off is suspicious. Also, RX checksumming looks off.
Can you record a session with Wireshark? Just to check the TCP packet sizes on receive. With functional RSC (receive segment coalescing, which should boost RX performance) we should see 64K packets (or in any case packets larger than the MTU).

@ybendito - ideas?

@lowjoel
Author

lowjoel commented Jan 15, 2024

Confirmed that RSC is not working. Wireshark inside the guest shows 1500/9000-byte packets, depending on the host MTU configuration.
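
To make that check repeatable, here is a rough Python sketch. It assumes the frame lengths have been exported from the capture by hand (for example from Wireshark's length column); `rsc_active` and the 14-byte Ethernet header allowance are illustrative choices, not anything taken from the driver.

```python
def rsc_active(frame_lengths: list[int], mtu: int, l2_overhead: int = 14) -> bool:
    """True if any captured frame is larger than one MTU-sized packet.

    l2_overhead is the assumed Ethernet header size that the capture
    includes on top of the IP packet (14 bytes without a VLAN tag).
    """
    return any(length > mtu + l2_overhead for length in frame_lengths)


# Hypothetical samples: only MTU-sized frames vs. a coalesced ~62K frame.
print(rsc_active([1514, 1514, 66], mtu=1500))
print(rsc_active([62000, 1514, 66], mtu=1500))
```

If the first call's pattern is what the guest capture looks like, nothing is being coalesced on receive; a single frame well above the MTU is enough to show that coalescing happens.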

@lowjoel
Author

lowjoel commented Jan 15, 2024

on host:

$ sudo ethtool -K vnet4 tso on
Actual changes:
tx-tcp-segmentation: off [requested on]
tx-tcp-ecn-segmentation: off [requested on]
tx-tcp-mangleid-segmentation: off [requested on]
tx-tcp6-segmentation: off [requested on]
Could not change any device features
$ ethtool -c vnet4
Coalesce parameters for vnet4:
Adaptive RX: n/a  TX: n/a
stats-block-usecs: n/a
sample-interval: n/a
pkt-rate-low: n/a
pkt-rate-high: n/a

rx-usecs: n/a
rx-frames: 0
rx-usecs-irq: n/a
rx-frames-irq: n/a

tx-usecs: n/a
tx-frames: n/a
tx-usecs-irq: n/a
tx-frames-irq: n/a

rx-usecs-low: n/a
rx-frame-low: n/a
tx-usecs-low: n/a
tx-frame-low: n/a

rx-usecs-high: n/a
rx-frame-high: n/a
tx-usecs-high: n/a
tx-frame-high: n/a

CQE mode RX: n/a  TX: n/a

Is that what you meant?

@YanVugenfirer
Collaborator

tcp-segmentation-offload should be "on" for the tap device.

@lowjoel
Author

lowjoel commented Jan 15, 2024

Yeah, I've tried setting that using ethtool, but it isn't turning on. Where do you suggest I look next?

@YanVugenfirer
Collaborator

Do you see anything in "dmesg"?

@lowjoel
Author

lowjoel commented Jan 15, 2024

Nothing printed there, either.

@ybendito
Collaborator

image

@lowjoel
Author

lowjoel commented Jan 15, 2024

Yep. I think something's wrong with the TSO on this kernel/configuration. But in your case even with TSO off you're still going above 10gbit/s; I'm barely hitting 1gbit/s.

I'm not sure where to look to figure out why TSO isn't turning on though.

@lowjoel
Author

lowjoel commented Jan 16, 2024

I've been digging through the kernel code and the qemu code. I can confirm that tap devices can turn on TSO, just not the ones that are currently in use by the VMs/created by libvirt/qemu. @ybendito could you share your domain and network libvirt XML please? It looks like libvirt and qemu both have a role to play here in setting up the tap device correctly. My bridge is created manually, but I tried a different domain with a network created by libvirt and TSO is still off there.

@ybendito
Collaborator

@lowjoel My results are from plain command-line qemu, no libvirt, just -netdev tap,vhost=on,id=..,script= in the command line. Fedora 28, qemu ~6.1, kernel 5.12

@lowjoel
Author

lowjoel commented Jan 16, 2024

Could you paste that here so I can try it as a minimal reproducer, please? Including how the tap is created, just in case I'm missing something.

@ybendito
Collaborator

@lowjoel Enjoy )
sudo /home/yurib/src/qemu/build/qemu-system-x86_64 -machine q35,accel=kvm --snapshot --trace events=/home/yurib/qemu-events-en -cpu SandyBridge,+kvm_pv_unhalt,hv_spinlocks=0x1fff,hv_relaxed,hv_vapic,hv_time -m 8192 -smp 4 -uuid 1534fa42-4818-4493-9f67-eee5ba758385 -no-user-config -nodefaults -no-hpet -monitor stdio -device ioh3420,bus=pcie.0,id=root0,chassis=1,addr=0xa.0 -device ioh3420,bus=pcie.0,id=root1,chassis=2,addr=0xb.0 -device ioh3420,bus=pcie.0,id=root2,chassis=3,addr=0xc.0 -device ioh3420,bus=pcie.0,id=root3,chassis=4,addr=0xd.0 -device ioh3420,bus=pcie.0,id=root4,chassis=5,addr=0xe.0 -device ioh3420,bus=pcie.0,id=root5,chassis=6,addr=0xf.0 -global ICH9-LPC.disable_s3=0 -global ICH9-LPC.disable_s4=1 -device ahci,id=ahci -device virtio-serial-pci,bus=root1,id=virtio-serial0,max_ports=4,iommu_platform=on,ats=on -chardev spicevmc,name=vdagent,id=vdagent -device virtserialport,nr=2,bus=virtio-serial0.0,chardev=vdagent,name=com.redhat.spice.0 -chardev socket,id=serialp2,host=0.0.0.0,port=50000,server=on,wait=no -device virtserialport,nr=1,bus=virtio-serial0.0,chardev=serialp2,name=test.0 -netdev tap,id=hostnet10sb,script=/home/yurib/br0-ifup,ifname=nw10sb,vhost=on -device virtio-net-pci,netdev=hostnet10sb,mac=04:54:13:05:10:38,bus=root0,id=poc2,rss=on -device virtio-balloon-pci,bus=root4,iommu_platform=on,ats=on -drive file=/images/vms/2019-q35-usb.qcow2,if=none,id=drive-ide-3,media=disk,format=qcow2,cache=unsafe -device ide-hd,drive=drive-ide-3,id=ide3,bus=ahci.0,bootindex=0 -drive file=/images/iso/ubuntu-18.04.1-desktop-amd64.iso,if=none,id=drive-cd,media=cdrom,format=raw -device qemu-xhci,p2=8,p3=8 -device usb-tablet -device usb-storage,drive=drive-cd,id=xx3,bootindex=1 -vga std -vnc :1 -chardev spicevmc,name=usbredir,id=usbredirchardev1 -device usb-redir,filter=-1:-1:-1:-1:1,chardev=usbredirchardev1,id=usbredirdev1 -chardev spicevmc,name=usbredir,id=usbredirchardev2 -device usb-redir,filter=-1:-1:-1:-1:1,chardev=usbredirchardev2,id=usbredirdev2 -chardev spicevmc,name=usbredir,id=usbredirchardev3 -device usb-redir,filter=-1:-1:-1:-1:1,chardev=usbredirchardev3,id=usbredirdev3 -boot menu=on

@lowjoel
Author

lowjoel commented Jan 16, 2024

@ybendito And how was the tap created? On my fresh Ubuntu install, both ip tuntap add mode tap pi vnet_hdr and tunctl create tap devices that I can't enable TSO on. I'm guessing you aren't seeing that same behaviour on your Fedora machine?

@ybendito
Collaborator

ybendito commented Jan 16, 2024

@lowjoel qemu creates the tap (it runs as an admin). Once the tap is created, qemu runs the script defined in script=/home/yurib/br0-ifup. The script is:

switch=virbr0
ifconfig $1 promisc 0.0.0.0
brctl addif ${switch} $1

virbr0 is the libvirt bridge (so the device is behind local NAT)

Let's see what happens under libvirt:
image
The RSC works.

What does PowerShell on the guest say?
image

@lowjoel
Author

lowjoel commented Jan 17, 2024

Bingo. It wasn't the host side, it's the guest side.

Initially:

> get-netadapterrsc | format-list


Name                 : Ethernet
InterfaceDescription : Red Hat VirtIO Ethernet Adapter
IPv4Enabled          : True
IPv6Enabled          : True
IPv4Supported        : True
IPv6Supported        : True
IPv4OperationalState : False
IPv6OperationalState : False
IPv4FailureReason    : WFPCompatibility
IPv6FailureReason    : WFPCompatibility

WFP is the Windows Filtering Platform. I guess it's the firewall. Disabled the firewall:

> get-netadapterrsc | format-list


Name                 : Ethernet
InterfaceDescription : Red Hat VirtIO Ethernet Adapter
IPv4Enabled          : True
IPv6Enabled          : True
IPv4Supported        : True
IPv6Supported        : True
IPv4OperationalState : True
IPv6OperationalState : True
IPv4FailureReason    : NoFailure
IPv6FailureReason    : NoFailure

Wireshark now shows ~62k sized packets.

- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-30.00  sec  23.6 GBytes  6.77 Gbits/sec    4             sender
[  5]   0.00-30.00  sec  23.6 GBytes  6.77 Gbits/sec                  receiver

It's not 1:1 with sending, but 30% less than sending. I'll take it.

Incidentally, after disabling the firewall:

Features for vnet4:
rx-checksumming: off [fixed]
tx-checksumming: on
        tx-checksum-ipv4: off [fixed]
        tx-checksum-ip-generic: on
        tx-checksum-ipv6: off [fixed]
        tx-checksum-fcoe-crc: off [fixed]
        tx-checksum-sctp: off [fixed]
scatter-gather: on
        tx-scatter-gather: on
        tx-scatter-gather-fraglist: on
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: off
        tx-tcp-mangleid-segmentation: on
        tx-tcp6-segmentation: on

I didn't expect that the guest can affect the host in this way. Can I help to update the wiki/docs as a form of expressing my thanks? 😄 I don't have permissions though. I will also reach out to the firewall vendor to ask.

@ybendito
Collaborator

ybendito commented Jan 17, 2024

I didn't expect that the guest can affect the host in this way

The guest is the one that requests enabling/disabling these options on the host tap. If the driver started with RSC enabled, it can dynamically turn it on/off (QEMU configures the tap accordingly), and if the guest turned RSC on, we can turn it off/on in the tap. But if the guest started the device with RSC disabled, this means that the OS is not ready to receive coalesced packets (packet size > MTU); in this case you can't turn it on in the tap. Fortunately.

@lowjoel
Author

lowjoel commented Jan 17, 2024

Incidentally, for those who are seeing this:

> get-netadapterrsc | format-list


Name                 : Ethernet
InterfaceDescription : Red Hat VirtIO Ethernet Adapter
IPv4Enabled          : True
IPv6Enabled          : True
IPv4Supported        : False
IPv6Supported        : False
IPv4OperationalState : False
IPv6OperationalState : False
IPv4FailureReason    : Capability
IPv6FailureReason    : Capability

It means that your libvirt/qemu command line is not enabling any of the offloads. Try the following under the interface definition (for qemu-kvm):

<driver name="vhost" txmode="iothread" ioeventfd="on" event_idx="on" queues="4" rx_queue_size="1024" tx_queue_size="1024">
    <host csum="on" gso="on" tso4="on" tso6="on" ecn="on" ufo="on" mrg_rxbuf="on"/>
    <guest csum="on" tso4="on" tso6="on" ecn="on" ufo="on"/>
</driver>
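
For anyone scripting this check, here is a minimal Python sketch that reads the `Get-NetAdapterRsc | Format-List` text and maps the failure reason to the causes found in this thread. `parse_format_list`, `diagnose` and the `HINTS` table are hypothetical names, and the table only covers the three values seen above; other values are reported as unrecognized.

```python
def parse_format_list(text: str) -> dict[str, str]:
    """Parse 'Key : Value' lines as printed by PowerShell's Format-List."""
    record = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            record[key.strip()] = value.strip()
    return record


# Interpretations taken from this thread only.
HINTS = {
    "NoFailure": "RSC is operational",
    "WFPCompatibility": "a Windows Filtering Platform filter (e.g. a firewall) is blocking RSC",
    "Capability": "the host is not offering the offloads; check the libvirt/qemu settings",
}


def diagnose(text: str, family: str = "IPv4") -> str:
    """Return a hint for the given address family ('IPv4' or 'IPv6')."""
    reason = parse_format_list(text).get(f"{family}FailureReason", "")
    return HINTS.get(reason, f"unrecognized failure reason: {reason!r}")


sample = """Name                 : Ethernet
InterfaceDescription : Red Hat VirtIO Ethernet Adapter
IPv4OperationalState : False
IPv4FailureReason    : WFPCompatibility
"""
print(diagnose(sample))
```

The output can be captured on the guest and fed in as plain text; nothing here talks to Windows directly.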

@lowjoel
Author

lowjoel commented Jan 17, 2024

As promised @YanVugenfirer @ybendito I've updated the wiki with the knowledge from this thread: https://github.com/lowjoel/kvm-guest-drivers-windows-wiki/compare/netkvm-rsc-docs

Please feel free to integrate the updated docs into the wiki. And also feel free to close this issue since the problem is not with the netkvm driver. Thank you all once again for helping me!

@YanVugenfirer
Collaborator

@lowjoel Thanks for the Wiki update!

Just for the statistics - can you tell us why you are testing performance and how you are using the Virtio drivers?

@lowjoel
Author

lowjoel commented Jan 30, 2024

No problem. I have a workstation/server all-in-one setup at home. I use a Windows guest since I'm mostly familiar with it, but on the server side at $DAYJOB I'm more familiar with the Linux stack. The server's just a file server, and I have shares across the host/guest which is why I ran into this specific problem.

I was testing performance because that specific share had my photos on it and transferring them for editing/publishing was unbearably slow 😅

@YanVugenfirer
Collaborator

Thanks!

@heywji

heywji commented Dec 20, 2024

Hi Guys,

I'm trying to reproduce this situation on our internal host, with the host in the server role and the guest in the client role. (BTW: did I mix up the server and client roles? Either way, I want to share the results here so we can discuss them further.)
But I don't see the host being affected by the guest (e.g. TSO is always on on the host side, whether or not the guest turns off the firewall), and my test logs don't show a vast difference between receive and transmit performance.

Packages:

  • virtio-win-prewhql-0.1-269
  • kernel-5.14.0-539.el9.x86_64
  • edk2-ovmf-20240524-7.el9.noarch
  • swtpm-0.8.0-2.el9_4.x86_64
  • qemu-kvm-core-9.1.0-5.el9.x86_64
  • Guest OS: Win10 Enterprise Build 19045

Reproduce Steps:

  1. Run iperf -s on the host side.
# iperf -s
  2. Set and check the TSO on the host side.
# ethtool -k tap0 | head -n20
Features for tap0:
rx-checksumming: off [fixed]
tx-checksumming: on
	tx-checksum-ipv4: off [fixed]
	tx-checksum-ip-generic: on
	tx-checksum-ipv6: off [fixed]
	tx-checksum-fcoe-crc: off [fixed]
	tx-checksum-sctp: off [fixed]
scatter-gather: on
	tx-scatter-gather: on
	tx-scatter-gather-fraglist: on
tcp-segmentation-offload: on
	tx-tcp-segmentation: on
	tx-tcp-ecn-segmentation: off [requested on]
	tx-tcp-mangleid-segmentation: on
	tx-tcp6-segmentation: on
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: off [fixed]
# ethtool -k virbr0 | head -n20
Features for virbr0:
rx-checksumming: off [fixed]
tx-checksumming: on
	tx-checksum-ipv4: off [fixed]
	tx-checksum-ip-generic: on
	tx-checksum-ipv6: off [fixed]
	tx-checksum-fcoe-crc: off [fixed]
	tx-checksum-sctp: off [fixed]
scatter-gather: on
	tx-scatter-gather: on
	tx-scatter-gather-fraglist: on
tcp-segmentation-offload: on
	tx-tcp-segmentation: on
	tx-tcp-ecn-segmentation: on
	tx-tcp-mangleid-segmentation: on
	tx-tcp6-segmentation: on
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: off [fixed]
  3. Turn on the Windows firewall on the guest side.
C:\> netsh advfirewall set allprofiles state on
  4. Do an iperf Net-Stream test on the guest side.
C:\>  C:\Users\administrator\Desktop\iperf-2.1.6-win.exe  -c %ip% -t 10
  5. Do an iperf Net-Stream test on the guest side with "--reverse".
C:\>  C:\Users\administrator\Desktop\iperf-2.1.6-win.exe  -c %ip% -t 10 --reverse
  6. Check the NetAdapterRsc status with PowerShell.
C:\> Get-NetAdapterRsc | Format-List
Name                 : Ethernet
InterfaceDescription : Red Hat VirtIO Ethernet Adapter #2
IPv4Enabled          : True
IPv6Enabled          : True
IPv4Supported        : True
IPv6Supported        : True
IPv4OperationalState : True
IPv6OperationalState : True
IPv4FailureReason    : NoFailure
IPv6FailureReason    : NoFailure
  7. Repeat the iperf and NetAdapterRsc commands above with the Windows firewall turned off.
C:\> netsh advfirewall set allprofiles state off
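
The TSO check in the ethtool step above can also be scripted rather than read by eye. The following is only a minimal sketch: the function name `parse_ethtool_features` is hypothetical, and the sample text is a shortened copy of the tap0 output shown above. It assumes the `name: on|off [fixed]` line layout seen in that output.

```python
def parse_ethtool_features(text):
    """Parse `ethtool -k <dev>` output into {feature: (enabled, fixed)}."""
    features = {}
    for line in text.splitlines():
        name, sep, rest = line.strip().partition(":")
        if not sep:
            continue
        tokens = rest.split()
        # The "Features for tap0:" header has nothing after the colon, so it is skipped here.
        if not tokens or tokens[0] not in ("on", "off"):
            continue
        features[name] = (tokens[0] == "on", "[fixed]" in rest)
    return features


# Shortened copy of the tap0 output from the steps above.
SAMPLE = """Features for tap0:
rx-checksumming: off [fixed]
tx-checksumming: on
tcp-segmentation-offload: on
	tx-tcp-segmentation: on
	tx-tcp-ecn-segmentation: off [requested on]
generic-receive-offload: on
large-receive-offload: off [fixed]
"""

features = parse_ethtool_features(SAMPLE)
tso_enabled, tso_fixed = features["tcp-segmentation-offload"]
print("TSO enabled:", tso_enabled, "fixed:", tso_fixed)
```

On a real host the text would come from running `ethtool -k tap0` before and after each guest-side change, so that any flip in `tcp-segmentation-offload` shows up as a difference between the two dictionaries.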

I wrote the BAT script to show the detailed steps.

@echo off
setlocal enabledelayedexpansion

:: Configuration
set ip=192.168.122.1
set logdir=iperf_logs
set num_tests=100
set iperf_tool=C:\Users\administrator\Desktop\iperf-2.1.6-win.exe

:: Create log folder
if not exist %logdir% mkdir %logdir%

:: Run tests
for /L %%i in (1,1,%num_tests%) do (
    echo Running test %%i...

    :: Enable firewall and test iperf --reverse
    echo Enabling firewall and testing iperf --reverse >> %logdir%\test_%%i.txt
    netsh advfirewall set allprofiles state on
    %iperf_tool% -c %ip% -t 10 --reverse >> %logdir%\test_%%i.txt
    powershell -c "Get-NetAdapterRsc | Format-List | Out-File -FilePath %logdir%\test_%%i.txt -Append -Encoding ascii"

    :: Enable firewall and test iperf
    echo Enabling firewall and testing iperf >> %logdir%\test_%%i.txt
    %iperf_tool% -c %ip% -t 10 >> %logdir%\test_%%i.txt
    powershell -c "Get-NetAdapterRsc | Format-List | Out-File -FilePath %logdir%\test_%%i.txt -Append -Encoding ascii"

    :: Disable firewall and test iperf --reverse
    echo Disabling firewall and testing iperf --reverse >> %logdir%\test_%%i.txt
    netsh advfirewall set allprofiles state off
    %iperf_tool% -c %ip% -t 10 --reverse >> %logdir%\test_%%i.txt
    powershell -c "Get-NetAdapterRsc | Format-List | Out-File -FilePath %logdir%\test_%%i.txt -Append -Encoding ascii"

    :: Disable firewall and test iperf
    echo Disabling firewall and testing iperf >> %logdir%\test_%%i.txt
    %iperf_tool% -c %ip% -t 10 >> %logdir%\test_%%i.txt
    powershell -c "Get-NetAdapterRsc | Format-List | Out-File -FilePath %logdir%\test_%%i.txt -Append -Encoding ascii"

    :: Add separator
    echo ============================== >> %logdir%\test_%%i.txt
)

echo Testing completed. All logs have been saved in the %logdir% folder.
pause
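
With 100 log files, comparing the two directions by hand is tedious. Below is a rough sketch of how the logs written by the script above could be summarized. It relies on the marker lines the script echoes ("Enabling firewall and testing iperf --reverse" and so on) and assumes that each iperf run prints a summary line ending in `<number> [K|M|G]bits/sec`. The function name `summarize` is hypothetical, and the numbers in the sample are placeholders, not measurements.

```python
import re

# Marker lines echoed by the BAT script before each iperf run.
MARKER = re.compile(r"^(Enabling|Disabling) firewall and testing iperf( --reverse)?$")
# Assumed shape of an iperf bandwidth figure, e.g. "860 Mbits/sec".
RATE = re.compile(r"([\d.]+)\s+([KMG]?)bits/sec")
SCALE = {"": 1e-6, "K": 1e-3, "M": 1.0, "G": 1e3}  # convert to Mbit/s


def summarize(log_text):
    """Return {(firewall, direction): [Mbit/s, ...]} for one log file."""
    results = {}
    current = None
    for line in log_text.splitlines():
        marker = MARKER.match(line.strip())
        if marker:
            firewall = "on" if marker.group(1) == "Enabling" else "off"
            direction = "reverse" if marker.group(2) else "forward"
            current = (firewall, direction)
            continue
        rate = RATE.search(line)
        if rate and current is not None:
            mbits = float(rate.group(1)) * SCALE[rate.group(2)]
            results.setdefault(current, []).append(mbits)
    return results


# Placeholder log content; the bandwidth values are made up for illustration.
SAMPLE_LOG = """Enabling firewall and testing iperf --reverse 
[  1] 0.00-10.00 sec  0.12 GBytes   100 Mbits/sec
Enabling firewall and testing iperf 
[  1] 0.00-10.00 sec  1.16 GBytes  1.00 Gbits/sec
"""

summary = summarize(SAMPLE_LOG)
for key, rates in sorted(summary.items()):
    print(key, rates)
```

Running `summarize` over every `test_N.txt` and averaging per key would give one forward and one reverse figure for each firewall state, which is the comparison this issue is about.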

QEMU cmdline:

# cat /home/wji/firewall.sh
/usr/libexec/qemu-kvm \
	-name 'avocado-vt-vm1'  \
	-sandbox on,elevateprivileges=deny,obsolete=deny,resourcecontrol=deny \
	-blockdev '{"node-name": "file_ovmf_code", "driver": "file", "filename": "/usr/share/OVMF/OVMF_CODE.secboot.fd", "auto-read-only": true, "discard": "unmap"}' \
	-blockdev '{"node-name": "drive_ovmf_code", "driver": "raw", "read-only": true, "file": "file_ovmf_code"}' \
	-blockdev '{"node-name": "file_ovmf_vars", "driver": "file", "filename": "/root/avocado/data/avocado-vt/avocado-vt-vm1_win10-64-virtio-scsi-ovmf_qcow2_filesystem_VARS.raw", "auto-read-only": true, "discard": "unmap"}' \
	-blockdev '{"node-name": "drive_ovmf_vars", "driver": "raw", "read-only": false, "file": "file_ovmf_vars"}' \
	-machine q35,pflash0=drive_ovmf_code,pflash1=drive_ovmf_vars,memory-backend=mem-machine_mem \
	-device '{"id": "pcie-root-port-0", "driver": "pcie-root-port", "multifunction": true, "bus": "pcie.0", "addr": "0x1", "chassis": 1}' \
	-device '{"id": "pcie-pci-bridge-0", "driver": "pcie-pci-bridge", "addr": "0x0", "bus": "pcie-root-port-0"}'  \
	-nodefaults \
	-device '{"driver": "VGA", "bus": "pcie.0", "addr": "0x2"}' \
	-m 14336 \
	-object '{"size": 15032385536, "id": "mem-machine_mem", "qom-type": "memory-backend-ram"}'  \
	-smp 16,maxcpus=32,cores=16,threads=1,dies=1,sockets=2  \
	-cpu 'EPYC-Milan',x2apic=on,tsc-deadline=on,hypervisor=on,tsc-adjust=on,vaes=on,vpclmulqdq=on,spec-ctrl=on,stibp=on,arch-capabilities=on,ssbd=on,cmp-legacy=on,overflow-recov=on,succor=on,stibp-always-on=on,virt-ssbd=on,amd-psfd=on,lbrv=on,tsc-scale=on,vmcb-clean=on,flushbyasid=on,pause-filter=on,pfthreshold=on,v-vmsave-vmload=on,vgif=on,no-nested-data-bp=on,lfence-always-serializing=on,null-sel-clr-base=on,rdctl-no=on,skip-l1dfl-vmentry=on,mds-no=on,pschange-mc-no=on,gds-no=on,rfds-no=on,erms=off,fsrm=off,hv_stimer,hv_synic,hv_vpindex,hv_relaxed,hv_spinlocks=0x1fff,hv_vapic,hv_time,hv_frequencies,hv_runtime,hv_tlbflush,hv_reenlightenment,hv_stimer_direct,hv_ipi,hv-xmm-input,hv_tlbflush_ext,kvm_pv_unhalt=on \
	-device '{"ioport": 1285, "driver": "pvpanic", "id": "idmIt3Xu"}' \
	-device '{"id": "pcie-root-port-1", "port": 1, "driver": "pcie-root-port", "addr": "0x1.0x1", "bus": "pcie.0", "chassis": 2}' \
	-device '{"driver": "qemu-xhci", "id": "usb1", "bus": "pcie-root-port-1", "addr": "0x0"}' \
	-device '{"driver": "usb-tablet", "id": "usb-tablet1", "bus": "usb1.0", "port": "1"}' \
	-device '{"id": "pcie-root-port-2", "port": 2, "driver": "pcie-root-port", "addr": "0x1.0x2", "bus": "pcie.0", "chassis": 3}' \
	-device '{"id": "virtio_scsi_pci0", "driver": "virtio-scsi-pci", "bus": "pcie-root-port-2", "addr": "0x0"}' \
	-blockdev '{"node-name": "file_image1", "driver": "file", "auto-read-only": true, "discard": "unmap", "aio": "threads", "filename": "/home/kvm_autotest_root/images/win10-64-virtio-scsi-ovmf.qcow2", "cache": {"direct": true, "no-flush": false}}' \
	-blockdev '{"node-name": "drive_image1", "driver": "qcow2", "read-only": false, "cache": {"direct": true, "no-flush": false}, "file": "file_image1"}' \
	-device '{"driver": "scsi-hd", "id": "image1", "drive": "drive_image1", "write-cache": "on"}' \
	-device '{"id": "pcie-root-port-3", "port": 3, "driver": "pcie-root-port", "addr": "0x1.0x3", "bus": "pcie.0", "chassis": 4}' \
	-device virtio-net-pci,tx=bh,ioeventfd=on,event_idx=on,csum=on,gso=on,host_tso4=on,host_tso6=on,host_ecn=on,host_ufo=on,mrg_rxbuf=on,guest_csum=on,guest_tso4=on,guest_tso6=on,guest_ecn=on,guest_ufo=on,mq=on,vectors=18,rx_queue_size=1024,tx_queue_size=256,netdev=idCJE9Sq,mac=9a:19:f2:c3:bd:02,bus=pcie.0,addr=0x9,id=ideIaQ30 \
	-netdev  '{"id": "idCJE9Sq", "type": "tap", "vhost": true, "queues": 8}' \
	-blockdev '{"node-name": "drive_cd1", "driver": "file", "read-only": true, "discard": "unmap", "aio": "threads", "filename": "/home/kvm_autotest_root/iso/windows/winutils.iso", "cache": {"direct": true, "no-flush": false}}' \
	-device '{"driver": "scsi-cd", "id": "cd1", "drive": "drive_cd1", "write-cache": "on"}'  \
	-vnc 0.0.0.0:16  \
	-rtc base=localtime,clock=host,driftfix=slew  \
	-boot menu=off,order=cdn,once=c,strict=off \
	-chardev socket,id=char_vtpm_tpm0,path=/tmp/guest-swtpm16.sock \
	-tpmdev emulator,chardev=char_vtpm_tpm0,id=emulator_vtpm_tpm0  \
	-device tpm-crb,id=tpm-crb_vtpm_tpm0,tpmdev=emulator_vtpm_tpm0 \
	-enable-kvm \
	-device '{"id": "pcie_extra_root_port_0", "driver": "pcie-root-port", "multifunction": true, "bus": "pcie.0", "addr": "0x3", "chassis": 5}'
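
For anyone comparing this command line with their own setup, the multiqueue-related options are easy to lose in the long `-device virtio-net-pci` argument. The sketch below splits a shortened copy of that argument into a dictionary and checks it against the commonly cited rule of thumb `vectors = 2 * queues + 2`. That rule is an assumption taken from general KVM multiqueue guidance, not something stated in this thread, and the helper name `parse_device_props` is hypothetical.

```python
def parse_device_props(arg):
    """Split a 'driver,key=value,...' -device argument into (driver, props)."""
    driver, *pairs = arg.split(",")
    props = dict(pair.split("=", 1) for pair in pairs)
    return driver, props


# Shortened copy of the virtio-net-pci argument from the command line above.
DEVICE_ARG = (
    "virtio-net-pci,tx=bh,host_tso4=on,guest_tso4=on,mrg_rxbuf=on,"
    "mq=on,vectors=18,rx_queue_size=1024,tx_queue_size=256"
)
QUEUES = 8  # from the -netdev argument: "queues": 8

driver, props = parse_device_props(DEVICE_ARG)
expected_vectors = 2 * QUEUES + 2  # assumed rule of thumb, see lead-in
vectors_ok = int(props["vectors"]) == expected_vectors
print(driver, "mq =", props["mq"], "vectors ok:", vectors_ok)
```

With the values from the command line above (8 queues, 18 vectors) the check passes, so the vector count is unlikely to be what limits the number of usable queues in this setup.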

I ran the script 100 times. The results are attached:
WFP.zip
I may have made a mistake somewhere in this setup. Please correct me if I am wrong.
