P999 may not be accurate if Never Send is high. #15

Open
neolinsu opened this issue Mar 28, 2023 · 18 comments

@neolinsu commented Mar 28, 2023

Hi all,

I find that the synthetic load generator in Caladan suffers a high Never-Send rate (above 1%) when clients issue requests at a relatively high rate that is close to the server's capacity.
This is especially problematic under a Poisson distribution: when two adjacent requests are generated within a short time window (i.e., a bursty period), the latter one is more likely to be dropped by the Never-Send logic (see code). We have profiled Caladan's client logic and found that scheduling often delays a request (which already violates the Poisson distribution) until it is finally dropped.
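
To make the mechanism concrete, here is a minimal hypothetical sketch (in Rust, like the synthetic client) of this kind of cutoff. The constant, the types, and the 10 us / 15 us numbers are ours for illustration only; the real check and threshold live in the linked Caladan code.

```rust
use std::time::{Duration, Instant};

// Hypothetical cutoff: if the worker is already this far behind a request's
// scheduled send time, the request is counted as "never sent" instead of
// being sent late.
const NEVER_SEND_CUTOFF: Duration = Duration::from_micros(10);

struct Request {
    scheduled: Instant, // when the open-loop schedule wanted this request on the wire
}

fn try_send(req: &Request, now: Instant) -> bool {
    if now.saturating_duration_since(req.scheduled) > NEVER_SEND_CUTOFF {
        // The worker fell behind (e.g. it was descheduled or busy); dropping
        // the request here discards exactly the closely spaced (bursty)
        // arrivals that a Poisson process is supposed to contain.
        false // "never sent"
    } else {
        // ...the actual transmission would happen here...
        true // sent
    }
}

fn main() {
    let start = Instant::now();
    // A small burst: two requests scheduled only 2 us apart.
    let reqs = [
        Request { scheduled: start },
        Request { scheduled: start + Duration::from_micros(2) },
    ];
    // The worker reaches the first request on time, but sending it (plus a
    // softirq poll, a yield, ...) costs ~15 us before it reaches the second,
    // so the second request falls past the cutoff and is dropped.
    let mut now = start;
    for (i, req) in reqs.iter().enumerate() {
        println!("request {}: sent = {}", i, try_send(req, now));
        now += Duration::from_micros(15);
    }
}
```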

We further designed an experiment to confirm this. We modified the Caladan client to disable the scheduling policy: workers are bound to different cores and execute send, do_softirq (directpath), handle_timeout, and recv in a cycle without yielding (see the sketch after the table below).
We equip the Caladan server with 4 kthreads and launch 16 client workers (each owning one TCP connection) that generate requests with a Poisson distribution, varying the request rate; each run lasts 32 seconds. The following table shows the results:

| Client Type | Throughput (pps) | P50 (us) | P90 (us) | P999 (us) | Never Send |
| --- | --- | --- | --- | --- | --- |
| synthetic | 0.75M | 13.4 | 23.1 | 40.0 | 1.39% |
| client w/o sched | 0.75M | 9.809091 | 18.500909 | 50.903636 | 0.000442% |
| synthetic | 0.8M | 13.6 | 22.6 | 37.9 | 1.4486% |
| client w/o sched | 0.8M | 9.49 | 16.81 | 584.52 | 0.000430% |
| synthetic | 1M | 13.6 | 21.9 | 38.6 | 1.6726% |
| client w/o sched | 1M | 9.64 | 17.59 | 2841.83 | 0.000694% |
| synthetic | 1.1M | 13.5 | 21.1 | 55.5 | 1.7345% |
| client w/o sched | 1.1M | 10.6 | 21.9 | 5177.75 | 0.000781% |
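
As referenced above, the modified client's per-core loop is, in spirit, the following sketch. This is only our illustration of the structure, not the actual patch: the four stage functions are empty stubs standing in for the real send / do_softirq / handle_timeout / recv work, and core pinning (e.g. via sched_setaffinity or taskset) is omitted.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::{thread, time::Duration};

static DONE: AtomicBool = AtomicBool::new(false);

// Stubs standing in for the real per-connection work in the modified client.
fn send_due_requests() { /* transmit requests whose scheduled time has arrived */ }
fn do_softirq() { /* poll the directpath queues for received packets */ }
fn handle_timeouts() { /* expire requests that waited too long for a reply */ }
fn recv_responses() { /* record latencies for completed requests */ }

// One worker per core (pinning is done externally and omitted here).
// The loop never yields: the four stages run back to back in a busy cycle,
// so the runtime's scheduler never gets a chance to delay the next send.
fn worker_loop() {
    while !DONE.load(Ordering::Relaxed) {
        send_due_requests();
        do_softirq();
        handle_timeouts();
        recv_responses();
    }
}

fn main() {
    // A single worker is shown for brevity; the experiment ran 16 of them,
    // one per core and per TCP connection.
    let worker = thread::spawn(worker_loop);
    thread::sleep(Duration::from_millis(10));
    DONE.store(true, Ordering::Relaxed);
    worker.join().unwrap();
}
```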
@joshuafried (Member)

Thanks! Indeed, with insufficient resources the client itself can become the bottleneck. We typically run the load generator with spinning kthreads (see here) and many cores. When one client is insufficient to generate load, we typically use multiple machines. What are the details of your machine, and what configuration are you using for your client?

neolinsu reopened this Mar 28, 2023
@neolinsu (Author)

The CPU is an Intel Xeon 2.20GHz with 20 hyper-threads (10 physical cores), all set to performance mode. The network is 100Gb RDMA.
The configuration I use for both clients is the same (some machine-specific fields are replaced with X):

host_addr 10.100.100.103
host_netmask 255.255.255.0
host_gateway 10.100.100.1
runtime_kthreads 16
runtime_guaranteed_kthreads 16
runtime_spinning_kthreads 16
host_mac X
disable_watchdog true
runtime_qdelay_us 10
runtime_priority lc
static_arp 10.100.100.102 X
static_arp 10.100.100.103 X
enable_directpath fs
directpath_pci X

I also notice that even when the client runs at a low throughput (like 0.75M), where resources should be sufficient, the Never Send rate is still above 1%.

@joshuafried (Member)

Can you post the output of a client here (and the parameters used to launch it)? Looking at some recent runs I see that even at 1MPPS my never sent rate is < .1%

@neolinsu (Author)

Here is an example run targeting 0.8M throughput.

synthetic --config synthetic.config 10.100.100.102:5190  --output=buckets --protocol memcached --mode runtime-client --threads 16 --runtime 32 --barrier-peers 1 --barrier-leader node151 --distribution=exponential --mpps=0.8 --samples=1 --transport tcp --nvalues=3200000

And synthetic's result is:

Distribution, Target, Actual, Dropped, Never Sent, Median, 90th, 99th, 99.9th, 99.99th, Start
exponential, 788411, 788411, 0, 326090, 13.6, 22.6, 33.2, 37.3, 37.9, 0, 8510673237596225

@joshuafried (Member)

Hm, that is quite high. Can you post a log with many samples at lower loads (change the above command to --samples 20)? Also, can you try reducing the number of kthreads to 8 and see if that has any impact?

@neolinsu (Author)

> Can you post the output of a client here (and the parameters used to launch it)? Looking at some recent runs I see that even at 1MPPS my never sent rate is < .1%

I run the server with:

runtime_kthreads 4
runtime_guaranteed_kthreads 0
runtime_spinning_kthreads 0

It makes the cores mwait.

Would you please share your server configuration?

@joshuafried (Member)

The server had 20 kthreads (20 guaranteed, 0 spinning). Does varying the server configuration impact the client behavior here?

@neolinsu (Author)

> The server had 20 kthreads (20 guaranteed, 0 spinning). Does varying the server configuration impact the client behavior here?

Yes. I think 20 kthreads can handle 1M pps.

You can try my configuration.

@neolinsu (Author)

The point here is not how many guaranteed kthreads the Caladan server uses. Instead, given a fixed number of guaranteed kthreads (say 4 cores), we send client requests at a rate close to (but lower than) the maximum capacity that the Caladan server can handle (say 1 Mpps). In this setup, no matter how many physical cores the client machines use (even one core per connection), the Never-Send rate is always high. As a result, the generated requests exhibit a distribution that is less bursty than expected.

With our modified clients (scheduling disabled, and the softirq processing one packet at a time), the Never-Send rate is low. The generated requests then follow a distribution that is more consistent with a Poisson distribution, but Caladan's P999 latency becomes much higher.
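
One way to quantify "less bursty than expected" is to compute the coefficient of variation (standard deviation / mean) of the interarrival gaps between the requests that were actually sent: for an ideal Poisson process it is about 1, and dropping the closely spaced requests tends to push it below 1. A small diagnostic sketch of that check (our own tooling, not part of Caladan):

```rust
// Coefficient of variation of the gaps between consecutive send timestamps.
// For an exponential (Poisson) arrival stream this is ~1.0; values well
// below 1.0 suggest the offered load has lost its burstiness.
fn interarrival_cv(send_timestamps_us: &[f64]) -> Option<f64> {
    if send_timestamps_us.len() < 2 {
        return None;
    }
    let gaps: Vec<f64> = send_timestamps_us
        .windows(2)
        .map(|w| w[1] - w[0])
        .collect();
    let n = gaps.len() as f64;
    let mean = gaps.iter().sum::<f64>() / n;
    let var = gaps.iter().map(|g| (g - mean) * (g - mean)).sum::<f64>() / n;
    Some(var.sqrt() / mean)
}

fn main() {
    // Toy example: perfectly regular 10 us spacing has zero variance, so
    // CV = 0, far from the ~1.0 expected of a Poisson arrival stream.
    let regular: Vec<f64> = (0..1000).map(|i| i as f64 * 10.0).collect();
    println!("CV of perfectly paced sends: {:.3}", interarrival_cv(&regular).unwrap());
}
```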

@joshuafried (Member)

Does this behavior change if you use many connections to the server? Say 100?

@joshuafried (Member)

I'm trying to understand where the source of the delay is coming from that is causing so many never-sent packets. Please correct me if I am wrong in understanding the scenario here: the server machine is being tested at a load point close to its peak throughput. The client process/machine is not at full utilization and is not a bottleneck. Does this seem correct?

@neolinsu (Author)

> I'm trying to understand where the source of the delay is coming from that is causing so many never-sent packets. Please correct me if I am wrong in understanding the scenario here: the server machine is being tested at a load point close to its peak throughput. The client process/machine is not at full utilization and is not a bottleneck. Does this seem correct?

Yes, this is correct

@neolinsu (Author)

> Does this behavior change if you use many connections to the server? Say 100?

It seems the Never Send rate becomes higher as the number of connections grows.

@joshuafried (Member)

I'd be interested in trying to reproduce these results since they generally don't match what I've seen in my setup so far. Can you provide me the commit hashes that you are running for caladan and memcached, the configuration files for both clients and server, and the launch parameters and output logs for the iokernel, memcached, and the loadgen instances?

@neolinsu (Author) commented Mar 29, 2023

Configs for Replay

caladan-all: 37a3822be053c37275f0aefea60da26246fd01cb

Client

  • cmd
synthetic --config synthetic.config 10.100.100.102:5190  --output=buckets --protocol memcached --mode runtime-client --threads 16 --runtime 32 --barrier-peers 1 --barrier-leader node151 --distribution=exponential --mpps=0.8 --samples=1 --transport tcp --nvalues=3200000
  • Configuration
host_addr 10.100.100.103
host_netmask 255.255.255.0
host_gateway 10.100.100.1
runtime_kthreads 16
runtime_guaranteed_kthreads 16
runtime_spinning_kthreads 16
host_mac X
disable_watchdog true
runtime_qdelay_us 10
runtime_priority lc
static_arp 10.100.100.102 X
static_arp 10.100.100.103 X
enable_directpath fs
directpath_pci X

Server

  • cmd
memcached memcached.config -t 16 -U 5190 -p 5190 -c 32768 -m 32000 -b 32768 -o hashpower=25,no_hashexpand,lru_crawler,lru_maintainer,idle_timeout=0,slab_reassign
  • Configuration
host_addr 10.100.100.102
host_netmask 255.255.255.0
host_gateway 10.100.100.1
runtime_kthreads 4
runtime_guaranteed_kthreads 4
host_mac X
disable_watchdog true
runtime_qdelay_us 10
runtime_priority lc
static_arp 10.100.100.102 X
static_arp 10.100.100.103 X
enable_directpath fs
directpath_pci X

Other Setups

  • We keep the kthreads on the NUMA node that is connected to the RDMA NIC.
  • The CPU cores where iokerneld and the server workers run are placed in the isolcpus and nohz_full lists in the kernel boot command line (see the example below).
  • We run iokerneld with the ias policy.
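
For reference, isolcpus and nohz_full are standard Linux kernel boot parameters; the CPU list below is purely illustrative (the actual core IDs depend on the machine):

isolcpus=2-9 nohz_full=2-9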

@neolinsu (Author)

> I'd be interested in trying to reproduce these results since they generally don't match what I've seen in my setup so far.

Would you like to share your configuration and results? Specifically, the Never Send rate when the request rate is close to the maximum capacity that the Caladan server can handle.

@joshuafried (Member)

Can you also share the outputs/logs from the various programs that you've launched? Also, caladan-all @ 37a3822b points to caladan @ 4a254bf, though I see some of your configurations imply a later version of caladan (ie using the directpath_pci config etc). Can you please confirm the version that you are running, and whether there are any modifications that are made to it?

@neolinsu (Author) commented Mar 29, 2023

> Can you also share the outputs/logs from the various programs that you've launched? Also, caladan-all @ 37a3822b points to caladan @ 4a254bf, though I see some of your configurations imply a later version of caladan (ie using the directpath_pci config etc). Can you please confirm the version that you are running, and whether there are any modifications that are made to it?

We use Caladan @ 1ab79505 and memcached from caladan-all @ 37a3822b.
