-
Notifications
You must be signed in to change notification settings - Fork 215
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Approximately 50ms latency increase observed for synchronous vs asynchronous calls #323
Comments
This strikes me as a mailbox congestion issue that makes a lot of sense for Erlang and Elixir applications that benefit best from systems that can be parallelized. |
@cprice404 if you do have a script/livebook to replicate the issue, it would be helpful! |
@polvalente thanks, we'll try to put a simple repro together. |
I have not been able to reproduce this so far with a local insecure/plaintext/http grpc server. So it may have something to do with the TLS setup. Providing a completely self-contained example including the server then becomes challenging because of certificates and DNS. So, attached is a tarball that makes grpc requests to an actual deployed (development) instance of our service. The tarball contains both a python and an elixir client example. To run the python example, from the root dir of the expanded tarball (requires
You will see the python grpc client making a bunch of requests in a loop, with a fairly consistent latency being printed out (for me, about 10-12ms, but this will depend on your network). Then, elixir (also from the tarball root):
Does pretty much the exact same thing as the python client, but you will notice that there is an extra ~50ms reported in the latency for each request. I tried to instrument a local copy of elixir-grpc/gun and the extra time appears to happen in And again the other observation is that if I do a bunch of concurrent async requests instead of serial synchronous requests like this, that ~50ms penalty seems to happen much less frequently (but it still happens). |
I also tried changing my tls options to:
In case it was some issue with the |
I also wrote a simple program that times gun HTTPS requests in a loop, and I don't see this issue there - but that server is HTTP 1.1. Poking around in the gun docs I saw this: https://ninenines.eu/docs/en/gun/2.0/guide/internals_tls_over_tls/#_stream_wide_tls_over_tls Which makes me wonder if this is related to HTTP/2 TLS with gun. |
The difference is even more dramatic locally. The Elixir client is taking ~500ms while the Python client is taking ~200ms |
Some things I noticed running locally:
I suspect somewhere in the code we're sending an extra request that Python is either not sending or merging into another request somehow |
or possibly sending the TLS certs/handshake packets more often than the python client, since so far I haven't been able to reproduce this without TLS. Just a guess though. |
It could be that somehow the TLS connection isn't being actually reused. What makes me wonder though is that I saw the same behavior in using the Mint adapter, which makes me suspect it's the library's fault and not the actual underlying client's. |
@polvalente yeah I also observed the same thing with Mint but am not familiar enough with them to know if Mint uses a similar strategy for TLS. However, this morning I set up a simple HTTP2 app hosted by nginx with SSL and ran my tests using plain http clients rather than grpc clients. For the python client I used So this is one more data point suggesting that the issue lies somewhere in the elixir-grpc library. Here's the basic script I was running for gun. Based on my attempt to step through the elixir-grpc open_opts = %{
protocols: [:http2],
transport: :tls,
tls_opts: [
verify: :verify_peer,
cacertfile: "/path/to/minica/certs.pem"
],
}
{:ok, conn_pid} = :gun.open(String.to_charlist("myEc2Host"), 8000, open_opts)
case :gun.await_up(conn_pid) do
{:ok, :http2} -> Logger.info("Initialized HTTP2 connection")
{:error, e} ->
Logger.error("Failed to initialize HTTP2 connection: #{inspect(e)}")
raise inspect(e)
end
for n <- 1..1_000_000 do
start_time = :os.system_time(:milli_seconds)
stream_ref = :gun.get(conn_pid, "/")
headers = :gun.await(conn_pid, stream_ref)
Logger.info("Headers: #{inspect(headers)}")
body = :gun.await(conn_pid, stream_ref)
Logger.info("Body: #{inspect(body)}")
duration = :os.system_time(:milli_seconds) - start_time
Logger.info("Finished request; duration: #{duration}")
end If it is looking like it's not the http client's issue, does anyone have any advice on next steps to debug what might be going wrong in the elixir-grpc library code? |
At the h2 stream level that may be possible. At the TCP/connection level I don't think that's the case because I have been monitoring my runs with |
We are seeing a 3x increased latency in Elixir vs C# while talking to the local insecure gRPC server. Also using Mint and exactly with the same symptom of Elixir being more chatty NFIBrokerage/spear#92 (maybe a red herring though) |
Sorry for the long delay. If changing clients does not seem to affect the overall performance, there seems to be some sort of protocol implementation difference on the layers above. If anyone wants to look into this, I'm happy to accept any help :) |
I apparently have the same issue. I'm not able to reproduce it in local environment, only in production. |
Not sure if the cause is the same for you (it seems like so, because you mentioned you're using amazon linux), @cprice404, but here, the issue was Amazon's load balancer. Nagle algorithm + TCP delay interaction were adding a bit over 40ms to requests. Confirmed this on a TCP dump. Skipping the LB altogether solves the issue. |
@v0idpwn thanks for the info! some permutations of the testing I mentioned above did indeed involve an AWS NLB. However, I don't observe the same perf issues with grpc clients in other programming languages, even when talking to the same server, so I still believe that there is/was something going awry in the elixir grpc code path. I haven't tried to repro this since last year though, so I'm not sure if any recent updates to elixir-grpc or underlying http libs may have improved the situations. |
That's interesting. In my experience there's significant difference in the frames sent by this library and other grpc clients, but I didn't try them with the load balancer. Will try soon and update here. Maybe it's just a matter of sending 1 or 2 additional frames and it will work with the LB. |
Describe the bug
We have a very low-latency gRPC service that we have written client code for in several different programming languages. When we load test it in most languages, we are able to observe a p99 latency of < 5ms for our basic APIs.
When calling our service using elixir-grpc, we observe the following:
To Reproduce
I can provide a simple reproducer script if requested; I have one right now that can be run against our production gRPC service but requires a (free) auth token. Alternately I could write a small echo server gRPC service and provide both the client and the server code to repro. If someone is able to help look into this please let me know which you prefer and I will update the issue.
Expected behavior
Calling APIs synchronously in a simple loop should not exhibit worse (higher) client-side latencies than calling the same APIs asynchronously and concurrently.
Logs
N/A
Protos
Can provide these upon request depending on whether you would prefer to see a reproducer against our production gRPC service or a standalone client/server reproducer.
Versions:
Additional context
If there is anyone who is able to help look into this please let me know what additional context would be useful and we will provide it!
The text was updated successfully, but these errors were encountered: