shared_client: Bump request id #12
base: master
Conversation
This seems harmless to me, but I'm still not confident in my understanding of SharedClient.
cc @marseel if you also want to take a look, since you know that particular piece better than I do
I was thinking about it yesterday :) I think it looks alright, but the main trigger for this issue is probably something like this:
So I think this change definitely makes sense, but more important IMHO would be to have separate timeouts per request. Does it make sense @jrajahalme?
Makes total sense. I'll look into adding request-specific timeouts.
Force-pushed 0171a40 to b993c3d (compare)
@marseel Added per-request timeout handling via a deadline queue using container/heap.
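For readers unfamiliar with container/heap, here is a minimal sketch of what such a deadline queue could look like; the waiter and waiterQueue names echo the snippets quoted in this thread, but the exact fields are assumptions rather than the PR's code.

```go
package main

import (
	"container/heap"
	"time"
)

// waiter is an illustrative stand-in for a pending request: its response
// channel and the deadline after which it should be failed.
type waiter struct {
	id       uint16
	deadline time.Time
	ch       chan struct{} // placeholder for the per-request response channel
}

// waiterQueue is a min-heap ordered by deadline: the earliest deadline is
// always at waiters[0] and drives the handler's timer.
type waiterQueue struct {
	waiters []*waiter
}

func (wq *waiterQueue) Len() int           { return len(wq.waiters) }
func (wq *waiterQueue) Less(i, j int) bool { return wq.waiters[i].deadline.Before(wq.waiters[j].deadline) }
func (wq *waiterQueue) Swap(i, j int)      { wq.waiters[i], wq.waiters[j] = wq.waiters[j], wq.waiters[i] }
func (wq *waiterQueue) Push(x any)         { wq.waiters = append(wq.waiters, x.(*waiter)) }

func (wq *waiterQueue) Pop() any {
	old := wq.waiters
	n := len(old)
	w := old[n-1]
	wq.waiters = old[:n-1]
	return w
}

// GetTimeout tells the handler how long it may sleep before the next deadline.
func (wq *waiterQueue) GetTimeout() time.Duration {
	if wq.Len() == 0 {
		return 10 * time.Minute // nothing pending, wake up eventually anyway
	}
	return time.Until(wq.waiters[0].deadline)
}

func main() {
	wq := &waiterQueue{}
	heap.Push(wq, &waiter{id: 1, deadline: time.Now().Add(time.Second), ch: make(chan struct{}, 1)})
	_ = wq.GetTimeout()
	_ = heap.Pop(wq).(*waiter)
}
```

Expired waiters can then be popped in deadline order when the timer fires, which is what the heap.Pop snippet quoted below does.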
@gandro probably you want to take another look as well :)
shared_client.go
Outdated
if wq.Len() == 0 {
	return 10 * time.Minute
}
return wq.waiters[0].deadline.Sub(time.Now())
nit: return time.Until(wq.waiters[0].deadline)
done!
shared_client.go
Outdated
	break
}
wtr := heap.Pop(wq).(*waiter)
wtr.ch <- sharedClientResponse{nil, 0, context.DeadlineExceeded}
We will need to change https://github.com/cilium/cilium/blob/931b8167ea29bfd3ae8e6f11f41a8a1c531c33c8/pkg/fqdn/dnsproxy/proxy.go#L599 or switch this error to a net.Error timeout, IIUC.
Good point, not sure how to return a net.Error Timeout though?
Have to implement the net.Error interface:
// errTimeout is an error representing a request timeout.
// Implements net.Error.
type errTimeout struct{}

func (e errTimeout) Timeout() bool { return true }

// Temporary is deprecated. Return false.
func (e errTimeout) Temporary() bool { return false }

func (e errTimeout) Error() string {
	return "request timed out"
}

var netErrorTimeout errTimeout
done!
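For completeness, a hedged sketch of how the caller side (e.g. the dnsproxy code linked above) could then classify such an error; isTimeout is an illustrative helper, not an existing function, and errTimeout is repeated from the snippet above so the example is self-contained.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// errTimeout mirrors the type sketched above so this example compiles on its own.
type errTimeout struct{}

func (e errTimeout) Timeout() bool   { return true }
func (e errTimeout) Temporary() bool { return false }
func (e errTimeout) Error() string   { return "request timed out" }

// isTimeout treats any error that implements net.Error and reports
// Timeout() == true as a request timeout.
func isTimeout(err error) bool {
	var nerr net.Error
	return errors.As(err, &nerr) && nerr.Timeout()
}

func main() {
	fmt.Println(isTimeout(errTimeout{}))                // true
	fmt.Println(isTimeout(errors.New("other failure"))) // false
}
```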
shared_client.go
Outdated
if waiters.Exists(req.msg.Id) {
	// find next available ID
	duplicate := true
	for id := req.msg.Id + 1; id != req.msg.Id; id++ {
I'm not sure how safe it is to use somewhat "predictable" IDs - https://nvd.nist.gov/vuln/detail/CVE-2008-0087. I would probably try Id() a few times (for example 3) instead. That would also give a more predictable runtime.
Each request starts from a random number (via Id()) set by the caller, so it is not likely that the IDs of consecutive requests would be sequential. This relies on the caller, though.
IMO 3 random tries might not work if there are a lot of requests in flight?
IMO 3 random tries might not work if there are a lot of requests in flight?
True, but that would require tens of thousands of in-flight requests, which probably means something else is wrong already. I would consider using a fixed number of tries as a circuit breaker.
For example, with 32k in-flight requests, we would still have a chance of 1 - (32000/65000)^4 ~= 0.94 of getting a new correct ID.
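A minimal sketch of that bounded-retry idea, assuming the handler tracks in-flight IDs in a set; pickFreeID, errIDExhausted, and the map argument are illustrative names, not the PR's code.

```go
package main

import (
	"errors"
	"fmt"

	"github.com/miekg/dns"
)

// errIDExhausted is a hypothetical sentinel for the circuit-breaker case.
var errIDExhausted = errors.New("could not find a free DNS request ID")

// pickFreeID retries dns.Id() a fixed number of times instead of scanning
// sequentially from the colliding ID, so IDs stay unpredictable and the
// worst-case runtime stays bounded.
func pickFreeID(inFlight map[uint16]struct{}, tries int) (uint16, error) {
	for i := 0; i < tries; i++ {
		id := dns.Id() // random 16-bit ID generator from miekg/dns
		if _, busy := inFlight[id]; !busy {
			return id, nil
		}
	}
	return 0, errIDExhausted
}

func main() {
	id, err := pickFreeID(map[uint16]struct{}{}, 5)
	fmt.Println(id, err)
}
```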
Done, with 5 tries.
I would consider using a fixed number of tries as a circuit breaker.
Did not get this part, though?
ah, nvm that's what I meant essentially, fixed loop.
Force-pushed b993c3d to b4f3e11 (compare)
Only fail out if a non-conflicting request id cannot be found in a couple of tries. This works on the premise that the callers are fine with the request id being modified at this point. Current use sets a random id just prior to the Exchange call, so this premise is satisfied. Signed-off-by: Jarno Rajahalme <[email protected]>
Buffer the responses channel so that the handler does not get blocked if the channel is not received from (e.g., after a timeout). Signed-off-by: Jarno Rajahalme <[email protected]>
Use the configured read timeout to bound the time spent on receiving the response, instead of waiting for a full minute. Signed-off-by: Jarno Rajahalme <[email protected]>
Drain requests on handler close, so that pending requests are terminated immediately when the handler needs to close for an error condition, rather than having the requests time out. This allows the handler to be recycled faster for new requests. Signed-off-by: Jarno Rajahalme <[email protected]>
container/heap uses `any`, which was added in Go 1.18. Bump tested Go versions to accommodate this. Signed-off-by: Jarno Rajahalme <[email protected]>
Tell handler to delete waiters after request times out. Signed-off-by: Jarno Rajahalme <[email protected]>
Force-pushed b4f3e11 to 5724e90 (compare)
I need to do a more in-depth review of the last commit, but here is a bit of feedback from a first pass.
// Drain requests in case they come in while we are closing
// down. This loop is done only after 'requests' channel is closed in
// SharedClient.close() and it is not possible for new requests or timeouts
// to be sent on those closed channels.
I don't follow this comment. Sending on a closed channel panics - surely that's not how we prevent the senders from sending requests? I assume the real check preventing senders is that conn is nil, right?
SharedClient.close() (note: not a close(<channel>)) is called when the shared client can no longer be used for new requests. So there is nothing sent to the closed channel. The point of the comment is that the range loop on the channel completes only after the channel is closed (by the side sending to the channel), so we are guaranteed to send replies to all requests received on this channel.
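To make those semantics concrete, here is a tiny, self-contained illustration of the pattern being described (types and names are placeholders, not the shared client's own): a range loop over a channel only terminates once the sender closes it, so every request that made it into the channel gets a reply.

```go
package main

import "fmt"

// drain replies to everything still in 'requests' and exits only once the
// sending side has closed the channel, mirroring the comment above.
func drain(requests <-chan string, replies chan<- string) {
	for req := range requests { // loop ends only after close(requests)
		replies <- "closed: " + req
	}
	close(replies)
}

func main() {
	requests := make(chan string, 2)
	replies := make(chan string, 2)
	requests <- "a"
	requests <- "b"
	close(requests) // only the sender closes; the receiver just drains
	drain(requests, replies)
	for r := range replies {
		fmt.Println(r)
	}
}
```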
@@ -7,7 +7,7 @@ jobs:
     runs-on: ubuntu-latest
     strategy:
       matrix:
-        go: [ 1.17.x, 1.18.x ]
+        go: [ 1.18.x, 1.19.x ]
unrelated nit: Given the sheer amount of highly concurrent code in shared client, we should also run with go test -race in this workflow.
for {
	// update timer
	deadline.Reset(waiters.GetTimeout())
Reset is not safe to use when the timer is not drained: https://pkg.go.dev/time#Timer.Reset. How do we know that the timer is drained here?
Thanks for noting this, will have to work around this...
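For reference, the stop-and-drain pattern described in the time.Timer.Reset documentation looks roughly like the sketch below; resetDeadline is an illustrative helper, and the select/default drain assumes no other goroutine receives from the timer's channel concurrently.

```go
package main

import "time"

// resetDeadline stops the timer and drains a possibly already-fired value
// before resetting, per the time.Timer.Reset docs. The select/default drain
// avoids blocking when the expiry was already consumed by the handler's own
// select loop.
func resetDeadline(t *time.Timer, d time.Duration) {
	if !t.Stop() {
		select {
		case <-t.C: // discard a stale expiry that was never received
		default:
		}
	}
	t.Reset(d)
}

func main() {
	t := time.NewTimer(time.Millisecond)
	time.Sleep(2 * time.Millisecond) // let the timer fire without receiving
	resetDeadline(t, 10*time.Millisecond)
	<-t.C // receives the expiry from the reset
}
```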
@marseel FYI we are seeing CI flakes due to ID collisions in cilium main, so it is prudent to at least retry to get a good ID (like in this PR).
Interesting, that would mean we were hitting this issue even with a low number of concurrent requests 🤔
Here's a PR for the 1st commit only: #13
Only fail out if a non-conflicting request id cannot be found. This works on the premise that the callers are fine with the request id being modified at this point. Current use sets a random id just prior to the Exchange call, so this premise is satisfied.