shared_client: Bump request id #12

Open · jrajahalme wants to merge 6 commits into master

Conversation

@jrajahalme (Member):

Only fail out if a non-conflicting request id cannot be found.

This works on the premise that the callers are fine with the request id
being modified at this point. Current use sets a random id just prior to
the Exchange call, so this premise is satisfied.
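
For illustration, the calling pattern this premise relies on looks roughly like the sketch below (dns.Id() is the miekg/dns random id generator; the query name and surrounding program are illustrative, not the actual caller):

```go
package main

import (
	"fmt"

	"github.com/miekg/dns"
)

func main() {
	msg := new(dns.Msg)
	msg.SetQuestion("example.com.", dns.TypeA)
	// A fresh random id is set immediately before handing the message
	// to the shared client, which is therefore free to bump it if it
	// collides with another in-flight request.
	msg.Id = dns.Id()
	fmt.Printf("request id before exchange: %d\n", msg.Id)
}
```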

jrajahalme requested a review from gandro on April 25, 2024 at 13:50.
@gandro (Member) left a comment:

This seems harmless to me, but I'm still not confident in my understanding of SharedClient.

cc @marseel if you also want to take a look, since you know that particular piece better than I do

@marseel commented Apr 25, 2024:

I was thinking about it yesterday :)

I think it looks alright, but the main trigger for this issue is probably something like this:

  • We are sending a constant stream of DNS requests
  • For some of them we get responses, for some we don't
  • Requests stuck in waitingResponses never time out, because there is only a single shared timeout that never fires as long as at least some responses keep arriving.

So I think this change definitely makes sense, but more important IMHO would be to have separate timeouts per request in waitingResponses

Does it make sense @jrajahalme ?
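
In sketch form, the problematic pattern looks something like this (hypothetical code, not the actual shared_client.go; responses, deliver, and failAll are illustrative stand-ins):

```go
package sharedclient

import (
	"context"
	"time"

	"github.com/miekg/dns"
)

// receiveLoop sketches the single-timeout problem described above.
func receiveLoop(responses <-chan *dns.Msg, readTimeout time.Duration,
	deliver func(*dns.Msg), failAll func(error)) {
	timer := time.NewTimer(readTimeout)
	defer timer.Stop()
	for {
		select {
		case r := <-responses:
			deliver(r)
			// One shared deadline for all waiters: any response at all
			// pushes it forward. (This naive Reset also has the drain
			// problem discussed later in this review.)
			timer.Reset(readTimeout)
		case <-timer.C:
			// Fires only after a full readTimeout with no responses at
			// all, so a waiter whose response was lost can be stuck
			// indefinitely while other responses keep the timer alive.
			failAll(context.DeadlineExceeded)
			return
		}
	}
}
```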

@jrajahalme (Member, Author):

> So I think this change definitely makes sense, but more important IMHO would be to have separate timeouts per request in waitingResponses
>
> Does it make sense @jrajahalme ?

Makes total sense. I'll look into adding request-specific timeouts.

@jrajahalme (Member, Author):

@marseel Added per-request timeout handling via a deadline queue using container/heap, please have a look!
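
Rough shape of the deadline queue (a sketch consistent with the snippets quoted below; names beyond those quoted are illustrative):

```go
package sharedclient

import (
	"container/heap"
	"time"
)

// waiter is one in-flight request waiting for its response.
type waiter struct {
	id       uint16
	deadline time.Time
	index    int // position in the heap, maintained by Swap/Push
}

// waiterQueue is a min-heap ordered by deadline; the waiter with the
// earliest deadline is always at waiters[0].
type waiterQueue struct {
	waiters []*waiter
}

func (wq *waiterQueue) Len() int { return len(wq.waiters) }

func (wq *waiterQueue) Less(i, j int) bool {
	return wq.waiters[i].deadline.Before(wq.waiters[j].deadline)
}

func (wq *waiterQueue) Swap(i, j int) {
	wq.waiters[i], wq.waiters[j] = wq.waiters[j], wq.waiters[i]
	wq.waiters[i].index = i
	wq.waiters[j].index = j
}

func (wq *waiterQueue) Push(x any) {
	w := x.(*waiter)
	w.index = len(wq.waiters)
	wq.waiters = append(wq.waiters, w)
}

func (wq *waiterQueue) Pop() any {
	old := wq.waiters
	n := len(old)
	w := old[n-1]
	old[n-1] = nil // avoid leaking the waiter
	wq.waiters = old[:n-1]
	return w
}

// Insert adds a new waiter in deadline order.
func (wq *waiterQueue) Insert(w *waiter) { heap.Push(wq, w) }

// GetTimeout returns how long the handler may sleep before the earliest
// deadline expires; a long fallback is used when the queue is empty.
func (wq *waiterQueue) GetTimeout() time.Duration {
	if wq.Len() == 0 {
		return 10 * time.Minute
	}
	return time.Until(wq.waiters[0].deadline)
}
```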

@marseel left a comment:
@gandro probably you want to take another look as well :)

shared_client.go Outdated
if wq.Len() == 0 {
	return 10 * time.Minute
}
return wq.waiters[0].deadline.Sub(time.Now())
@marseel:
nit: return time.Until(wq.waiters[0].deadline)

@jrajahalme:
done!

shared_client.go Outdated
	break
}
wtr := heap.Pop(wq).(*waiter)
wtr.ch <- sharedClientResponse{nil, 0, context.DeadlineExceeded}
@jrajahalme:
Good point, not sure how to return a net.Error Timeout though?

@jrajahalme:
Have to implement the net.Error interface:

// errTimeout is an error representing a request timeout.
// Implements net.Error
type errTimeout struct {
}

func (e errTimeout) Timeout() bool { return true }

// Temporary is deprecated. Return false.
func (e errTimeout) Temporary() bool { return false }

func (e errTimeout) Error() string {
	return "request timed out"
}

var netErrorTimeout errTimeout
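
The expiry path can then send this value instead of context.DeadlineExceeded, roughly (a sketch reusing the deadline-queue shape and the wtr.ch / heap.Pop snippets quoted above):

```go
// Expire every waiter whose deadline has passed (sketch).
now := time.Now()
for wq.Len() > 0 && !wq.waiters[0].deadline.After(now) {
	wtr := heap.Pop(wq).(*waiter)
	// netErrorTimeout satisfies net.Error with Timeout() == true, so
	// callers can tell a timeout apart from other failures.
	wtr.ch <- sharedClientResponse{nil, 0, netErrorTimeout}
}
```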

@jrajahalme:
done!

shared_client.go Outdated
if waiters.Exists(req.msg.Id) {
	// find next available ID
	duplicate := true
	for id := req.msg.Id + 1; id != req.msg.Id; id++ {
@marseel:
I'm not sure how safe it is to use these kinds of "predictable" ids (see https://nvd.nist.gov/vuln/detail/CVE-2008-0087). I would probably retry Id() a few times (for example 3) instead. That would also give a more predictable runtime.

@jrajahalme:
Each request starts from a random number (set via Id() by the caller), so it is unlikely that the ids of consecutive requests would be sequential. This relies on the caller, though.

IMO 3 random tries might not work if there are a lot of requests in flight?

@marseel:
> IMO 3 random tries might not work if there are a lot of requests in flight?

True, but that would require tens of thousands of in-flight requests, meaning something else is probably wrong already. I would consider using a fixed number of tries as a circuit-breaker.

For example, with 32k inflight requests, we would still have a chance of 1 - (32000/65000)^4 ~= 0.94 of getting a new correct ID.

@jrajahalme:
Done, with 5 tries.

@jrajahalme:
> I would consider using a fixed number of tries as a circuit-breaker.

Did not get this, though?

@marseel:
Ah, nvm, that's essentially what I meant: a fixed loop.
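
The agreed approach then reduces to a small fixed loop along these lines (a sketch; pickFreeID and the exists callback are illustrative names standing in for the PR's waiters.Exists check, and Id() is the miekg/dns random id generator):

```go
package sharedclient

import (
	"fmt"

	"github.com/miekg/dns"
)

// maxIDRetries bounds the search for a non-conflicting id, acting as a
// circuit-breaker: exhausting 5 random tries implies an abnormal number
// of in-flight requests.
const maxIDRetries = 5

// pickFreeID retries random ids until one does not conflict with an
// in-flight request.
func pickFreeID(msg *dns.Msg, exists func(uint16) bool) error {
	for i := 0; i < maxIDRetries; i++ {
		if !exists(msg.Id) {
			return nil
		}
		msg.Id = dns.Id() // try another random id
	}
	return fmt.Errorf("unable to find a non-conflicting request id in %d tries", maxIDRetries)
}
```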

Only fail out if a non-conflicting request id cannot be found in a couple
of tries.

This works on the premise that the callers are fine with the request id
being modified at this point. Current use sets a random id just prior to
Exchange call, so this premise is satisfied.

Signed-off-by: Jarno Rajahalme <[email protected]>
Buffer the responses channel so that the handler does not get blocked if
the channel is not received from (e.g., after a timeout).

Signed-off-by: Jarno Rajahalme <[email protected]>
Use the configured read timeout to bound the time spent on receiving the
response, instead of waiting for a full minute.

Signed-off-by: Jarno Rajahalme <[email protected]>
Drain requests on handler close, so that pending requests are terminated
immediately when the handler needs to close due to an error condition,
rather than having the requests time out. This allows the handler to be
recycled faster for new requests.

Signed-off-by: Jarno Rajahalme <[email protected]>
container/heap uses `any`, which was added in Go 1.18. Bump the tested Go
versions to accommodate this.

Signed-off-by: Jarno Rajahalme <[email protected]>
Tell the handler to delete waiters after a request times out.

Signed-off-by: Jarno Rajahalme <[email protected]>
@gandro (Member) left a comment:
I need to do a more in-depth review of the last commit, but here is a bit of feedback from a first pass.

// Drain requests in case they come in while we are closing
// down. This loop is done only after 'requests' channel is closed in
// SharedClient.close() and it is not possible for new requests or timeouts
// to be sent on those closed channels.
@gandro:
I don't follow this comment. Sending on a closed channel panics; surely that's not how we prevent senders from sending requests? I assume the real check preventing senders is that conn is nil, right?

@jrajahalme:
SharedClient.close() (note: not a close(<channel>)) is called when the shared client can no longer be used for new requests, so nothing is sent on the closed channel. The point of the comment is that the range loop on the channel completes only after the channel is closed (by the sending side), so we are guaranteed to send replies to all requests received on this channel.
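
In sketch form (names and the exact error value are illustrative; the point is that range terminates only once the sending side has closed the channel):

```go
// In SharedClient.close(): after this point no new requests are sent,
// so the handler's drain loop below is guaranteed to terminate.
close(requests)

// In the handler: the range loop receives every request sent before
// close(requests) and exits only once the channel is closed and empty,
// so each pending request gets a terminal reply instead of timing out.
// (net.ErrClosed is an illustrative choice of error value.)
for req := range requests {
	req.ch <- sharedClientResponse{nil, 0, net.ErrClosed}
}
```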

@@ -7,7 +7,7 @@ jobs:
     runs-on: ubuntu-latest
     strategy:
       matrix:
-        go: [ 1.17.x, 1.18.x ]
+        go: [ 1.18.x, 1.19.x ]
@gandro:
unrelated nit: Given the sheer amount of highly concurrent code in shared client, we should also run with go test -race in this workflow

for {
	// update timer
	deadline.Reset(waiters.GetTimeout())
@gandro (Apr 30, 2024):
Reset is not safe to use when the timer is not drained: https://pkg.go.dev/time#Timer.Reset

How do we know that the timer is drained here?

@jrajahalme:
Thanks for noting this, will have to work around this...
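
The usual workaround is the stop-and-drain dance before every Reset, along these lines (a sketch; the non-blocking select guards against the expiry value having been consumed already):

```go
// Safely re-arm a timer that may have fired without being received from.
if !deadline.Stop() {
	// The timer already fired; drain any stale expiry so the next
	// receive on deadline.C does not observe it.
	select {
	case <-deadline.C:
	default:
	}
}
deadline.Reset(waiters.GetTimeout())
```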

@jrajahalme:

@marseel FYI we are seeing CI flakes due to ID collisions in cilium main, so it is prudent to at least retry to get a good ID (like in this PR):

  ❌ Found 1 logs in kind-kind/kube-system/cilium-jgxqz (cilium-agent) matching list of errors that must be investigated:
time="2024-05-03T07:20:54Z" level=error msg="Cannot forward proxied DNS lookup" DNSRequestID=45017 dnsName=one.one.one.one. endpointID=1252 error="duplicate request id 224" identity=10831 ipAddr="10.244.2.92:47175" subsys=fqdn/dnsproxy (1 occurrences)

@marseel commented May 6, 2024:

> FYI we are seeing CI flakes due to ID collisions in cilium main, so it is prudent to at least retry to get a good ID (like in this PR):

Interesting, that would mean we were hitting this issue even with a low number of concurrent requests 🤔
In that case, I guess we could merge the retry-only change first to mitigate these failures and then follow up with the heap and timeouts.

@jrajahalme:

> In that case, I guess we could merge the retry-only change first to mitigate these failures and then follow up with the heap and timeouts.

Here's a PR for the 1st commit only: #13
