You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
A flurry of transient connection errors should not result in a client getting into a permanently broken state. At some point, it should try a new connection.
Current Behavior
If you have a redis.Client with a PoolSize of, say, 5, and you get 5 failed connections in a row without an intervening successful connection, every subsequent call to that redis client will fail. It will return a stored error without trying to make a new connection.
In this repro, we create a client with a PoolSize of 2. It then experiences two failed connections in a row, and from that point forward never attempts to make a new connection again, only returns the last stored connection error.
Context (Environment)
This affected us because we had PoolSize=2 (which makes sense for our use case where we have mostly single-threaded access to the redis.Client), we saw 2 connection errors in a row, and then our app got stuck in a tight spinloop where it was repeatedly calling redis and getting an error back. Our expectation was that calling the redis client would perform some sort of network operation, so it was surprising that it did not.
We were running go-redis v9.5.3 (which has disappeared, like v9.5.4 - see #3057).
In particular, once p.dialErrorsNum reaches p.cfg.PoolSize, dialConn() will always return an error without attempting to create a connection.
Possible Implementation
Remove the code that's counting and storing errors. It's not clear to me what the intention of this code is, but I don't think getting the client into a permanently broken state like this is a good idea.
The text was updated successfully, but these errors were encountered:
Yes, as you can see, when the number of errors in creating new connections reaches PoolSize, an error will be returned by default, but a probe will continuously attempt to connect to that address. Our goal is to avoid continually dialing connections when the redis-server or network is experiencing issues.
When the probe detects that the connection is restored, go-redis will resume normal operation. The interval is 1 second.
@monkey92t thank you. Yes that makes sense, I misread the code before. If I judiciously add a time.Sleep into the repro, I can see that it does eventually recover.
In this case I think the issue for us is slightly different: we have a loop that looks like:
because we expect XReadGroup to be a blocking call that will take some time in success or failure cases. We didn't expect that this call might repeatedly return immediately with an error. I think I would find it less surprising if it blocked until a connection was established and free (or until context was canceled).
For now, we will add a sleep to this error case to avoid a tight spinloop.
Expected Behavior
A flurry of transient connection errors should not result in a client getting into a permanently broken state. At some point, it should try a new connection.
Current Behavior
If you have a redis.Client with a PoolSize of, say, 5, and you get 5 failed connections in a row without an intervening successful connection, every subsequent call to that redis client will fail. It will return a stored error without trying to make a new connection.
Possible Solution
See "Possible Implementation" below.
Steps to Reproduce
Reproduction courtesy of my colleague @nickstenning: https://gist.github.com/nickstenning/e9479b62b3609927e27dc95e05385250
In this repro, we create a client with a PoolSize of 2. It then experiences two failed connections in a row, and from that point forward never attempts to make a new connection again, only returns the last stored connection error.
Context (Environment)
This affected us because we had PoolSize=2 (which makes sense for our use case where we have mostly single-threaded access to the redis.Client), we saw 2 connection errors in a row, and then our app got stuck in a tight spinloop where it was repeatedly calling redis and getting an error back. Our expectation was that calling the redis client would perform some sort of network operation, so it was surprising that it did not.
We were running go-redis v9.5.3 (which has disappeared, like v9.5.4 - see #3057).
Detailed Description
The relevant code is here: https://github.com/redis/go-redis/blob/v9.5.2/internal/pool/pool.go#L209-L220
In particular, once p.dialErrorsNum reaches p.cfg.PoolSize, dialConn() will always return an error without attempting to create a connection.
Possible Implementation
Remove the code that's counting and storing errors. It's not clear to me what the intention of this code is, but I don't think getting the client into a permanently broken state like this is a good idea.
The text was updated successfully, but these errors were encountered: