Getting $PoolSize connection errors in a row permanently breaks a redis client #3062

Open
philandstuff opened this issue Jul 17, 2024 · 3 comments

@philandstuff

Expected Behavior

A flurry of transient connection errors should not result in a client getting into a permanently broken state. At some point, it should try a new connection.

Current Behavior

If you have a redis.Client with a PoolSize of, say, 5, and you get 5 failed connections in a row without an intervening successful connection, every subsequent call to that redis client will fail. It will return a stored error without trying to make a new connection.

Possible Solution

See "Possible Implementation" below.

Steps to Reproduce

Reproduction courtesy of my colleague @nickstenning: https://gist.github.com/nickstenning/e9479b62b3609927e27dc95e05385250

In this repro, we create a client with a PoolSize of 2. After two failed connections in a row, it never attempts to make a new connection again; every call just returns the last stored connection error.
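
For convenience, the shape of the repro is roughly the following (a minimal sketch, not the gist verbatim; the unreachable address and loop count are just illustrative):

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

func main() {
	// Nothing is listening on this port (illustrative), so every dial fails.
	rdb := redis.NewClient(&redis.Options{
		Addr:     "localhost:16379",
		PoolSize: 2,
	})

	ctx := context.Background()
	for i := 1; i <= 5; i++ {
		// The first couple of calls fail with real dial errors; once the
		// error count reaches PoolSize, calls fail with the stored error.
		err := rdb.Ping(ctx).Err()
		fmt.Printf("attempt %d: %v\n", i, err)
		time.Sleep(100 * time.Millisecond)
	}
}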

Context (Environment)

This affected us because we run with PoolSize=2 (which makes sense for our use case, where access to the redis.Client is mostly single-threaded). We saw 2 connection errors in a row, and then our app got stuck in a tight spinloop, repeatedly calling redis and immediately getting an error back. Our expectation was that calling the redis client would perform some sort of network operation, so it was surprising that it did not.

We were running go-redis v9.5.3 (which has disappeared, like v9.5.4 - see #3057).

Detailed Description

The relevant code is here: https://github.com/redis/go-redis/blob/v9.5.2/internal/pool/pool.go#L209-L220

In particular, once p.dialErrorsNum reaches p.cfg.PoolSize, dialConn() will always return an error without attempting to create a connection.
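
For readers who don't want to follow the link, the guard behaves roughly like this (a paraphrased sketch of the logic described above, not the verbatim library source; helper names are approximate):

// Inside dialConn (paraphrased): if the error counter has reached
// PoolSize, return the stored error instead of dialing.
if atomic.LoadUint32(&p.dialErrorsNum) >= uint32(p.cfg.PoolSize) {
	return nil, p.getLastDialError()
}

netConn, err := p.cfg.Dialer(ctx)
if err != nil {
	// A failed dial stores the error and bumps the counter.
	p.setLastDialError(err)
	atomic.AddUint32(&p.dialErrorsNum, 1)
	return nil, err
}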

Possible Implementation

Remove the code that's counting and storing errors. It's not clear to me what the intention of this code is, but I don't think getting the client into a permanently broken state like this is a good idea.

@dmaier-redislabs

dmaier-redislabs commented Jul 18, 2024

@philandstuff Thanks for the report. @vladvildanov @ofekshenawa Could you please check if we need to fix something before releasing 9.6 GA?

@monkey92t
Collaborator

Yes, as you can see: once the number of errors creating new connections reaches PoolSize, an error is returned by default, but a background probe keeps trying to connect to that address. Our goal is to avoid continually dialing new connections while the redis-server or the network is having problems.

When the probe detects that the connection has been restored, go-redis resumes normal operation. The probe interval is 1 second.

func (p *ConnPool) tryDial() {
	for {
		if p.closed() {
			return
		}

		conn, err := p.cfg.Dialer(context.Background())
		if err != nil {
			p.setLastDialError(err)
			time.Sleep(time.Second)
			continue
		}

		atomic.StoreUint32(&p.dialErrorsNum, 0)
		_ = conn.Close()
		return
	}
}

@philandstuff
Author

@monkey92t thank you. Yes that makes sense, I misread the code before. If I judiciously add a time.Sleep into the repro, I can see that it does eventually recover.

In this case I think the issue for us is slightly different: we have a loop that looks like:

for {
    res, err := rdb.XReadGroup(ctx, ...).Result()
    if err != nil {
        log.Warnf("Error fetching message: %v", err)
        continue
    }
    // ... process res ...
}

because we expect XReadGroup to be a blocking call that will take some time in success or failure cases. We didn't expect that this call might repeatedly return immediately with an error. I think I would find it less surprising if it blocked until a connection was established and free (or until context was canceled).

For now, we will add a sleep to this error case to avoid a tight spinloop.
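
Concretely, the workaround looks something like this (a sketch only; the stream/group names, blocking time, and backoff duration are illustrative placeholders, not our real values):

for {
	res, err := rdb.XReadGroup(ctx, &redis.XReadGroupArgs{
		Group:    "mygroup",    // illustrative placeholder
		Consumer: "myconsumer", // illustrative placeholder
		Streams:  []string{"mystream", ">"},
		Block:    5 * time.Second,
	}).Result()
	if err != nil {
		log.Warnf("Error fetching message: %v", err)
		// Back off so a persistently failing client can't turn this
		// loop into a tight spin.
		time.Sleep(time.Second)
		continue
	}
	// ... process res ...
	_ = res
}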
