Getting $PoolSize connection errors in a row permanently breaks a redis client #3062

Open
philandstuff opened this issue Jul 17, 2024 · 3 comments

@philandstuff

Expected Behavior

A flurry of transient connection errors should not result in a client getting into a permanently broken state. At some point, it should try a new connection.

Current Behavior

If you have a redis.Client with a PoolSize of, say, 5, and you get 5 failed connections in a row without an intervening successful connection, every subsequent call to that redis client will fail. It will return a stored error without trying to make a new connection.

Possible Solution

See "Possible Implementation" below.

Steps to Reproduce

Reproduction courtesy of my colleague @nickstenning: https://gist.github.com/nickstenning/e9479b62b3609927e27dc95e05385250

In this repro, we create a client with a PoolSize of 2. After two failed connections in a row, it never attempts to make a new connection again; every call just returns the last stored connection error.
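
For convenience, the shape of the repro is roughly the following (a minimal sketch, not the gist verbatim; the unreachable address and loop count are just illustrative):

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

func main() {
	// Nothing is listening on this port (illustrative), so every dial fails.
	rdb := redis.NewClient(&redis.Options{
		Addr:     "localhost:16379",
		PoolSize: 2,
	})

	ctx := context.Background()
	for i := 1; i <= 5; i++ {
		// The first couple of calls fail with real dial errors; once the
		// error count reaches PoolSize, calls fail with the stored error.
		err := rdb.Ping(ctx).Err()
		fmt.Printf("attempt %d: %v\n", i, err)
		time.Sleep(100 * time.Millisecond)
	}
}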

Context (Environment)

This affected us because we run with PoolSize=2 (which makes sense for our use case, where access to the redis.Client is mostly single-threaded). We saw 2 connection errors in a row, and then our app got stuck in a tight spinloop, repeatedly calling redis and immediately getting an error back. Our expectation was that calling the redis client would perform some sort of network operation, so it was surprising that it did not.

We were running go-redis v9.5.3 (which has disappeared, like v9.5.4 - see #3057).

Detailed Description

The relevant code is here: https://github.com/redis/go-redis/blob/v9.5.2/internal/pool/pool.go#L209-L220

In particular, once p.dialErrorsNum reaches p.cfg.PoolSize, dialConn() will always return an error without attempting to create a connection.
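
For readers who don't want to follow the link, the guard behaves roughly like this (a paraphrased sketch of the logic described above, not the verbatim library source; helper names are approximate):

// Inside dialConn (paraphrased): if the error counter has reached
// PoolSize, return the stored error instead of dialing.
if atomic.LoadUint32(&p.dialErrorsNum) >= uint32(p.cfg.PoolSize) {
	return nil, p.getLastDialError()
}

netConn, err := p.cfg.Dialer(ctx)
if err != nil {
	// A failed dial stores the error and bumps the counter.
	p.setLastDialError(err)
	atomic.AddUint32(&p.dialErrorsNum, 1)
	return nil, err
}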

Possible Implementation

Remove the code that's counting and storing errors. It's not clear to me what the intention of this code is, but I don't think getting the client into a permanently broken state like this is a good idea.

@dmaier-redislabs

dmaier-redislabs commented Jul 18, 2024

@philandstuff Thanks for the report. @vladvildanov @ofekshenawa Could you please check if we need to fix something before releasing 9.6 GA?

@monkey92t
Collaborator

Yes, as you can see: once the number of errors creating new connections reaches PoolSize, an error is returned by default, but a background probe keeps trying to connect to that address. Our goal is to avoid continually dialing new connections while the redis-server or the network is having problems.

When the probe detects that the connection has been restored, go-redis resumes normal operation. The probe interval is 1 second.

func (p *ConnPool) tryDial() {
	for {
		if p.closed() {
			return
		}

		conn, err := p.cfg.Dialer(context.Background())
		if err != nil {
			p.setLastDialError(err)
			time.Sleep(time.Second)
			continue
		}

		atomic.StoreUint32(&p.dialErrorsNum, 0)
		_ = conn.Close()
		return
	}
}

@philandstuff
Author

@monkey92t thank you. Yes that makes sense, I misread the code before. If I judiciously add a time.Sleep into the repro, I can see that it does eventually recover.

In this case I think the issue for us is slightly different: we have a loop that looks like:

for {
    res, err := rdb.XReadGroup(ctx, ...).Result()
    if err != nil {
        log.Warnf("Error fetching message: %v", err)
        continue
    }
    // ... process res ...
}

because we expect XReadGroup to be a blocking call that will take some time in success or failure cases. We didn't expect that this call might repeatedly return immediately with an error. I think I would find it less surprising if it blocked until a connection was established and free (or until context was canceled).

For now, we will add a sleep to this error case to avoid a tight spinloop.
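
Concretely, the workaround looks something like this (a sketch only; the stream/group names, blocking time, and backoff duration are illustrative placeholders, not our real values):

for {
	res, err := rdb.XReadGroup(ctx, &redis.XReadGroupArgs{
		Group:    "mygroup",    // illustrative placeholder
		Consumer: "myconsumer", // illustrative placeholder
		Streams:  []string{"mystream", ">"},
		Block:    5 * time.Second,
	}).Result()
	if err != nil {
		log.Warnf("Error fetching message: %v", err)
		// Back off so a persistently failing client can't turn this
		// loop into a tight spin.
		time.Sleep(time.Second)
		continue
	}
	// ... process res ...
	_ = res
}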
