
Flaky test: "replication_set_should_not_change_if_ring_did_not_change,_when_num_of_instance_per_zone_is_inconsistent" #6513

Open · CharlieTLe opened this issue Jan 16, 2025 · 7 comments · May be fixed by #6535

@CharlieTLe (Member)

Noticed build failure: https://github.com/cortexproject/cortex/actions/runs/12798522991/job/35682746078?pr=6511

--- FAIL: TestRing_Get_Consistency (488.91s)
    --- FAIL: TestRing_Get_Consistency/replication_set_should_not_change_if_ring_did_not_change,_when_num_of_instance_per_zone_is_inconsistent (61.47s)
        ring_test.go:735: 
            	Error Trace:	/__w/cortex/cortex/pkg/ring/ring_test.go:735
            	Error:      	"1" is not less than or equal to "0"
            	Test:       	TestRing_Get_Consistency/replication_set_should_not_change_if_ring_did_not_change,_when_num_of_instance_per_zone_is_inconsistent
FAIL
FAIL	github.com/cortexproject/cortex/pkg/ring	704.951s
@dsabsay (Contributor) commented Jan 18, 2025

After several experiments, I can reliably (8 out of 9 times) get this test to fail by running 4,000,000 iterations of it. I tried a number of approaches; the fastest is launching runtime.NumCPU() goroutines, each running the test 4000000 / num_cores times. This takes about 5 minutes on my laptop, utilizing all cores to near maximum.
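
For reference, a minimal sketch of that kind of harness (runOnce here is a hypothetical stand-in for the flaky test body, not the actual repro code):

package main

import (
	"fmt"
	"runtime"
	"sync"
)

// runOnce stands in for one iteration of the test body under suspicion;
// the real harness would run the ring.Get consistency check here.
func runOnce() bool { return true }

func main() {
	const totalIterations = 4_000_000
	workers := runtime.NumCPU()
	perWorker := totalIterations / workers

	var wg sync.WaitGroup
	failures := make(chan int, workers)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			n := 0
			for i := 0; i < perWorker; i++ {
				if !runOnce() {
					n++
				}
			}
			failures <- n
		}()
	}
	wg.Wait()
	close(failures)

	total := 0
	for n := range failures {
		total += n
	}
	fmt.Printf("%d failures in %d iterations\n", total, perWorker*workers)
}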

In one failure I looked at, the "new" replication set contains 2 instances that were NOT in the initial set, and is missing two instances that WERE in the initial set. I haven't looked at enough results to know if that's the normal way it fails.

SPECULATION: After skimming the code under test, I don't see any use of concurrency, which rules out race conditions as a possible cause (unless I'm missing something in the ring code). Other possible causes of flakiness that I'm left pondering:

  1. An off-by-one or similar error in ring.Get that is triggered only when just the right key and token boundaries line up. The only things of (seeming) significance that change between test iterations are the tokens used as the lookup key AND the tokens for each ring instance (all randomly generated).

QUESTION: The name of this test is "replication set should not change if ring did not change, when num of instance per zone is inconsistent". I understand the first part, and I understand the role of the ring and the intent of consistent hashing, but I don't understand what "when num of instance per zone is inconsistent" means. Can anyone take a guess?

@dsabsay (Contributor) commented Jan 22, 2025

I observed what happens in ring.Get when this test fails. For the same token, the ringInstanceByToken map returns a different instance.

I believe the culprit is this line in ring.getTokensInfo():

for instanceID, instance := range d.Ingesters {

Map iteration order is undefined, so if there are overlapping tokens (i.e. a given token is owned by multiple instances), which instance is mapped to an overlapping token is non-deterministic. In this test, getTokensInfo() is called between the first and second calls to ring.Get.

I propose to fix this by iterating over a sorted slice of keys instead of the map directly. I will submit a patch for us to discuss.
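
For discussion, a minimal sketch of that sorted-key approach (InstanceDesc and the return type are simplified here; this is not the actual patch):

package main

import (
	"fmt"
	"sort"
)

// InstanceDesc is simplified; the real type lives in pkg/ring.
type InstanceDesc struct {
	Tokens []uint32
}

// getTokensInfo builds the token -> instance map by ranging over a sorted
// slice of instance IDs, so a token owned by several instances always
// resolves to the same one.
func getTokensInfo(ingesters map[string]InstanceDesc) map[uint32]string {
	ids := make([]string, 0, len(ingesters))
	for id := range ingesters {
		ids = append(ids, id)
	}
	sort.Strings(ids) // fixed order, unlike ranging over the map

	out := make(map[uint32]string, len(ingesters))
	for _, id := range ids {
		for _, tok := range ingesters[id].Tokens {
			out[tok] = id // last writer still wins, but now deterministically
		}
	}
	return out
}

func main() {
	ring := map[string]InstanceDesc{
		"ingester-b": {Tokens: []uint32{42}},
		"ingester-a": {Tokens: []uint32{42}},
	}
	fmt.Println(getTokensInfo(ring)[42]) // always "ingester-b"
}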

dsabsay pushed commits to dsabsay/cortex referencing this issue on Jan 22 and 23, 2025, and linked a pull request that will close this issue.
@CharlieTLe (Member, Author)

Hi @dsabsay, thanks for looking into this!

I spent a couple of hours trying to figure out why this test would fail, but I'm not seeing how sorting the list of ingesters when building the token map in ring.getTokensInfo() would affect the output.

The test case is described below.

"replication set should not change if ring did not change, when num of instance per zone is inconsistent": {
	initialInstances:  11,
	addedInstances:    0,
	removedInstances:  0,
	numZones:          3,
	replicationFactor: 8,
	numDiff:           0,
},

The test seems to be failing because there's an inconsistency between the two Get operations: they should return the same set of ingesters, even when the number of instances per zone is inconsistent. The number of instances per zone is inconsistent because there are 11 ingesters and 3 zones in the test. 11 / 3 is not a whole number (~3.67), so one zone gets 3 ingesters while the other two get 4 (3 + 4 + 4 = 11). With 12 ingesters and 3 zones, every zone would get the same number of ingesters (4 + 4 + 4).
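
A toy calculation of that per-zone split, assuming round-robin zone assignment (an assumption about the test setup, not verified against the test code):

package main

import "fmt"

func main() {
	const instances, zones = 11, 3
	perZone := make([]int, zones)
	for i := 0; i < instances; i++ {
		perZone[i%zones]++ // round-robin zone assignment (assumed)
	}
	fmt.Println(perZone) // [4 4 3]: one zone ends up with only 3 ingesters
}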

We know that both sets contain 8 instances, since the test asserts this:

set, err := ring.Get(testValues[i], Write, bufDescs, bufHosts, bufZones)
assert.NoError(t, err)
assert.Equal(t, testData.replicationFactor, len(set.Instances))
newSet, err := ring.Get(testValues[i], Write, bufDescs, bufHosts, bufZones)
assert.NoError(t, err)
assert.Equal(t, testData.replicationFactor, len(newSet.Instances))

In between the two Get calls, there are some operations that appear to mutate the ring, causing the second call to produce a different set of ingesters:

ring.ringTokens = ringDesc.GetTokens()
ring.ringTokensByZone = ringDesc.getTokensByZone()
ring.ringInstanceByToken = ringDesc.getTokensInfo()
ring.ringZones = getZones(ringDesc.getTokensByZone())

Our documentation indicates that this test case is proposing an unsupported setup.

Minimum number of zones

... It is safe to have more zones than the replication factor, but it cannot be less. Having fewer availability zones than the replication factor causes a replica write to be missed, and in some cases, the write fails if the availability zones count is too low.

We should remove the test cases where the RF is greater than the number of zones. Perhaps we could also add a validation at ring creation that, when zone awareness is enabled, the RF must be <= the number of zones.
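
A sketch of what such a validation could look like (hypothetical; not existing cortex code, and whether the constraint is even correct is debated below):

package main

import "fmt"

// validateReplication is a hypothetical check that could run at ring creation.
func validateReplication(zoneAwarenessEnabled bool, replicationFactor, numZones int) error {
	if zoneAwarenessEnabled && replicationFactor > numZones {
		return fmt.Errorf("replication factor (%d) must be <= number of zones (%d) when zone awareness is enabled",
			replicationFactor, numZones)
	}
	return nil
}

func main() {
	// The failing test's configuration (RF=8, 3 zones) would be rejected.
	fmt.Println(validateReplication(true, 8, 3))
}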

@alanprot @justinjung04, could you help take a look at this?

@justinjung04 (Contributor)

But is it true that the replication factor shouldn't be greater than the number of zones? In a cluster with 3 zones, I can imagine a use case for setting the replication factor to 6 so that I'm writing (or reading) to 2 instances per zone.

Perhaps the root cause of this issue is more than one instance sharing the same token. I'll look into that code a bit more to see whether that's expected behavior.

@dsabsay (Contributor) commented Feb 4, 2025

@CharlieTLe thanks for explaining what "inconsistent" means in this case. Makes sense.

I also found the same things as you.

Additionally, I found that, as @justinjung04 suspected, sometimes more than one instance shares a token. That's when the order of iteration matters while building the token->instance map: whichever instance comes last among the sharing group is the one mapped to the shared token. When the iteration order changes (as it can, since it is undefined in Go), the same token can end up mapped to different instances over time.
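
A standalone toy demonstration of that mechanism (not the ring code itself):

package main

import "fmt"

func main() {
	// Two instances both claim token 42. Building the lookup map by ranging
	// over another map makes the "winner" depend on Go's randomized map
	// iteration order.
	tokensByInstance := map[string][]uint32{
		"ingester-a": {42},
		"ingester-b": {42},
	}
	owner := map[uint32]string{}
	for id, tokens := range tokensByInstance {
		for _, t := range tokens {
			owner[t] = id // last writer wins; iteration order is undefined
		}
	}
	fmt.Println(owner[42]) // prints ingester-a on some runs, ingester-b on others
}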

I ran my repro test with replication factor 3 a total of 6 times now (6 * 4M iterations) and haven't seen it fail. I need to run it a few more times to be confident, but this may suggest that this issue only occurs when RF > numZones.

@dsabsay (Contributor) commented Feb 4, 2025

I ran the repro test with RF=3 for 4 more times (4 * 4M iterations) and they also didn't fail.

Based on this, I'd say the flakiness reported in this issue only occurs when RF > numZones.

@justinjung04 (Contributor)

I originally suspected that adding/removing instances from the ring would cause a token to be shared when RF > numZones.

However, the test case that's failing does not introduce any change to the ring:

"replication set should not change if ring did not change, when num of instance per zone is inconsistent": {
	initialInstances:  11,
	addedInstances:    0,
	removedInstances:  0,
	numZones:          3,
	replicationFactor: 8,
	numDiff:           0,
},

so the test is basically comparing

// get original replication set
set, err := ring.Get(testValues[i], Write, bufDescs, bufHosts, bufZones)

// change ring information based on instances added/removed, but for this particular test this shouldn't change anything
ring.ringTokens = ringDesc.GetTokens()
ring.ringTokensByZone = ringDesc.getTokensByZone()
ring.ringInstanceByToken = ringDesc.getTokensInfo()
ring.ringZones = getZones(ringDesc.getTokensByZone())

// get new replication set based on updated ring information
newSet, err := ring.Get(testValues[i], Write, bufDescs, bufHosts, bufZones)

And just by looking at the code of func (r *Ring) Get(key uint32, op Operation, bufDescs []InstanceDesc, bufHosts []string, bufZones map[string]int) (ReplicationSet, error), I can't find any suspicious line.
