Bug in robustness test history patching #19303

serathius · 2025-01-29T19:37:08Z

Bug report criteria

This bug report is not security related, security issues should be disclosed privately via etcd maintainers.
This is not a support request or question, support requests or questions should be raised in the etcd discussion forums.
You have read the etcd bug reporting guidelines.
Existing open issues along with etcd frequently asked questions have been checked and this is not a duplicate.

What happened?

Test https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/ci-etcd-robustness-release35-amd64/1883794260398968832 failed with panic: interface conversion: interface {} is nil, not model.EtcdRequest

The issue is reproducible locally from the report, which shows a failed linearization.

However, when I disabled history patching I got linearization success, implying that there is a bug in history patching.

What did you expect to happen?

Robustness test validation should not panic

How can we reproduce it (as minimally and precisely as possible)?

Follow the instructions at https://github.com/etcd-io/etcd/tree/main/tests/robustness#re-evaluate-existing-report using the artifact from https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/ci-etcd-robustness-release35-amd64/1883794260398968832

Anything else we need to know?

No response


joshuazh-x · 2025-02-11T03:22:54Z

The panic is caused by a patched porcupine.Operation whose invocation timestamp is later than its response timestamp. This violates the causality assumption made when building the linearization visualization.

The root cause is in how the put return time is adjusted when multiple WAL entries contain the same put request (same key and value). In this case the request is Put("compact_rev_key", "1055"). When iterating over persisted requests in reverse order, the last occurrence of such a put is processed first and adjusts its return time using the earliest observed client return time, which actually belongs to the first occurrence. This can skew the calculations that follow and push some request's return time to before its invocation time.

etcd/tests/robustness/validate/patch_history.go

Lines 207 to 229 in ad33010

    
    for i := len(persistedRequests) - 1; i >= 0; i-- {
    	request := persistedRequests[i]
    	switch request.Type {
    	case model.Txn:
    		lastReturnTime--
    		for _, op := range request.Txn.OperationsOnSuccess {
    			if op.Type != model.PutOperation {
    				continue
    			}
    			kv := keyValue{Key: op.Put.Key, Value: op.Put.Value}
    			returnTime, ok := earliestReturnTime[kv]
    			if ok {
    				lastReturnTime = min(returnTime, lastReturnTime)
    				earliestReturnTime[kv] = lastReturnTime
    			}
    		}
    	case model.LeaseGrant:
    	case model.LeaseRevoke:
    	case model.Compact:
    	default:
    		panic(fmt.Sprintf("Unknown request type: %q", request.Type))
    	}
    }

serathius added area/robustness-testing type/bug labels Jan 29, 2025

joshuazh-x mentioned this issue Feb 11, 2025

fix: skip duplicated puts when calculating put return time #19383

Open
