
Cluster leadership does not change when leader's backing disk freezes/becomes unresponsive #22064

mackenzieATA opened this issue Jan 9, 2025

Overview of the Issue

When a Consul leader's disk freezes or becomes unresponsive, leadership does not transition to the remaining instances in the cluster, and the leader isn't detected as unhealthy.

Observed behaviour: Leadership remains unchanged, but all writes to Consul hang and eventually fail with: agent.server: failed to wait for barrier: error="timed out enqueuing operation"

Expected behaviour: Leadership is removed from the unhealthy instance, and another instance in the cluster takes over and handles writes.
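
A quick way to check how the cluster reports the frozen leader's health (a hedged sketch; assumes default ports, no ACLs, and that the commands are run from one of the servers):

consul operator autopilot state                             # autopilot's view of each server's health
curl -s http://127.0.0.1:8500/v1/operator/autopilot/health  # the same information via the HTTP API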


Reproduction Steps

  1. Start up and configure the servers
    1. I started three servers in AWS
    2. On one server, attach an additional volume to be used as Consul's data-dir
      1. https://docs.aws.amazon.com/ebs/latest/userguide/ebs-using-volumes.html
  2. Install Consul on the servers
    1. I used the Linux package manager instructions at https://developer.hashicorp.com/consul/downloads and installed v1.20.2
  3. Start up the Consul cluster
    1. You'll need the IPs of all 3 servers
    2. On the server with the additional volume:
      You must fill in <where you mounted the volume>, <server IP>, <other IP 1>, and <other IP 2>:
      consul agent -server -data-dir <where you mounted the volume> -node frozen_leader -advertise <server IP> -bind 0.0.0.0 -bootstrap-expect 3 -log-level INFO -retry-join <server IP> -retry-join <other IP 1> -retry-join <other IP 2>
    3. On the other two servers, run a variation of this command:
      You must fill in <other server 1|2>, <server IP>, <other IP 1>, and <other IP 2> (these differ between server 2 and server 3):
      consul agent -server -data-dir /tmp -node <other server 1|2> -advertise <server IP> -bind 0.0.0.0 -bootstrap-expect 3 -log-level INFO -retry-join <server IP> -retry-join <other IP 1> -retry-join <other IP 2>
  4. Make the server with the extra volume the leader
    1. consul operator raft list-peers
    2. consul operator raft transfer-leader -id=<leader id>
  5. Freeze the data dir volume with fsfreeze
    1. sudo fsfreeze --freeze <path to mounted data directory>
  6. Observe that the leader stays the same, but any write operation, e.g. consul kv put key value, hangs and eventually fails (a condensed sketch of steps 5-6 follows this list)
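
A condensed sketch of steps 5 and 6, run on the node that holds the extra volume (/mnt/consul-data is a hypothetical stand-in for wherever you mounted it):

consul operator raft list-peers            # confirm this node is the current leader
sudo fsfreeze --freeze /mnt/consul-data    # freeze the filesystem backing Consul's data-dir
time consul kv put my_key 123              # hangs for ~30 seconds, then fails with "timed out enqueuing operation"
consul operator raft list-peers            # leadership has not moved
sudo fsfreeze --unfreeze /mnt/consul-data  # thaw the volume once you're done observing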

Consul info for both Client and Server

Client info

No Clients

Server info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 0
build:
	prerelease =
	revision = 33e5727a
	version = 1.20.2
	version_metadata =
consul:
	acl = disabled
	bootstrap = false
	known_datacenters = 1
	leader = true
	leader_addr = 10.16.0.186:8300
	server = true
raft:
	applied_index = 2406
	commit_index = 2406
	fsm_pending = 0
	last_contact = 0
	last_log_index = 2406
	last_log_term = 10
	last_snapshot_index = 0
	last_snapshot_term = 0
	latest_configuration = [{Suffrage:Voter ID:1a672c77-928b-0b19-b9e9-c3311459c561 Address:10.16.0.186:8300} {Suffrage:Voter ID:f2eebb9c-8728-c7a5-8832-2e14eed914f8 Address:10.16.0.150:8300} {Suffrage:Voter ID:28f90464-9f81-069f-cb0b-aed769935752 Address:10.16.0.242:8300}]
	latest_configuration_index = 0
	num_peers = 2
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 10
runtime:
	arch = amd64
	cpu_count = 2
	goroutines = 218
	max_procs = 2
	os = linux
	version = go1.22.7
serf_lan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 10
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 8
	members = 3
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 6
	members = 3
	query_queue = 0
	query_time = 1

I compared the leader's and a follower's consul info output; they're virtually identical, so I'll just include the diff against one of the followers:

# diff Leader_consul_info Follower_consul_info
15c15
< 	leader = true
---
> 	leader = false
22c22
< 	last_contact = 0
---
> 	last_contact = 22.842754ms
35c35
< 	state = Leader
---
> 	state = Follower
40c40
< 	goroutines = 218
---
> 	goroutines = 144

For the agent configuration, see the startup commands in the reproduction steps; an equivalent HCL config file is sketched below.
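
The same node configuration expressed as an HCL file rather than CLI flags (a sketch; the field names are the standard config-file equivalents of the flags above, and the placeholders are unchanged from the reproduction steps):

# frozen_leader.hcl - equivalent of the flags passed on the node with the extra volume
server           = true
node_name        = "frozen_leader"
data_dir         = "<where you mounted the volume>"
advertise_addr   = "<server IP>"
bind_addr        = "0.0.0.0"
bootstrap_expect = 3
log_level        = "INFO"
retry_join       = ["<server IP>", "<other IP 1>", "<other IP 2>"]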

Operating system and Environment details

Ubuntu 22.04.5 LTS
x86_64

Log Fragments

I've included the logs seen during an attempted consul kv put my_data4 123; outside of that, there's nothing that looks interesting.
These logs occur at the end of the 30-second wait on the kv put.

...
2025-01-09T22:51:28.941Z [TRACE] agent.server: rpc_server_call: method=Status.RaftStats errored=false request_type=read rpc_type=net/rpc leader=true
2025-01-09T22:51:29.727Z [TRACE] agent.server: rpc_server_call: method=KVS.Apply errored=true request_type=write rpc_type=net/rpc leader=true target_datacenter=dc1 locality=local
2025-01-09T22:51:29.727Z [ERROR] agent.http: Request error: method=PUT url=/v1/kv/my_data4 from=127.0.0.1:15924 error="raft apply failed: timed out enqueuing operation"
2025-01-09T22:51:29.727Z [DEBUG] agent.http: Request finished: method=PUT url=/v1/kv/my_data4 from=127.0.0.1:15924 latency=30.000357704s
2025-01-09T22:51:29.727Z [DEBUG] agent: warning: request content-type is not supported: request-path=/v1/kv/my_data4
2025-01-09T22:51:29.728Z [DEBUG] agent: warning: response content-type header not explicitly set.: request-path=/v1/kv/my_data4
2025-01-09T22:51:30.288Z [TRACE] agent.server.usage_metrics: Starting usage run
2025-01-09T22:51:30.289Z [TRACE] agent.server: rpc_server_call: method=Status.RaftStats errored=false request_type=read rpc_type=net/rpc leader=true
...