Overview of the Issue
When a Consul leader's disk freezes or becomes unresponsive, leadership does not transition to the remaining instances in the cluster. The leader isn't detected as unhealthy.
Observed behaviour: Leadership remains unchanged, but all writes to Consul hang and eventually fail with "agent.server: failed to wait for barrier: error="timed out enqueuing operation"".
Expected behaviour: Leadership is removed from the unhealthy instance and another instance in the cluster takes over and handles writes.
Reproduction Steps
Start up and configure the servers
I started three servers in AWS.
On one server, attach an additional volume to be used as Consul's data-dir.
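The device name and mount point below are examples only (not from the original setup); on an AWS Nitro instance, formatting and mounting the extra volume might look like this:
sudo mkfs.ext4 /dev/nvme1n1
sudo mkdir -p /mnt/consul-data
sudo mount /dev/nvme1n1 /mnt/consul-data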
On the server with the additional volume:
You must fill in <where you mounted the volume>, <server IP>, <other IP 1>, and <other IP 2>:
consul agent -server -data-dir <where you mounted the volume> -node frozen_leader -advertise <server IP> -bind 0.0.0.0 -bootstrap-expect 3 -log-level INFO -retry-join <server IP> -retry-join <other IP 1> -retry-join <other IP 2>
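For illustration only, with a made-up mount point and private IPs (10.0.1.10 for this server, 10.0.1.11 and 10.0.1.12 for the others), the filled-in command would look like:
consul agent -server -data-dir /mnt/consul-data -node frozen_leader -advertise 10.0.1.10 -bind 0.0.0.0 -bootstrap-expect 3 -log-level INFO -retry-join 10.0.1.10 -retry-join 10.0.1.11 -retry-join 10.0.1.12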
On the other two servers, run a variation of this command:
You must fill in <other server 1|2>, <server IP>, <other IP 1>, and <other IP 2> (different for server 2 vs server 3):
consul agent -server -data-dir /tmp -node <other server 1|2> -advertise <server IP> -bind 0.0.0.0 -bootstrap-expect 3 -log-level INFO -retry-join <server IP> -retry-join <other IP 1> -retry-join <other IP 2>
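For example, on the second server (again with a made-up node name and IPs) this would be:
consul agent -server -data-dir /tmp -node server2 -advertise 10.0.1.11 -bind 0.0.0.0 -bootstrap-expect 3 -log-level INFO -retry-join 10.0.1.10 -retry-join 10.0.1.11 -retry-join 10.0.1.12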
Make the server with the extra volume the leader
consul operator raft list-peers
consul operator raft transfer-leader -id=<leader id>
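list-peers prints a table along these lines (node names, IDs, and addresses here are invented); the ID column provides the value to pass to the -id flag so that the server with the extra volume becomes the leader:
Node           ID                                    Address         State     Voter  RaftProtocol
server2        aaaaaaaa-bbbb-cccc-dddd-000000000002  10.0.1.11:8300  leader    true   3
frozen_leader  aaaaaaaa-bbbb-cccc-dddd-000000000001  10.0.1.10:8300  follower  true   3
server3        aaaaaaaa-bbbb-cccc-dddd-000000000003  10.0.1.12:8300  follower  true   3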
Freeze the data dir volume with fsfreeze
sudo fsfreeze --freeze <path to mounted data directory>
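To undo the freeze once you're done testing, thaw the filesystem:
sudo fsfreeze --unfreeze <path to mounted data directory>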
Observe that the leader stays the same, but any write operation, e.g. consul kv put key value, hangs and eventually fails.
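One way to confirm that leadership has not moved is to re-run:
consul operator raft list-peers
It should still show frozen_leader as the leader while the writes are hanging.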
Consul info for both Client and Server
Client info
No Clients
Server info
I compared leader vs follower consul info output; they're virtually identical, so I'll just include the diff of one of them:
For config, see the startup command in the reproduction steps.
Operating system and Environment details
Ubuntu 22.04.05 LTS
x86_64
Log Fragments
I've included the logs seen during an attempted consul kv put my_data4 123; outside of that, there's nothing that looks interesting. These logs occur at the end of the 30-second wait on kv put.