domain: github.com
shortname: 16/RCE
name: Recovery Coordination Engine
status: raw
editor: Valery V. Vorotyntsev <[email protected]>

Keywords: BQ, EQ, RC, event, rule

Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Implementation

Prerequisites

  • Consul KV store MUST have eq/ and bq/ key prefixes.

  • Hare software package MUST provide h0q CLI utility. Users SHOULD use this utility to put entries into EQ and BQ.

    Usage:

    CONSUL_ACL_TOKEN=<eq-write-token> h0q <key-prefix> <value>

Event Queue (EQ)

The EQ is the queue of incoming events that outside entities (e.g., Motr, HA, human operator) want Hare RC to know about.

The EQ is represented by Consul KV entries with eq/ prefix.

Adding an event to the EQ:

CONSUL_ACL_TOKEN=<eq-write-token> h0q eq \
    '{ "type": "<event-type>", "payload": "<event-payload>" }'

Supported event types and their payloads are specified in 19/EVERULES.
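
As an illustration of the event format above, here is a minimal Python sketch that serializes an event and assembles the h0q invocation without running it. The helper names (`make_event`, `h0q_invocation`), the event type "foo", and the payload "disk-1" are hypothetical; only the `h0q <key-prefix> <value>` usage and the CONSUL_ACL_TOKEN variable come from this document.

```python
import json
import os


def make_event(event_type: str, payload: str) -> str:
    """Serialize an event as {"type": ..., "payload": ...}."""
    return json.dumps({"type": event_type, "payload": payload})


def h0q_invocation(prefix: str, value: str, token: str):
    """Return (argv, env) for `h0q <key-prefix> <value>`; nothing is executed."""
    env = dict(os.environ, CONSUL_ACL_TOKEN=token)
    return ["h0q", prefix, value], env


# Hypothetical event type and payload, for illustration only.
argv, env = h0q_invocation("eq", make_event("foo", "disk-1"), "<eq-write-token>")
```

On a node where h0q is installed, `argv` and `env` could then be handed to `subprocess.run(argv, env=env, check=True)`.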

Recovery Coordinator (RC)

  1. On the Consul leader node, a watch of "keyprefix" type MUST be configured on the eq/ key prefix. Whenever the EQ is modified, the handler of this watch will execute the RC and pass it the full contents of the EQ in JSON format via stdin.

  2. There MUST NOT be several simultaneously running RC instances.

  3. The RC MUST process all the events in the EQ.

  4. To process an event, the RC finds the rule associated with this event type and executes it.

    1. Rules are executable files.

    2. Rules MUST obtain event payload from the standard input.

    3. Rules SHOULD have effects, e.g., put an item into the BQ/EQ, add a new entry to the system log, or execute a shell command.

    Supported rules and their effects are specified in 19/EVERULES.

  5. It is RECOMMENDED to define a special _default rule. The RC SHALL apply the _default rule if there is no rule associated with the type of processed event.

  6. The RC MUST abort a rule that runs longer than the predetermined <timeout>.

  7. If a rule terminates with a nonzero exit code, the RC SHALL log this error in the systemd log.

  8. The same set of rules MUST be installed in the same directory on every Consul server node.

    E.g., a directory of rules that handle events of types "foo", "bar", and "baz" would look like this:

rules/
 \_ _default
 \_ bar
 \_ baz
 \_ foo

  9. The RC and rules SHOULD take configuration parameters from environment variables and SHOULD NOT use command line options.

  10. The RC MUST delete processed events from the EQ.
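
The requirements above can be sketched in Python as follows. This is an illustration under stated assumptions, not Hare's actual RC. It relies on Consul's keyprefix watch output being a JSON array of KV entries with base64-encoded "Value" fields. All function names are hypothetical, and `delete_key` stands in for whatever mechanism removes a key from the EQ. Errors go through Python's `logging`, on the assumption that the RC's stderr is captured by the systemd journal. Preventing simultaneous RC instances is not covered here.

```python
import base64
import json
import logging
import os
import subprocess

log = logging.getLogger("rc")


def decode_eq(watch_json: str):
    """Turn Consul keyprefix watch output into (key, event) pairs.

    Each entry's "Value" is base64-encoded; an empty result may be null.
    """
    entries = json.loads(watch_json) or []
    return [(e["Key"], json.loads(base64.b64decode(e["Value"]))) for e in entries]


def find_rule(rules_dir: str, event_type: str):
    """Return the rule for event_type, falling back to _default, else None."""
    # Refuse event types containing a path separator; try only _default then.
    names = ["_default"] if "/" in event_type else [event_type, "_default"]
    for name in names:
        path = os.path.join(rules_dir, name)
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    return None


def run_rule(rule: str, payload: str, timeout: float) -> bool:
    """Feed the payload to the rule on stdin; abort on timeout, log failures."""
    try:
        proc = subprocess.run([rule], input=payload, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        log.error("rule %s aborted after %s s", rule, timeout)
        return False
    if proc.returncode != 0:
        log.error("rule %s exited with code %d", rule, proc.returncode)
        return False
    return True


def process_eq(watch_json: str, rules_dir: str, timeout: float, delete_key):
    """Process every event in the EQ, then delete it via delete_key(key)."""
    for key, event in decode_eq(watch_json):
        rule = find_rule(rules_dir, event["type"])
        if rule is None:
            log.error("no rule for event type %r", event["type"])
        else:
            run_rule(rule, event["payload"], timeout)
        delete_key(key)
```

A watch handler built on this sketch would call `process_eq(sys.stdin.read(), ...)`, reading the rules directory and the timeout from environment variables as item 9 recommends.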

Broadcast Queue (BQ)

The BQ is the queue of outgoing messages that Hare RC wants to be delivered to all Motr processes.

The BQ is represented by Consul KV entries with bq/ prefix.

Adding a message to the BQ:

CONSUL_ACL_TOKEN=<bq-write-token> h0q bq <message>

  • Only the RC SHOULD be able to modify the BQ.

  • On every Consul node, a watch of "keyprefix" type MUST be configured on the bq/ key prefix. Whenever the BQ is modified, these watches will send the full contents of the BQ in JSON format to local Hax processes over HTTP.
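
One possible way to express such a watch is Consul's HTTP watch handler. The Python sketch below generates an agent configuration fragment for it. The function name `bq_watch_config` is hypothetical, and the Hax URL is a placeholder, since this document does not specify the endpoint Hax listens on.

```python
import json


def bq_watch_config(hax_url: str, timeout: str = "10s") -> dict:
    """Build a Consul agent config fragment with an HTTP watch on bq/."""
    return {
        "watches": [
            {
                "type": "keyprefix",
                "prefix": "bq/",
                "handler_type": "http",
                "http_handler_config": {
                    "path": hax_url,
                    "method": "POST",
                    "timeout": timeout,
                },
            }
        ]
    }


# The URL below is a placeholder, not Hax's actual endpoint.
config_text = json.dumps(bq_watch_config("http://localhost:8008/"), indent=2)
```

Writing `config_text` into the agent's configuration directory on each node would register the watch when the agent loads its configuration.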

Examples of Fault Handling

[Figure: cluster-faults]

1. Disk failure

  • Detected by Motr IOS when it tries to perform an I/O operation.
  • The IOS sends M0_HA_MSG_STOB_IOQ message to the local Hax.
  • Hax puts an event into the EQ. This triggers the RC.
  • The RC applies the corresponding rule.
  • The rule puts a message into the BQ.
  • Consul BQ watch handlers send an HTTP request with watch invocation data (contents of the BQ) to all Haxes.
  • Upon receiving the request, each Hax sends the notification (M0_HA_MSG_NVEC) to its connected Motr processes.
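
The sequence above can be traced with a toy in-memory model. The sketch below is only an illustration of the message flow. The `Cluster` class, the event type "stob-ioq", the payload, and the message text are all hypothetical, and no Consul, Motr, or Hax APIs are involved.

```python
from collections import deque


class Cluster:
    """Toy in-memory stand-in for the EQ, the BQ, and the Hax processes."""

    def __init__(self, hax_names, rules):
        self.eq = deque()
        self.bq = []
        self.rules = rules
        self.delivered = {name: [] for name in hax_names}

    def hax_report(self, event_type, payload):
        """Hax puts an event into the EQ; this triggers the RC."""
        self.eq.append({"type": event_type, "payload": payload})
        self.run_rc()

    def run_rc(self):
        """The RC applies the matching rule to each event and removes it."""
        while self.eq:
            event = self.eq.popleft()
            rule = self.rules.get(event["type"], self.rules.get("_default"))
            if rule is not None:
                rule(self, event["payload"])

    def bq_put(self, message):
        """A rule puts a message into the BQ; every Hax gets the BQ contents."""
        self.bq.append(message)
        for inbox in self.delivered.values():
            inbox.append(list(self.bq))


def disk_failure_rule(cluster, payload):
    # Hypothetical rule: broadcast the failure to all Motr processes.
    cluster.bq_put("disk failed: " + payload)


cluster = Cluster(["hax-1", "hax-2"], {"stob-ioq": disk_failure_rule})
cluster.hax_report("stob-ioq", "disk-7")
```

After the call, both Hax inboxes hold one snapshot of the BQ containing the broadcast message, and the EQ is empty again.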