EDU-3271: Adding RTO and RPO page under Temporal Cloud #3247

Open
wants to merge 4 commits into
base: main
81 changes: 81 additions & 0 deletions docs/production-deployment/cloud/rto-rpo.mdx
@@ -0,0 +1,81 @@
---
id: rpo-rto
title: RPO and RTO - Temporal Cloud feature guide
sidebar_label: RPO and RTO
description: Understand the Recovery Point Objective (RPO) and Recovery Time Objective (RTO) in Temporal Cloud. Explore scenarios for Multi-Region Namespaces, Single-Region Namespaces, and Availability Zone Failures.
slug: /cloud/rpo-rto
toc_max_heading_level: 4
keywords:
- temporal cloud
- RPO
- RTO
- Recovery Point Objective
- Recovery Time Objective
tags:
- Temporal Cloud
- Recovery Point Objective
- Recovery Time Objective
---

Recovery Point Objective (RPO) and Recovery Time Objective (RTO) for Temporal Cloud can be considered within three scenarios:
Review comment (Contributor):

The wording feels like it can be simplified for readability. Consider throwing it into a scanner to identify opportunities for clarity.


1. The near-zero RPO and the RTO of 20 minutes or less for Temporal Cloud with Multi-Region Namespaces.
Review comment (Contributor):

You jump right in here with how RTO/RPO are used but not what they are. I think if you expand the introduction it would be a great place to set the scene for visitors to better understand the acronyms beyond what they stand for.

2. The eight-hour RPO/RTO Temporal Cloud reports for _regional_ failures for single-region Namespaces.
3. The RPO/RTO Temporal Cloud guarantees for _availability zone_ failures.

Which objective applies to your organization depends on whether you map the loss of a data center to a _regional_ loss or a _zonal_ loss.
jsundai marked this conversation as resolved.
Temporal Cloud delivers different RPOs and RTOs in these scenarios because of the way our platform performs writes to our data provider.

## Scenario: Multi-Region Namespace, Regional Failure

Temporal Cloud offers a "Multi-Region Namespace" option in private preview.
Review comment (Contributor):

  • Mentioning the release stage here makes it very hard to find and update when MRN goes out of private preview.
  • I think you should consider adding a link to cloud/multi-region

Reply (Contributor Author):

Good point.

This option provides a push-button, multi-region, active-standby failover deployment for your Temporal Service.

As Workflows progress in the active region, history events are asynchronously replicated to the standby region.
In case of an incident or outage in the active region, Temporal Cloud will fail over to the standby region so that existing Workflow Executions will continue to run and new Executions can be started.

**Recovery Point Objective (RPO) - Near Zero**

Temporal Cloud is designed to limit data loss after recovery when the incident triggering the failover is resolved.
The Recovery Point Objective (RPO) is near zero.
During the incident, there may be a short period of time (the replication lag) when some data is unavailable.
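To make the replication lag concrete, here is a minimal sketch, not Temporal Cloud's actual implementation. The `Region` class, the `replicate` and `replication_lag` helpers, and the batch size are all hypothetical names chosen for illustration; the only thing taken from the text above is that history events are replicated asynchronously, so the standby can trail the active region at the moment of failover.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    """Hypothetical stand-in for a region holding Workflow history events."""
    name: str
    events: list = field(default_factory=list)


def replicate(active: Region, standby: Region, batch: int) -> None:
    """Copy up to `batch` not-yet-replicated events to the standby (asynchronous step)."""
    start = len(standby.events)
    standby.events.extend(active.events[start:start + batch])


def replication_lag(active: Region, standby: Region) -> int:
    """Number of events written in the active region but not yet present in the standby."""
    return len(active.events) - len(standby.events)


active = Region("active")
standby = Region("standby")

# Five events are written in the active region; only three have replicated so far.
for i in range(5):
    active.events.append(f"event-{i}")
replicate(active, standby, batch=3)

# If a failover happened now, these events would be temporarily unavailable.
lag_at_failover = replication_lag(active, standby)
```

In this toy model, the events counted by `replication_lag` are the data that may be unavailable during the incident; once replication catches up, the lag returns to zero.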

**Recovery Time Objective (RTO) - 20 minutes**

The Recovery Time Objective (RTO) for Temporal Cloud is 20 minutes or less per incident.

## Scenario: Single-Region Namespace, Regional Failure

Temporal Cloud Namespace data is backed up by our data provider.
jsundai marked this conversation as resolved.
For a single-region Namespace, data must be restored in order to recover in the event of a regional failure (i.e., logical corruption).

Temporal Cloud is bound by our data provider's backup constraints, which in this scenario lead to the following objectives for regional failure:

**Recovery Point Objective (RPO) - 8 hours**

The 8-hour objective is the sum of:

- Our data provider's “snapshot” duration, which is _4 hours_.
- The _4-hour_ time window allocated to detecting the corruption point before we mitigate.
jsundai marked this conversation as resolved.

**Recovery Time Objective (RTO) - 8 hours**

The 8-hour objective is the sum of:

- The _4-hour_ time window allocated to detecting the corruption point.
- Our data provider's restore time, which can be up to _4 hours_.
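The arithmetic behind the two 8-hour figures can be written out as a short sketch. The constant and function names below are hypothetical; the 4-hour values are the ones listed in the bullets above.

```python
from datetime import timedelta

# Values taken from the bullets above; the names are illustrative only.
SNAPSHOT_DURATION = timedelta(hours=4)   # data provider "snapshot" duration
DETECTION_WINDOW = timedelta(hours=4)    # time allocated to detect the corruption point
MAX_RESTORE_TIME = timedelta(hours=4)    # data provider restore time (upper bound)


def single_region_rpo() -> timedelta:
    """Worst-case data-loss window: snapshot duration plus detection window."""
    return SNAPSHOT_DURATION + DETECTION_WINDOW


def single_region_rto() -> timedelta:
    """Worst-case recovery time: detection window plus restore time."""
    return DETECTION_WINDOW + MAX_RESTORE_TIME


rpo = single_region_rpo()
rto = single_region_rto()
```

Both results come to 8 hours, but for different reasons: the detection window is counted in each, while the snapshot duration affects only the RPO and the restore time affects only the RTO.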

## Scenario: Availability Zone Failure

Temporal Cells are deployed in three Availability Zones (AZs) in the same region.
jsundai marked this conversation as resolved.
Our data provider is deployed with the same topology in three AZs in the same region.\
**All writes to storage are synchronously replicated across AZs**, including our writes to Elasticsearch (ES).
ES is eventually consistent, but this does not impact our RPO (there is no data loss).\

Review comment:

> this does not impact our RPO (there is no data loss).

Might change it to "losing an AZ will not result in data loss or unavailability of Temporal service".

Review comment (Contributor):

Do we need the backslashes here?

This means there is _no_ logical corruption and restoration is done from a live replicated instance.
This applies to both single-region Namespaces and multi-region Namespaces.\
This leads to the following objectives for availability zone failure:

### Recovery Point Objective (RPO) \- 0\.
Review comment (Contributor):

This and the following header could be misinterpreted. If this means no data loss and instant recovery, consider clarifying. Kapa says "the RTO is stated to be zero, meaning there should be no downtime in such scenarios."


Anything that gets committed into the zone is protected by replication in another AZ.

### Recovery Time Objective (RTO) \- 0\.

Temporal is active-active across AZs.\
We write to at least two AZs, so there is no data loss.
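The idea of acknowledging a write only after it is stored in at least two AZs can be sketched as follows. This is an illustrative model, not Temporal Cloud's or the data provider's real write path: the `AvailabilityZone` class, `synchronous_write`, and the `min_acks` parameter are hypothetical, and the threshold of two is taken from the sentence above.

```python
class AvailabilityZone:
    """Hypothetical stand-in for one AZ's copy of the data."""

    def __init__(self, name: str, healthy: bool = True) -> None:
        self.name = name
        self.healthy = healthy
        self.log = []

    def persist(self, record: str) -> bool:
        """Store the record if this AZ is reachable; report success or failure."""
        if not self.healthy:
            return False
        self.log.append(record)
        return True


def synchronous_write(zones, record: str, min_acks: int = 2) -> bool:
    """Acknowledge the write only if at least `min_acks` AZs persisted it.

    Simplification: a real system would also handle records that reached
    fewer than `min_acks` zones; this sketch only reports the outcome.
    """
    acks = sum(1 for zone in zones if zone.persist(record))
    return acks >= min_acks


zones = [AvailabilityZone("az-1"), AvailabilityZone("az-2"), AvailabilityZone("az-3")]

# Losing one AZ still leaves two copies, so the write is acknowledged.
zones[0].healthy = False
committed = synchronous_write(zones, "workflow-history-event")
surviving_copies = [z.name for z in zones if "workflow-history-event" in z.log]
```

Under these assumptions, any record that was acknowledged exists in at least two zones, which is why the loss of a single AZ does not lose committed data.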
1 change: 1 addition & 0 deletions sidebars.js
@@ -336,6 +336,7 @@ module.exports = {
],
},
"production-deployment/cloud/multi-region",
"production-deployment/cloud/rpo-rto",
{
type: "category",
label: "Temporal Nexus",