From 9abc2d0a2a54e79ed2b3ce49994cf7896a25ab90 Mon Sep 17 00:00:00 2001
From: Jwahir Sundai
Date: Wed, 18 Dec 2024 16:26:59 -0600
Subject: [PATCH 1/2] adding initial pg

---
 docs/production-deployment/cloud/rto-rpo.mdx | 81 ++++++++++++++++++++
 sidebars.js | 1 +
 2 files changed, 82 insertions(+)
 create mode 100644 docs/production-deployment/cloud/rto-rpo.mdx

diff --git a/docs/production-deployment/cloud/rto-rpo.mdx b/docs/production-deployment/cloud/rto-rpo.mdx
new file mode 100644
index 0000000000..21186fb90b
--- /dev/null
+++ b/docs/production-deployment/cloud/rto-rpo.mdx
@@ -0,0 +1,81 @@
+---
+id: rpo-rto
+title: RPO and RTO - Temporal Cloud feature guide
+sidebar_label: RPO and RTO
+description: Understand the Recovery Point Objective (RPO) and Recovery Time Objective (RTO) in Temporal Cloud. Explore scenarios for Multi-Region Namespaces, Single-Region Namespaces, and Availability Zone Failures.
+slug: /cloud/rpo-rto
+toc_max_heading_level: 4
+keywords:
+ - temporal cloud
+ - RPO
+ - RTO
+ - Recovery Point Objective
+ - Recovery Time Objective
+tags:
+ - Temporal Cloud
+ - Recovery Point Objective
+ - Recovery Time Objective
+---
+
+Recovery Point Objective (RPO) and Recovery Time Objective (RTO) for Temporal Cloud can be considered within three scenarios:
+
+1. The near-zero RPO/20 minutes or less RTO for Temporal Cloud with Multi-Region Namespaces
+2. The eight-hour RPO/RTO Temporal Cloud reports for _regional_ failures for single-region namespaces
+3. The RPO/RTO Temporal Cloud guarantees for _availability zone_ failure.
+
+Which objective is relevant to your organization is driven by whether you map data center loss to a _regional_ loss or a _zonal_ loss.
+Temporal Cloud delivers different RPO/RTOs based on these scenarios because of the way our platform performs writes to our data provider.
+
+## Scenario: Multi Region Namespace, Regional Failure
+
+Temporal Cloud offers a "Multi Region Namespace" option in private preview.
+This option provides a push-button, multi-region, active-standby failover deployment for your Temporal Service.
+
+As Workflows progress in the active region, history events are asynchronously replicated to the standby region.
+In case of an incident or outage in the active region, Temporal Cloud will fail over to the standby region so that existing Workflow Executions will continue to run and new Executions can be started.
+
+**Recovery Point Objective (RPO) - Near Zero**
+
+Temporal Cloud is designed to limit data loss after recovery when the incident triggering the failover is resolved.
+The Recovery Point Objective (RPO) is near-zero.
+There may be a short period of time—the replication lag—during the incident when some data may be unavailable.
+
+**Recovery Time Objective (RTO) - 20 minutes**
+
+Recovery time objective (RTO) for Temporal Cloud is 20 minutes or less per incident.
+
+## Scenario: Single Region Namespace, Regional Failure
+
+Temporal Cloud Namespace data is backed up by our data provider.
+For a single region Namespace, data must be restored in order to recover in the event of regional failure (i.e., logical corruption).
+
+Temporal Cloud is beholden to our data provider backup constraints, so in this scenario it leads to the following objectives for regional failure:
+
+**Recovery Point Objective (RPO) - 8 hours**
+
+- Our data provider's “snapshot” duration, which is _4 hours_.
+- The time window of _4 hours_ allocated to detection of corruption point before we mitigate.
+
+**Recovery Time Objective (RTO) - 8 hours**
+
+- The time window of _4 hours_ allocated to detection of corruption point.
+- Our data provider restore time can be up to _4 hours_
+
+## Scenario: Availability Zone Failure
+
+Temporal Cells are deployed in three Availability Zones (AZs) in the same region.
+Our data provider is deployed with the same topology in three AZs in the same region.\
+**All writes to storage are synchronously replicated across AZs**, including our writes to Elasticsearch (ES).
+ES is eventually consistent, but this does not impact our RPO (there is no data loss).\
+This means there is _no_ logical corruption and restoration is done from a live replicated instance.
+This applies to both single region Namespaces and multi region Namespaces.\
+This leads to the following objectives for availability zone failure:
+
+### Recovery Point Objective (RPO) - 0
+
+Anything that gets committed into the zone is protected by replication in another AZ.
+
+### Recovery Time Objective (RTO) - 0
+
+Temporal is active-active across AZs.\
+We are writing to at least two AZs so there is no data loss.
diff --git a/sidebars.js b/sidebars.js
index 34e4a3b650..6fb135644e 100644
--- a/sidebars.js
+++ b/sidebars.js
@@ -336,6 +336,7 @@ module.exports = {
 ],
 },
 "production-deployment/cloud/multi-region",
+ "production-deployment/cloud/rpo-rto",
 {
 type: "category",
 label: "Temporal Nexus",
From df2f16105de199faea82929504dd1bb365c40994 Mon Sep 17 00:00:00 2001
From: Jwahir Sundai
Date: Tue, 4 Feb 2025 09:09:44 -0600
Subject: [PATCH 2/2] few edits

---
 docs/production-deployment/cloud/rto-rpo.mdx | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/docs/production-deployment/cloud/rto-rpo.mdx b/docs/production-deployment/cloud/rto-rpo.mdx
index 21186fb90b..3ef275379c 100644
--- a/docs/production-deployment/cloud/rto-rpo.mdx
+++ b/docs/production-deployment/cloud/rto-rpo.mdx
@@ -2,7 +2,7 @@
 id: rpo-rto
 title: RPO and RTO - Temporal Cloud feature guide
 sidebar_label: RPO and RTO
-description: Understand the Recovery Point Objective (RPO) and Recovery Time Objective (RTO) in Temporal Cloud. Explore scenarios for Multi-Region Namespaces, Single-Region Namespaces, and Availability Zone Failures.
+description: Understand the Recovery Point Objective (RPO) and Recovery Time Objective (RTO) in Temporal Cloud.
 slug: /cloud/rpo-rto
 toc_max_heading_level: 4
 keywords:
@@ -19,17 +19,17 @@ tags:
 
 Recovery Point Objective (RPO) and Recovery Time Objective (RTO) for Temporal Cloud can be considered within three scenarios:
 
-1. The near-zero RPO/20 minutes or less RTO for Temporal Cloud with Multi-Region Namespaces
+1. The near-zero RPO/20 minutes or less RTO for Temporal Cloud with [Multi-Region Namespaces](/cloud/multi-region)
 2. The eight-hour RPO/RTO Temporal Cloud reports for _regional_ failures for single-region namespaces
-3. The RPO/RTO Temporal Cloud guarantees for _availability zone_ failure.
+3. The RPO/RTO Temporal Cloud guarantees for _availability zone_ failure
 
 Which objective is relevant to your organization is driven by whether you map data center loss to a _regional_ loss or a _zonal_ loss.
 Temporal Cloud delivers different RPO/RTOs based on these scenarios because of the way our platform performs writes to our data provider.
 
-## Scenario: Multi Region Namespace, Regional Failure
+## Multi Region Namespace, Regional Failure
 
-Temporal Cloud offers a "Multi Region Namespace" option in private preview.
-This option provides a push-button, multi-region, active-standby failover deployment for your Temporal Service.
+Temporal Cloud offers a "Multi Region Namespace" option.
+Multi Region Namespace provides a push-button, multi-region, active-standby failover deployment for your Temporal Service.
 
 As Workflows progress in the active region, history events are asynchronously replicated to the standby region.
 In case of an incident or outage in the active region, Temporal Cloud will fail over to the standby region so that existing Workflow Executions will continue to run and new Executions can be started.
@@ -44,7 +44,7 @@ There may be a short period of time—the replication lag—during the incident
 
 Recovery time objective (RTO) for Temporal Cloud is 20 minutes or less per incident.
 
-## Scenario: Single Region Namespace, Regional Failure
+## Single Region Namespace, Regional Failure
 
 Temporal Cloud Namespace data is backed up by our data provider.
 For a single region Namespace, data must be restored in order to recover in the event of regional failure (i.e., logical corruption).
@@ -61,7 +61,7 @@ Temporal Cloud is beholden to our data provider backup constraints, so in this s
 - The time window of _4 hours_ allocated to detection of corruption point.
 - Our data provider restore time can be up to _4 hours_
 
-## Scenario: Availability Zone Failure
+## Availability Zone Failure
 
 Temporal Cells are deployed in three Availability Zones (AZs) in the same region.
 Our data provider is deployed with the same topology in three AZs in the same region.\