From 56c607033346abb4c2ef26decfdd198653adeea9 Mon Sep 17 00:00:00 2001 From: Erica Sadun Date: Mon, 3 Feb 2025 13:33:55 -0700 Subject: [PATCH] EDU-3661: Monday Feb 3 Updates with Nikitha feedback - Some significant shifting of content - Info should be more readily consumed in order - Landing page is still not reconciled --- .../high-availability/best-practices.mdx | 218 +++++++++++++ .../cloud/high-availability/enable.mdx | 296 ++++-------------- .../cloud/high-availability/failovers.mdx | 73 +++++ .../cloud/high-availability/faq.mdx | 95 +++--- .../cloud/high-availability/guarantees.mdx | 61 ---- .../cloud/high-availability/how-it-works.mdx | 97 +++--- .../cloud/high-availability/index.mdx | 102 ++++-- .../cloud/high-availability/monitor.mdx | 81 ----- sidebars.js | 4 +- 9 files changed, 521 insertions(+), 506 deletions(-) create mode 100644 docs/production-deployment/cloud/high-availability/best-practices.mdx create mode 100644 docs/production-deployment/cloud/high-availability/failovers.mdx delete mode 100644 docs/production-deployment/cloud/high-availability/guarantees.mdx delete mode 100644 docs/production-deployment/cloud/high-availability/monitor.mdx diff --git a/docs/production-deployment/cloud/high-availability/best-practices.mdx b/docs/production-deployment/cloud/high-availability/best-practices.mdx new file mode 100644 index 0000000000..851876b676 --- /dev/null +++ b/docs/production-deployment/cloud/high-availability/best-practices.mdx @@ -0,0 +1,218 @@ +--- +id: best-practices +title: best-practices +sidebar_label: Best Practices +slug: /cloud/high-availability/best-practices +description: Temporal Cloud's High-Availability Namespaces offer automated failover, synchronized data replication, and high availability for workloads requiring disaster-tolerant deployment and 99.99% uptime. Use Global Namespace for self-hosted. 
+tags: + - Temporal Cloud + - Production + - High availability +keywords: + - availability + - explanation + - failover + - high-availability + - namespaces + - temporal-cloud + - term +--- +import { RelatedReadContainer, RelatedReadItem } from '@site/src/components/related-read/RelatedRead'; + +This page collects best practices related to high availability Namespaces. + +- [Preparing Worker deployment](/cloud/high-availability/best-practices#worker-deployment) +- [Set up secure routing for failovers](/cloud/high-availability/best-practices#routing) +- [PrivateLink routing](/cloud/high-availability/best-practices#privatelink-routing) +- [High availability Failover testing](/cloud/high-availability/best-practices#testing) +- [Monitoring replication lag metrics](/cloud/high-availability/best-practices#metrics) +- [Viewing operational events](/cloud/high-availability/best-practices#events) + +## High availability Failover testing {#testing} + +Regular failover testing ensures your app can handle disruptions and continue running smoothly in production. + +Microservices and external dependencies will fail at some point. +Testing failovers ensures your app can handle these failures effectively. + +Temporal recommends regular and periodic failover testing for mission-critical applications in production. +By testing in non-emergency conditions, you verify that your app continues to function even when parts of the infrastructure fail. + +:::tip Safety First + +If this is your first time performing a failover test, run it with a test-specific namespace and application. +This helps you gain operational experience before applying it to your production environment. +Practice runs help ensure the process runs smoothly during real incidents in production. + +::: + +Trigger testing can: + +- **Validate high-availability deployments**: + In multi-region setups, failover testing ensures your app can run from another region when the primary region experiences outages. 
+ In single-region setups, failover testing instead exercises the switch to a replica in another isolation domain. + This maintains high availability in mission-critical deployments. + Manual testing confirms the failover mechanism works as expected, so your system handles incidents effectively. + +- **Assess replication lag**: + In multi-region deployments, monitoring [replication lag](#metrics) between regions is crucial. + Check the lag before initiating a failover to avoid rolling back Workflow progress. + This is less important when using isolation domains as failover is usually instantaneous. + Manual testing helps you practice this critical step and understand its impact. + When there's no real incident, the switchover (recovery) should happen almost instantly. + A switchover within a single region should also be nearly instantaneous. + +- **Assess recovery time**: + Manual testing helps you measure actual recovery time. + You can check if it meets your expected Recovery Time Objective (RTO) of 20 minutes or less, as stated in the [High availability Namespace SLA](/cloud/high-availability#sla). + +- **Identify potential issues**: + Failover testing uncovers problems not visible during normal operation. + This includes issues like [backlogs and capacity planning](https://temporal.io/blog/workers-in-production#testing-failure-paths-2438) and how external dependencies behave during a failover event. + +- **Validate fault-oblivious programming**: + Temporal uses a "fault-oblivious programming" model, where your app doesn’t need to explicitly handle many types of failures. + Testing failovers ensures that this model works as expected in your app. + +- **Operational readiness**: + Regular testing familiarizes your team with the failover process, improving their ability to handle real incidents when they arise. + +Testing failovers regularly ensures your Temporal-based applications remain resilient and reliable, even when infrastructure fails. 
+ +## Preparing Worker deployment {#worker-deployment} + +Enabling high availability for Namespaces doesn't require specific Worker configuration. +The process is invisible to the Workers. +When a Namespace fails over to the replica, the DNS redirection orchestrated by Temporal ensures that your existing Workers continue to poll the Namespace without interruption. +More details are available in the [Routing](/cloud/high-availability/best-practices#routing) section. + +- When a Namespace fails over to a replica in a different region, Workers will be communicating cross-region. + If your application can’t tolerate this latency, deploy a second set of Workers in this region or opt for a replica in the same region. + +- In case of a complete regional outage, Workers in the original region may fail alongside the original Namespace. + To keep Workflows moving during this level of outage, deploy a second set of Workers to the secondary region. + +## Set up secure routing for failovers {#routing} + +
**This section needs fixing for regions vs isolation domains**
+ +When using a high availability Namespace, the Namespace's DNS record `..` targets a regional DNS record in the format `.region.`. +Here, `` is the currently active region for your Namespace. +Clients resolving the Namespace’s DNS record are directed to connect to the active region for that Namespace, thanks to the regional DNS record. + +During failover, Temporal Cloud changes the target of the Namespace DNS record from one region to another. +Namespace DNS records are configured with a 15-second TTL. +Any DNS cache should re-resolve the record within this delay. +As a rule of thumb, DNS reconciliation takes no longer than twice (2x) the TTL. +Clients should converge to the newly targeted region within, at most, a 30-second delay. + +## PrivateLink routing {#privatelink-routing} + +
**This section needs fixing for regions vs isolation domains**
+ +:::important + +Some networking configuration is required for failover to be transparent to clients and Workers when using PrivateLink. +This section describes how to configure routing for multi-region Namespaces for PrivateLink customers only. + +::: + +PrivateLink customers may need to change certain configurations for multi-region Namespace use. +Routing configuration depends on networking setup and use of PrivateLink. +You may need to: + +- override a DNS zone; and +- ensure network connectivity between the two regions. + +![Customer side solution example](/img/multi-region/private-link.png) + +When using PrivateLink, you connect to Temporal Cloud using IP addresses local to your network. +The `region.` zone is configured in the Temporal systems as an independent zone. +This allows you to override it to make sure traffic is routed internally for the regions in use. +You can check the Namespace's active region using the Namespace record CNAME, which is public. + +To set up the DNS override, you override specific regions to target the relevant IP addresses (for example, `aws-us-west-1.region.tmprl.cloud` to target `192.168.1.2`). +Using AWS, this can be done using a private hosted zone in Route53 for `region.`. +Link that private zone to the VPCs you use for Workers. +PrivateLink is not yet offered for GCP multi-region Namespaces. + +When your Workers connect to the Namespace, they first resolve the `..` record. +This targets `.region.` using a CNAME. +Your private zone overrides that second DNS resolution, leading traffic to reach the internal IP you're using. + +Consider how you'll configure Workers to run in this scenario. +You might set Workers to run in both regions at all times. +Alternatively, you could establish connectivity between the regions to redirect Workers once failover occurs. + +The following table lists Temporal's available regions, PrivateLink endpoints, and DNS record overrides. 
+The `sa-east-1` region listed here is not yet available for use with multi-region Namespaces. + +| Region | PrivateLink Service Name | DNS Record Override | +| ---------------- | -------------------------------------------------------------- | --------------------------------------- | +| `ap-northeast-1` | `com.amazonaws.vpce.ap-northeast-1.vpce-svc-08f34c33f9fb8a48a` | `aws-ap-northeast-1.region.tmprl.cloud` | +| `ap-northeast-2` | `com.amazonaws.vpce.ap-northeast-2.vpce-svc-08c4d5445a5aad308` | `aws-ap-northeast-2.region.tmprl.cloud` | +| `ap-south-1` | `com.amazonaws.vpce.ap-south-1.vpce-svc-0ad4f8ed56db15662` | `aws-ap-south-1.region.tmprl.cloud` | +| `ap-south-2` | `com.amazonaws.vpce.ap-south-2.vpce-svc-08bcf602b646c69c1` | `aws-ap-south-2.region.tmprl.cloud` | +| `ap-southeast-1` | `com.amazonaws.vpce.ap-southeast-1.vpce-svc-05c24096fa89b0ccd` | `aws-ap-southeast-1.region.tmprl.cloud` | +| `ap-southeast-2` | `com.amazonaws.vpce.ap-southeast-2.vpce-svc-0634f9628e3c15b08` | `aws-ap-southeast-2.region.tmprl.cloud` | +| `ca-central-1` | `com.amazonaws.vpce.ca-central-1.vpce-svc-080a781925d0b1d9d` | `aws-ca-central-1.region.tmprl.cloud` | +| `eu-central-1` | `com.amazonaws.vpce.eu-central-1.vpce-svc-073a419b36663a0f3` | `aws-eu-central-1.region.tmprl.cloud` | +| `eu-west-1` | `com.amazonaws.vpce.eu-west-1.vpce-svc-04388e89f3479b739` | `aws-eu-west-1.region.tmprl.cloud` | +| `eu-west-2` | `com.amazonaws.vpce.eu-west-2.vpce-svc-0ac7f9f07e7fb5695` | `aws-eu-west-2.region.tmprl.cloud` | +| `sa-east-1` | `com.amazonaws.vpce.sa-east-1.vpce-svc-0ca67a102f3ce525a` | `aws-sa-east-1.region.tmprl.cloud` | +| `us-east-1` | `com.amazonaws.vpce.us-east-1.vpce-svc-0822256b6575ea37f` | `aws-us-east-1.region.tmprl.cloud` | +| `us-east-2` | `com.amazonaws.vpce.us-east-2.vpce-svc-01b8dccfc6660d9d4` | `aws-us-east-2.region.tmprl.cloud` | +| `us-west-2` | `com.amazonaws.vpce.us-west-2.vpce-svc-0f44b3d7302816b94` | `aws-us-west-2.region.tmprl.cloud` | + +## Monitoring replication lag 
metrics {#metrics} + +Replication lag refers to the transmission delay of Workflow updates and history events from an active to a standby Namespace. +A forced failover when there is a large replication lag has a higher likelihood of rolling back Workflow progress. +Always check the replication lag metric before initiating a high availability failover, especially when working with multi-region deployments. + +Temporal Cloud emits three replication lag-specific [metrics](/production-deployment/cloud/metrics/reference#temporal_cloud_v0_replication_lag_bucket). + +- `temporal_cloud_v0_replication_lag_bucket`: + A histogram of replication lag during a specific time interval for a high availability Namespace. +- `temporal_cloud_v0_replication_lag_count`: + The replication lag count during a specific time interval for a high availability Namespace. +- `temporal_cloud_v0_replication_lag_sum`: + The sum of replication lag during a specific time interval for a high availability Namespace. + +The following samples demonstrate how you can use these metrics to explore replication lag. + +### P99 replication lag histogram + +``` +histogram_quantile(0.99, sum(rate(temporal_cloud_v0_replication_lag_bucket[$__rate_interval])) by (temporal_namespace, le)) +``` + +### Average replication lag + +``` +sum(rate(temporal_cloud_v0_replication_lag_sum[$__rate_interval])) by (temporal_namespace) +/ +sum(rate(temporal_cloud_v0_replication_lag_count[$__rate_interval])) by (temporal_namespace) +``` + +## Viewing operational events {#events} + +You can view and alert on key cloud metrics using the Web UI, the 'tcld' CLI utility, and Temporal Cloud APIs. +For example, during the process of adding a region or isolation domain to a Namespace, you can see the progress of Workflow replication. +Errors -- if any occur -- will also surface in the Namespace Web UI. + +:::info + +You may notice that high-availability Namespaces show twice (2x) the Action count in `temporal_cloud_v0_total_action_count`. 
+This doubling happens due to regional replication. + +::: + +Temporal Cloud provides several ways to audit events: + +- When Temporal triggers failovers, the audit log updates with details. + Look specifically for `"operation": "FailoverNamespace"` in the logs. +- You can set alerts for Temporal-initiated failover events. +- After a failover, you can check that the Namespace is active in the new region using the Temporal Cloud Web UI. + + diff --git a/docs/production-deployment/cloud/high-availability/enable.mdx b/docs/production-deployment/cloud/high-availability/enable.mdx index 839b0fba79..bf20ab3d2e 100644 --- a/docs/production-deployment/cloud/high-availability/enable.mdx +++ b/docs/production-deployment/cloud/high-availability/enable.mdx @@ -1,8 +1,8 @@ --- id: enable -title: Enable high availability +title: Enable High Availability features sidebar_label: Enable high availability -slug: /cloud/high-availability/choosing-high-availability +slug: /cloud/high-availability/enable description: Temporal Cloud's High-Availability Namespaces offer automated failover, synchronized data replication, and high availability for workloads requiring disaster-tolerant deployment and 99.99% uptime. Use Global Namespace for self-hosted. tags: - Temporal Cloud @@ -21,276 +21,108 @@ keywords: --- import { RelatedReadContainer, RelatedReadItem } from '@site/src/components/related-read/RelatedRead'; -:::tip Support, stability, and dependency info +You enable High Availability features for a new or existing Namespace by adding a replica to the Namespace. +When you add a replica, Temporal Cloud begins replicating ongoing and existing Workflows. +Once the replication has completed and the replica is ready, your Namespace is ready for failover. -High-availability Namespaces are in [Public Preview](/evaluate/development-production-features/release-stages#public-preview) for Temporal Cloud. 
+- [Create a high availability Namespace](/cloud/high-availability/enable#create) +- [Upgrade an existing Namespace to high availability functionality](/cloud/high-availability/enable#upgrade) +- [Discontinuing high availability replicas](/cloud/high-availability/enable#discontinuing) -::: - -
**Some audits, updates. Needs intros, re-org. Suggest breaking down into "opting in", "setting up (worker and privatelink)", and "testing" because the content feels really mixed up right now and too long and the metrics section is now a little too short**
- -You can enable the high-availability Namespace feature for your existing Namespace by [adding a second zone](#add-zones) to your Namespace. -After adding the second zone, Temporal Cloud begins data replication for your new standby replica. -Temporal Cloud notifies you once the replication has caught up and both Namespace zones are in sync. - -**Advantages of using a high-availability Namespace:** - -- No manual deployment or configuration needed, just simple push-button operation. -- Open Workflows continue in the standby region with minimal interruption and data loss. -- No changes needed for Worker and Workflow code during setup or failover. -- 99.99% Contractual SLA. +## Create a high availability Namespace {#create} -### Create a multi-region Namespace {#create} +The following sections explain how to create a Namespace with a replica. +You can create a replica within the current region or deploy the replica to a different region. -The following sections explain how to create a new multi-region Namespace (MRN). -MRNs provide multi-region deployment backed by Temporal's data replication and active-standby features. +:::note -:::tip - -While reading through this coverage, remember that pairing is currently limited to regions within the same continent. +While reading through this coverage, be aware that replication is not supported in all regions. +For multi-region replication, pairing is limited to regions within the same continent. +For more details, refer to ["Regional availability"](/cloud/high-availabilityregional-availability). ::: -#### Temporal Cloud Web UI +### Temporal Cloud Web UI -During Namespace creation, specify the first region for the Namespace. -Then, select the “Add a region” option. -Adding a second region enables multi-region Namespace capabilities. +Follow these steps to add replication to your Temporal Cloud Namespace: -#### Temporal 'tcld' CLI +1. During Namespace creation, specify the first region for the Namespace. +2. 
Select the “Add a replica” option. + Adding a replica in the same region enables Replication. + Adding a replica in a second region enables Multi-region Replication. -Start with the following command to create the new multi-region Namespace: +### Temporal 'tcld' CLI -``` +Enter the following at the command-line to create a replicated Namespace: + +```sh tcld namespace create \ - --namespace . \ - --region + --namespace . \ + --region \ + --region ``` -Include both regions by specifying the [region codes](/cloud/service-availability) as arguments to the `--region` flags. -Before pressing return, add your authentication credentials. For example, `--ca-certificate-file `. - - +Specify the [region codes](/cloud/service-availability) as arguments to the two `--region` flags. +- Using the same region replicates to an isolation zone within that region. +- Using a different region within the same continent creates a multi-region Namespace. -## Upgrade an existing single-zone Namespace for high-availability functionality {#add-zones} +Before pressing return, add your authentication credentials. +For example, `--ca-certificate-file `. -You can upgrade existing ssingle-zone Namespace for high-availability by adding a standby zone. -The following sections show you how. +## Upgrade an existing Namespace to high availability functionality {#upgrade} -
**The following material has not been audited for MRN/HAN**
+Upgrade an existing single-region Namespace to high availability features by establishing a replica. +The following sections explain how. +You can either create a replica within the current region or deploy the replica to a different region. -#### Temporal Cloud Web UI +### Temporal Cloud Web UI -To upgrade an existing Namespace to a multi-region Namespace: +Follow these steps to upgrade an existing Namespace: -1. Visit Temporal Cloud [Namespaces](https://cloud.temporal.io/namespaces) in your Web browser +1. Visit Temporal Cloud Namespaces in your Web browser 1. Navigate to the Namespace details page -1. Select the “Add a region” button. -1. Select the standby region you want to add to this Namespace +1. Select the “Add a replica” button. +1. Choose either **Replication** (in the same region) or **Multi-region Replication** (across regions). + If you select Multi-region Replication, specify which region -You will see an estimated time for replication. -This time is based on your selection and the size and scale of Workflows in your Namespace, -An email alert is sent once your multi-region Namespace is ready for use. +The web interface will present an estimated time for replication to complete. +This time is based on your selection and the size and scale of the Workflows in your Namespace. +An email alert is dispatched once your highly available Namespace is ready for use. -#### Temporal 'tcld' CLI +### Temporal 'tcld' CLI -At the command line, enter: +Enter the following at the command-line to upgrade a Namespace for replication: -``` +```sh tcld namespace add-region \ --namespace . \ --region ``` -Specify the region code for the new region to add. -Before pressing return, add your authentication credentials. For example, `--ca-certificate-file `. -An email alert is sent once your multi-region Namespace is ready for use. 
- -### Discontinuing multi-region availability {#discontinuing} - -Disabling multi-region removes the high availability and automatic failover features that provide Temporal's highest service level agreement. -To disable the feature and end charges, users must contact [Temporal Support](https://support.temporal.io) directly. -MRN-specific charges for replication will stop once this decommissioning procedure completes. - -- When making your request you must let us know which region you want the Namespace to land in after removing the standby region. -- If you cease services in the middle of the month, your Namespace will be converted to a single region Namespace within 1 business day. -- Temporal won't retain replicated data in the standby region once multi-region has been disabled. -- After disabling multi-region, Temporal Cloud cannot re-enable the feature for a given Namespace for seven days. - -## Triggering failovers {#triggering-failovers} - -Failovers happen automatically in Temporal when a regional outage or disaster affects a multi-region Namespace. -You can also trigger a failover based on custom alerts or for testing purposes. -This section explains how to manually trigger a failover and what to expect afterward. - -Regular failover testing ensures your app can handle disruptions and continue running smoothly in production. -Whether responding to incident warnings or conducting tests, follow the steps in the next sections to move your active Namespace to its standby region and learn how to handle failovers effectively. - -For details on how Temporal detects conditions and triggers failovers automatically, see [Failovers](/cloud/multi-region/#failovers). - -:::warning Check Your Replication Lag - -Always check the [metric replication lag](/production-deployment/cloud/metrics/reference#temporal_cloud_v0_replication_lag_bucket) before initiating a failover. 
-A forced failover when there is a large replication lag has a higher likelihood of rolling back Workflow progress. - -::: - -**Performing manual failovers** - -You can trigger a failover manually using the Temporal Cloud Web UI or the `tcld` CLI, depending on your preference and setup. -The following table outlines the steps for each method: - -| Method | Instructions | -| ------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| **Temporal Cloud Web UI** | 1. Visit the [Namespace page](https://cloud.temporal.io/namespaces) on the Temporal Cloud Web UI.
2. Navigate to your Namespace details page and select the **Trigger a failover** option from the menu.
3. After confirming, the failover will be initiated. | -| **Temporal `tcld` CLI** | To manually trigger a failover, run the following command in your terminal:
tcld namespace failover \
    --namespace \.\ \
    --region \ | - -**Post-failover event information** - -After any failover, whether triggered by you or by Temporal, event information appears in both the [Temporal Cloud Web UI](https://cloud.temporal.io/namespaces) (on the Namespace detail page) and in your audit logs. -The audit log entry for Failover uses the `"operation": "FailoverNamespace"` event. -After failover, the Namespace is active in the new region. - -You don't need to monitor Temporal Cloud's failover response in real-time. -Whenever there is a failover event, users with the Account Owner and Global Admin roles automatically receive an alert email. - -**Failbacks** - -After Temporal-initiated failovers, Temporal Cloud shifts Workflow Execution processing back to the original region that was active before the incident (a "failback") once the incident is resolved. - -**Reasons to test failing over** +Specify the added [region code](/cloud/service-availability) as an argument to the `--region` flag. -Microservices and external dependencies will fail at some point. -Testing failovers ensures your app can handle these failures effectively. -Temporal recommends regular and periodic failover testing for mission-critical applications in production. -By testing in non-emergency conditions, you verify that your app continues to function even when parts of the infrastructure fail. +- Using the current region replicates to an isolation zone within your existing region. +- Using a different region within the same continent creates a multi-region Namespace. -:::tip Safety First +Before pressing return, add your authentication credentials. +For example, `--ca-certificate-file `. -If this is your first time performing a failover test, run it with a test-specific namespace and application. -This helps you gain operational experience before applying it to your production environment. -Practice runs help ensure the process runs smoothly during real incidents in production. 
- -::: - -Trigger testing can: - -- **Validate multi-region deployments**: - In multi-region setups, failover testing ensures your app can run from another region when the primary region experiences outages. - This maintains high availability in mission-critical deployments. - Manual testing confirms the failover mechanism works as expected, so your system handles regional outages or disasters effectively. - -- **Assess replication lag**: - Monitoring [replication lag](/cloud/multi-region#metrics-operations) between regions is crucial in multi-region setups. - Check the lag before initiating a failover to avoid rolling back Workflow progress. - Manual testing helps you practice this critical step and understand its impact. - When there's no real incident, the switch over (recovery) should happen almost instantly. - -- **Assess recovery time**: - Manual testing helps you measure actual recovery time. - You can check if it meets your expected Recovery Time Objective (RTO) of 20 minutes or less, as stated in the [Multi-region Namespace SLA](/cloud/multi-region#sla). - -- **Identify potential issues**: - Failover testing uncovers problems not visible during normal operation. - This includes issues like [backlogs and capacity planning](https://temporal.io/blog/workers-in-production#testing-failure-paths-2438) and how external dependencies behave during a failover event. - -- **Validate fault-oblivious programming**: - Temporal uses a "fault-oblivious programming" model, where your app doesn’t need to explicitly handle many types of failures. - Testing failovers ensures that this model works as expected in your app. - -- **Operational readiness**: - Regular testing familiarizes your team with the failover process, improving their ability to handle real incidents when they arise. - -Testing failovers regularly ensures your Temporal-based applications remain resilient and reliable, even when infrastructure fails. 
- - -## Worker Deployment {#worker-deployment} - -Enabling the multi-region Namespace does not require specific Worker configuration. -The process is invisible to the Workers. -When a Namespace fails over to the standby region, the DNS redirection orchestrated by Temporal ensures that your existing Workers continue to poll the Namespace without interruption. -More details are available in the [Routing](/cloud/multi-region#routing) section below. - -:::info - -- When a Namespace fails over to a standby region, Workers will be communicating cross-region. - -- In case of a complete regional outage, Workers in the original region may fail alongside the original Namespace. - To keep Workflows moving during this level of outage, deploy a second set of Workers to your standby region. - -::: - -## Routing {#routing} - -When using multi-region for a Namespace, the Namespace's DNS record `..` targets a regional DNS record in the format `.region.`. -In this format, `` is the currently active region for your Namespace. -Clients resolving the Namespace’s DNS record are directed to connect to the active region for that Namespace, thanks to the regional DNS record. - -During failover, Temporal Cloud changes the target of the Namespace DNS record from one region to another. -Namespace DNS records are configured with a 15 seconds TTL. -Any DNS cache should re-resolve the record within this delay. As a rule of thumb, DNS reconciliation takes no longer than twice (2x) the TTL. -Clients should converge to the newly targeted region within, at, most a 30-second delay. - - -## PrivateLink routing {#privatelink-routing} - -:::important - -Some networking configuration is required for failover to be transparent to clients and workers when using PrivateLink. -This section describes how to configure routing for multi-region Namespaces for PrivateLink customers only. - -::: - -PrivateLink customers may need to change certain configurations for multi-region Namespace use. 
-Routing configuration depends on networking setup and use of PrivateLink. -You may need to: - -- override a DNS zone; and -- ensure the network connectivity between the two regions. - -![Customer side solution example](/img/multi-region/private-link.png) - -When using PrivateLink, you connect to Temporal Cloud using IP addresses local to your network. -The `region.` zone is configured in the Temporal systems as an independent zone. -This allows you to override it to make sure traffic is routed internally for the regions in use. -You can check the Namespace's active region using the Namespace record CNAME, which is public. - -To set up the DNS override, you override specific regions to target the relevant IP addresses (e.g. aws-us-west-1.region.tmprl.cloud to target 192.168.1.2). -Using AWS, this can be done using a private hosted zone in Route53 for `region.`. -Link that private zone to the VPCs you use for Workers. -Private Link is not yet offered for GCP multi-region Namespaces. +An email alert is sent once your multi-region Namespace is ready for use. -When your Workers connect to the Namespace, they first resolve the `..` record. -This targets `.region.` using a CNAME. Your private zone overrides that second DNS resolution, leading traffic to reach the internal IP you're using. +## Discontinuing high availability replicas {#discontinuing} -Consider how you'll configure Workers to run in this scenario. -You might set Workers to run in both regions at all times. -Alternately, you could establish connectivity between the regions to redirect Workers once failover occurs. +Removing a Namespace replica removes the high availability and automatic failover features that provide Temporal's highest service level agreement. +To disable these features and end charges: -The following table lists Temporal's available regions, PrivateLink endpoints, and DNS record overrides. -The `sa-east-1` region listed here is not yet available for use with multi-region Namespaces. +1. 
Navigate to the Namespace details page in Temporal Cloud. +2. On the “Region” card, select the option to “Remove Replica”. -| Region | PrivateLink Service Name | DNS Record Override | -| ---------------- | -------------------------------------------------------------- | --------------------------------------- | -| `ap-northeast-1` | `com.amazonaws.vpce.ap-northeast-1.vpce-svc-08f34c33f9fb8a48a` | `aws-ap-northeast-1.region.tmprl.cloud` | -| `ap-northeast-2` | `com.amazonaws.vpce.ap-northeast-2.vpce-svc-08c4d5445a5aad308` | `aws-ap-northeast-2.region.tmprl.cloud` | -| `ap-south-1` | `com.amazonaws.vpce.ap-south-1.vpce-svc-0ad4f8ed56db15662` | `aws-ap-south-1.region.tmprl.cloud` | -| `ap-south-2` | `com.amazonaws.vpce.ap-south-2.vpce-svc-08bcf602b646c69c1` | `aws-ap-south-2.region.tmprl.cloud` | -| `ap-southeast-1` | `com.amazonaws.vpce.ap-southeast-1.vpce-svc-05c24096fa89b0ccd` | `aws-ap-southeast-1.region.tmprl.cloud` | -| `ap-southeast-2` | `com.amazonaws.vpce.ap-southeast-2.vpce-svc-0634f9628e3c15b08` | `aws-ap-southeast-2.region.tmprl.cloud` | -| `ca-central-1` | `com.amazonaws.vpce.ca-central-1.vpce-svc-080a781925d0b1d9d` | `aws-ca-central-1.region.tmprl.cloud` | -| `eu-central-1` | `com.amazonaws.vpce.eu-central-1.vpce-svc-073a419b36663a0f3` | `aws-eu-central-1.region.tmprl.cloud` | -| `eu-west-1` | `com.amazonaws.vpce.eu-west-1.vpce-svc-04388e89f3479b739` | `aws-eu-west-1.region.tmprl.cloud` | -| `eu-west-2` | `com.amazonaws.vpce.eu-west-2.vpce-svc-0ac7f9f07e7fb5695` | `aws-eu-west-2.region.tmprl.cloud` | -| `sa-east-1` | `com.amazonaws.vpce.sa-east-1.vpce-svc-0ca67a102f3ce525a` | `aws-sa-east-1.region.tmprl.cloud` | -| `us-east-1` | `com.amazonaws.vpce.us-east-1.vpce-svc-0822256b6575ea37f` | `aws-us-east-1.region.tmprl.cloud` | -| `us-east-2` | `com.amazonaws.vpce.us-east-2.vpce-svc-01b8dccfc6660d9d4` | `aws-us-east-2.region.tmprl.cloud` | -| `us-west-2` | `com.amazonaws.vpce.us-west-2.vpce-svc-0f44b3d7302816b94` | `aws-us-west-2.region.tmprl.cloud` | +The 
replica will be deleted and your Namespace will no longer be highly available. +You will no longer be charged for this feature. -:::tip Learn more about multi-region Namespaces +:::note -If you have more questions or feedback about this feature, reach out to the product team. +After removing a replica, Temporal Cloud can't re-enable replication in the same region for a given Namespace for seven days. ::: - diff --git a/docs/production-deployment/cloud/high-availability/failovers.mdx b/docs/production-deployment/cloud/high-availability/failovers.mdx new file mode 100644 index 0000000000..e4e827eea9 --- /dev/null +++ b/docs/production-deployment/cloud/high-availability/failovers.mdx @@ -0,0 +1,73 @@ +--- +id: failovers +title: failovers +sidebar_label: Failovers +slug: /cloud/high-availability/failovers +description: Temporal Cloud's High-Availability Namespaces offer automated failover, synchronized data replication, and high availability for workloads requiring disaster-tolerant deployment and 99.99% uptime. Use Global Namespace for self-hosted. +tags: + - Temporal Cloud + - Production + - High availability +keywords: + - availability + - explanation + - failover + - high-availability + - multi-region + - multi-region namespace + - namespaces + - temporal-cloud + - term +--- +import { RelatedReadContainer, RelatedReadItem } from '@site/src/components/related-read/RelatedRead'; + +A failover shifts Workflow Execution processing from an active Temporal Namespace to a replicated Temporal Namespace during outages or other incidents. +Standby Namespace replicas duplicate data and prevent data loss during failover. + +## Triggering failovers {#triggering-failovers} + +Temporal automatically initiates failovers when an incident or outage affects a high availability Namespace. +You can also [trigger a failover](/cloud/high-availability/failovers#triggering-failovers) based on your own custom alerts and for testing purposes. 
+This section explains how to manually trigger a failover and what to expect afterward. + +:::warning Check Your Replication Lag + +Always check the [metric replication lag](/production-deployment/cloud/metrics/reference#temporal_cloud_v0_replication_lag_bucket) before initiating a failover. +A forced failover when there is a large replication lag has a higher likelihood of rolling back Workflow progress. + +::: + +For details on how Temporal detects conditions and triggers failovers automatically, see [the failover process](/cloud/high-availability/failovers). + +### Initiating manual failovers {#manual-failovers} + +You can trigger a failover manually using the Temporal Cloud Web UI or the `tcld` CLI, depending on your preference and setup. +The following table outlines the steps for each method: + +
**Need to update the CLI instructions**
+| Method | Instructions | +| ------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| **Temporal Cloud Web UI** | 1. Visit the [Namespace page](https://cloud.temporal.io/namespaces) on the Temporal Cloud Web UI.
2. Navigate to your Namespace details page and select the **Trigger a failover** option from the menu.
3. After confirmation, Temporal initiates the failover. | +| **Temporal `tcld` CLI** | To manually trigger a failover, run the following command in your terminal:
tcld namespace failover \
    --namespace \.\ \
    --region \
Temporal fails over the Namespace to the target region. High availability Namespaces using a single region will fail over to the standby isolation domain. | + +### Post-failover event information + +After any failover, whether triggered by you or by Temporal, event information appears in both the [Temporal Cloud Web UI](https://cloud.temporal.io/namespaces) (on the Namespace detail page) and in your audit logs. +The audit log entry for Failover uses the `"operation": "FailoverNamespace"` event. +After failover, the replica takes over and the Namespace becomes active in the new isolation domain or region. + +You don't need to monitor Temporal Cloud's failover response in real time. +Whenever there is a failover event, users with the Account Owner and Global Admin roles automatically receive an alert email. + +### Failbacks + +After Temporal-initiated failovers, Temporal Cloud shifts Workflow Execution processing back to the original region or isolation zone that was active before the incident once the incident is resolved. +This is called a "failback". + +## Disabling Temporal-initiated failovers + +When you add a replica to a Namespace, it becomes a high availability Namespace. +In the event of an incident or an outage, Temporal Cloud automatically fails over a high availability Namespace to its replica. +_This is the recommended and default option_. + +If you prefer to disable Temporal-initiated failovers and handle your own failovers, you can do so by navigating to the Namespace detail page in Temporal Cloud. Choose the "Disable Temporal-initiated failovers" option. 
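The "Post-failover event information" section above says failovers are recorded in the audit logs with `"operation": "FailoverNamespace"`. As a minimal sketch, assuming the audit log has been exported as one JSON object per line (an assumption, not a documented export format), the failover entries could be picked out like this. The function name `failover_events` and every field other than `operation` are hypothetical.

```python
import json
from typing import Iterable, Iterator

# Operation name taken from the text above; everything else here is illustrative.
FAILOVER_OPERATION = "FailoverNamespace"


def failover_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield parsed audit-log records whose operation is a Namespace failover.

    Assumes one JSON object per line. Blank lines and lines that are not
    valid JSON objects are skipped rather than treated as errors.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict) and record.get("operation") == FAILOVER_OPERATION:
            yield record
```

A filter like this could feed a custom alert in addition to the email that Account Owners and Global Admins already receive.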
diff --git a/docs/production-deployment/cloud/high-availability/faq.mdx b/docs/production-deployment/cloud/high-availability/faq.mdx index 85b75a94b4..b18afddbc6 100644 --- a/docs/production-deployment/cloud/high-availability/faq.mdx +++ b/docs/production-deployment/cloud/high-availability/faq.mdx @@ -21,32 +21,27 @@ keywords: --- import { RelatedReadContainer, RelatedReadItem } from '@site/src/components/related-read/RelatedRead'; -:::tip Support, stability, and dependency info - -High-availability Namespaces are in [Public Preview](/evaluate/development-production-features/release-stages#public-preview) for Temporal Cloud. - -::: -
**Repurposed material. No audits, updates, intros, re-org. Converted to markdown automatically from Google Doc rich text, so there are many errors**
Failovers -**Q: What is a failover****** +**Q: What is a failover?** A failover shifts Workflow Execution processing from an active Temporal Namespace to a standby Temporal Namespace during outages or other incidents. Standby Namespaces use replication to duplicate data and prevent data loss during failover. -**Q: What failover modes does Temporal use internally?****** +**Q: What failover modes does Temporal use internally?** Users cannot configure failover modes. The following descriptions explain Temporal Cloud’s internal failover system: - * ******[Graceful failover**](https://docs.temporal.io/cloud/multi-region#graceful-failover): Replication tasks are fully processed and drained before transferring control to the standby region. Temporal Cloud pauses traffic to the active Namespace before the failover, minimizing the rewind of progress and avoiding data conflicts. The Namespace experiences a short period of unavailability, defaulting to 10 seconds at most. Under most circumstances, the actual time the Namespace is unavailable is much, much shorter than that. + * **[Graceful failover](/cloud/high-availability/how-it-works#graceful-failover)**: Replication tasks are fully processed and drained before transferring control to the standby region. Temporal Cloud pauses traffic to the active Namespace before the failover, minimizing the rewind of progress and avoiding data conflicts. The Namespace experiences a short period of unavailability, defaulting to 10 seconds at most. Under most circumstances, the actual time the Namespace is unavailable is much, much shorter than that. During this period, existing Workflows stop progress. Temporal Cloud returns a "Service unavailable error". State transitions will not happen and tasks are not dispatched. User requests like start/signal Workflow will be rejected while operations are paused during handover. This mode favors _consistency_ over availability. 
- * ******[Forced failover**](https://docs.temporal.io/cloud/multi-region#forced-failover): In this mode, a Namespace immediately activates in the standby region. Events not replicated due to replication lag will undergo conflict resolution upon reaching the new active region. This mode prioritizes _availability_ over consistency. - * ******[Hybrid failover**](https://docs.temporal.io/cloud/multi-region#hybrid-failover) (Default mode): While graceful failovers are consistent, they aren’t always practical in certain circumstances such as when cells experience outages and/or a critical database is unavailable. Temporal Cloud’s hybrid failover mode limits an initial Graceful failover attempt to 10 seconds or less. If the graceful approach doesn’t resolve the issue, Temporal Cloud automatically switches to a forced failover. This strategy balances consistency and availability requirements. This strategy balances _consistency_ and _availability_ requirements. + * **[Forced failover](/cloud/high-availability/how-it-works#forced-failover)**: In this mode, a Namespace immediately activates in the standby region. Events not replicated due to replication lag will undergo conflict resolution upon reaching the new active region. This mode prioritizes _availability_ over consistency. + + * **[Hybrid failover](/cloud/high-availability/how-it-works#hybrid-failover)** (Default mode): While graceful failovers are consistent, they aren’t always practical in certain circumstances such as when cells experience outages and/or a critical database is unavailable. Temporal Cloud’s hybrid failover mode limits an initial graceful failover attempt to 10 seconds or less. If the graceful approach doesn’t resolve the issue, Temporal Cloud automatically switches to a forced failover. This strategy balances _consistency_ and _availability_ requirements. 
-**Q: What is the difference between a handover and a failover?****** +**Q: What is the difference between a handover and a failover?** They are essentially the same thing. It is the process of transferring control from the active to the standby region during outages or other incidents. @@ -59,11 +54,11 @@ They are essentially the same thing. It is the process of transferring control f -**Q: What situation triggers a graceful failover vs a forced one? Who or what triggers it? What are the differences in results?****** +**Q: What situation triggers a graceful failover vs a forced one? Who or what triggers it? What are the differences in results?** Users can initiate a failover, but they can’t control or configure the failover mode. The three failover modes are internal operations on the Temporal side. They are explained for user education. -**Q: Under what circumstances would I want to initiate a failover?****** +**Q: Under what circumstances would I want to initiate a failover?** Normally, we don't expect users to failover when there’s a problem related to Temporal Cloud. Please contact Temporal support if you feel you have a pressing need. You might consider initiating a failover under two circumstances: @@ -72,67 +67,67 @@ Normally, we don't expect users to failover when there’s a problem related to You can still choose to initiate failover if you have issues sourced from your side or your dependencies. -**Q: Under what circumstances does Temporal initiate failovers?****** +**Q: Under what circumstances does Temporal initiate failovers?** Temporal Cloud initiates failovers when there are incidents or outages in the cloud provider. This includes failures of databases, storage, etc. We trigger failovers any time we observe increased latencies or an increase in service errors that causes us to violate the SLA that is in our control. -**Q: Are there any other types of failover not listed above?****** +**Q: Are there any other types of failover not listed above?** No. 
There are only three types. -**Q: Can we control the hybrid failover timeout?****** +**Q: Can we control the hybrid failover timeout?** No. The timeout is not configurable outside of Temporal. -**Q: Can a failover get stuck? What is the maximum amount of time it can take?****** +**Q: Can a failover get stuck? What is the maximum amount of time it can take?** There is typically no way for failovers to get “stuck”. We follow the hybrid failover method where we try to do a smooth handoff. If that does not take place within 10 seconds, we initiate a “forced” failover. -**Q: Is the 10 seconds maximum unavailability window configurable?****** +**Q: Is the 10 seconds maximum unavailability window configurable?** No, it is not configurable by the user. Extending the wait time is unlikely to increase the chances of graceful failovers during extreme incidents such as when a source region is down. -**Q: How does the client detect that the failover has occurred?****** +**Q: How does the client detect that the failover has occurred?** We do not send real-time failover notifications. Users are notified via email and audit logs. -**Q: Can the customer determine the resolved failover region?****** +**Q: Can the customer determine the resolved failover region?** Users can determine the failover region from the Namespace endpoint’s CNAME (\.tmprl.cloud). Whenever Temporal Cloud triggers a failover from the Temporal side, we update the CNAME to point to the new active region. The CNAME points to a Temporal Cloud regional endpoint. For example, a Namespace active in aws us-east-1 points to aws-**us-east-1**.region.tmprl.cloud. Replication Lag and Latency -**Q: What affects replication latency? What can cause the replication latency to increase?****** +**Q: What affects replication latency? What can cause the replication latency to increase?** -Slowdowns in the standby cell, such as capacity issues or outages, can increase replication latency. 
Otherwise, it is typically a matter of seconds or even less and can be monitored through  [external metrics](https://docs.temporal.io/cloud/multi-region#metrics-operations). +Slowdowns in the standby cell, such as capacity issues or outages, can increase replication latency. Otherwise, it is typically a matter of seconds or even less and can be monitored through  [external metrics](/cloud/high-availability/best-practices#metrics). -**Q: Can Workflows execute events a second time in the standby Cluster due to replication lag?****** +**Q: Can Workflows execute events a second time in the standby Cluster due to replication lag?** -Yes. This is explained in the [conflict resolution](https://docs.temporal.io/cloud/multi-region#conflict-resolution) section of our documentation. +Yes. This is explained in the [conflict resolution](/cloud/high-availability/how-it-works#conflict-resolution) section of our documentation. -**Q: Is it possible to see whether a failover was graceful, forced, or hybrid?****** +**Q: Is it possible to see whether a failover was graceful, forced, or hybrid?** No, customers cannot normally view the method used. File a support ticket if there’s a specific need to review a process. -**Q: Is replication lag emitted as a metric?****** +**Q: Is replication lag emitted as a metric?** -Yes, replication lag is a [metric](https://docs.temporal.io/cloud/multi-region#metrics-operations) that we expose. +Yes, replication lag is a [metric](/cloud/high-availability/best-practices#metrics) that we expose. -**Q: Can we see replication information by Workflow type or ID?****** +**Q: Can we see replication information by Workflow type or ID?** _[Answer in progress]___ -**Q: Is the data replicated in order?****** +**Q: Is the data replicated in order?** For a single Workflow, events are replicated  in order. There's no ordering guarantee for replication of events between different Workflows. 
-**Q: What happens if both regions become active simultaneously?****** +**Q: What happens if both regions become active simultaneously?** This only happens when there's a network partition or delays in the Namespace replication queue. Normally, when cells can talk to each other, only one region will ever become active. If both regions have become active and both have active Workers, Workflows will run independently based on their local History. Workers fetch tasks from their assigned region. With global Worker setups, Workers fetch tasks from the ‘true’ active region as known by Temporal Cloud. Eventually, when the network partition heals, History is merged via conflict resolution and one side wins. -**Q: What if DNS is still updating during a network partition between Clusters? ****** +**Q: What if DNS is still updating during a network partition between Clusters? ** In this situation, the now passive Cluster can’t forward requests to the new active Cluster. However, DNS normally points to the correct active Cluster without forwarding. Workers configured to point to the standby Cluster can be reconfigured to point to the active Cluster. @@ -140,47 +135,47 @@ Conflict Resolution See [this Notion Page](https://www.notion.so/temporalio/Conflict-Resolution-Example-83e9dec0f8f246ee8584995ae2e408f4) for an example Conflict Resolution. -**Q: How are conflicts resolved?****** +**Q: How are conflicts resolved?** Each cell has a version number, which is used in Event History metadata. Failover operations increase that number. Events with the highest number win during conflict resolution. -**Q: What happens to Workflows if conflicts can’t be resolved?****** +**Q: What happens to Workflows if conflicts can’t be resolved?** This can only happen if there is a bug in the conflict resolution.  If there is a bug in conflict resolution, those events are placed in a dead letter queue to unblock replication. Temporal will resolve the issue and reapply the events. 
Customer impact is limited to the affected Workflows. The rest of the system continues as normal. -**Q: How is History affected if conflicts can’t be resolved?****** +**Q: How is History affected if conflicts can’t be resolved?** Same as above. -**Q: How do customers detect unresolved conflicts?****** +**Q: How do customers detect unresolved conflicts?** Unresolved conflicts are not made visible to customers. Temporal directs unresolvable conflicts (conflicts that require Temporal on-call intervention) into a dead letter queue and makes sure those conflicts are resolved and their events re-applied. -**Q: How do customers manually resolve conflicts?****** +**Q: How do customers manually resolve conflicts?** No manual resolution by customers is needed unless Temporal cannot handle a specific scenario. -**Q: Are non-selected event histories deleted during automatic conflict resolution? ****** +**Q: Are non-selected event histories deleted during automatic conflict resolution? ** No. They are hidden but not deleted. We do not expose access to non-selected events to customers. Data Loss -**Q: Under what circumstances would a Workflow Execution be unrecoverable if it was started but not replicated before failover?****** +**Q: Under what circumstances would a Workflow Execution be unrecoverable if it was started but not replicated before failover?** The normal time difference between the two operations is typically measured in single-digit seconds.  So this scenario can only happen if the Cluster is healthy enough to accept the Workflow start request and fails to replicate this event. This is very unlikely. If it did happen, the started Workflow is recovered after the Cluster is itself recovered. The only possibility of data loss would require that Temporal lose contact with the previously active cell after permanently completing an operation. 
Metrics and Observability -**Q: What information can be pulled from MRN metrics?****** +**Q: What information can be pulled from MRN metrics?** -This is [documented](https://docs.temporal.io/cloud/multi-region#metrics-operations). +This is [documented](/cloud/high-availability/best-practices#metrics). Always check metric replication lag before initiating a failover test or emergency failover. A forced failover when there is a large replication lag has a higher likelihood of rolling back Workflow progress. -**Q: What warning signs Signal that a failover may be arriving?****** +**Q: What warning signs Signal that a failover may be arriving?** You should always be prepared for failover. One could happen at any point in time. @@ -188,32 +183,32 @@ We notify customers when a failover occurs. There is no time lapse between disco Other -**Q: Can Signals be sent twice since multi-region doesn't provide at-most-once delivery?****** +**Q: Can Signals be sent twice since multi-region doesn't provide at-most-once delivery?** During conflict resolution, a Signal could be applied twice. -**Q: What happens if the active region is unavailable for an extended period and the standby region does not have the most recent Signal?**** ****** +**Q: What happens if the active region is unavailable for an extended period and the standby region does not have the most recent Signal?**** ** Workers cannot process the Signal as it won’t be present in any available region. -**Q: If the active region remains unavailable for an extended period, does the active role switch to the standby region? ****** +**Q: If the active region remains unavailable for an extended period, does the active role switch to the standby region? ** If Temporal Cloud initiated the failover, it will “fail back” to the original active region once the incident is fully resolved. Otherwise, the active role remains with the newly active (formerly standby) region. 
-**Q: ****[Is there a way to determine the region an event ran in via the UI?******](https://temporaltechnologies.slack.com/archives/C04V0LSU5S6/p1717092196761469) +**Q: ****[Is there a way to determine the region an event ran in via the UI?**](https://temporaltechnologies.slack.com/archives/C04V0LSU5S6/p1717092196761469) Not at the moment. -**Q: ****[Can we show branching in the UI?******](https://temporaltechnologies.slack.com/archives/C04V0LSU5S6/p1717171982044329) +**Q: ****[Can we show branching in the UI?**](https://temporaltechnologies.slack.com/archives/C04V0LSU5S6/p1717171982044329) Not at the moment. -**Q: What should customers worry about in terms of Signals and events synchronization?****** +**Q: What should customers worry about in terms of Signals and events synchronization?** Signals are cherry-picked during conflict resolution if there is replication lag and conflict. Workflows can theoretically revert multiple steps. -Customers should decide whether to add logic to handle this or manually fix affected Workflows if they believe the risk is low. Other known limitations have been [documented](https://docs.temporal.io/cloud/multi-region#architecture) around causality and so forth. +Customers should decide whether to add logic to handle this or manually fix affected Workflows if they believe the risk is low. Other known limitations have been [documented](/cloud/high-availability/how-it-works#architecture) around causality and so forth. -**Q: How much time does it take to reconcile data after an incident is resolved? ****** +**Q: How much time does it take to reconcile data after an incident is resolved? ** It depends on the distribution of Workflows. If evenly distributed, data can sync quickly. If concentrated in a single partition, it could take hours. Do not “fail back” your region (revert it to the original active region) until the data is fully reconciled and the other region has caught up. 
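The FAQ above states that the active region can be read from the Namespace endpoint's CNAME, which points at a regional endpoint such as `aws-us-east-1.region.tmprl.cloud`. The sketch below only parses such a CNAME target once you have it (for example, from a DNS lookup tool); it performs no DNS query itself. The helper name `parse_regional_target` and the `provider-region` split at the first hyphen are assumptions inferred from the examples in this patch, not a documented contract.

```python
from typing import NamedTuple, Optional

# Regional zone suffix as it appears in the examples in this patch.
REGIONAL_SUFFIX = ".region.tmprl.cloud"


class ActiveRegion(NamedTuple):
    provider: str  # e.g. "aws"
    region: str    # e.g. "us-east-1"


def parse_regional_target(cname_target: str) -> Optional[ActiveRegion]:
    """Split a regional CNAME target into cloud provider and region.

    Returns None when the target is not a single label under the
    regional zone, or the label has no "provider-region" shape.
    """
    # Normalize case and drop the trailing dot of a fully qualified name.
    host = cname_target.strip().rstrip(".").lower()
    if not host.endswith(REGIONAL_SUFFIX):
        return None
    label = host[: -len(REGIONAL_SUFFIX)]
    if "." in label:
        return None
    # Assumption: the provider is everything before the first hyphen.
    provider, sep, region = label.partition("-")
    if not sep or not provider or not region:
        return None
    return ActiveRegion(provider, region)
```

Comparing the parsed region before and after a failover test is one way to confirm that clients have converged on the newly active region.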
diff --git a/docs/production-deployment/cloud/high-availability/guarantees.mdx b/docs/production-deployment/cloud/high-availability/guarantees.mdx deleted file mode 100644 index f4361e7325..0000000000 --- a/docs/production-deployment/cloud/high-availability/guarantees.mdx +++ /dev/null @@ -1,61 +0,0 @@ ---- -id: guarantees -title: Guarantees and availability -sidebar_label: Guarantees and availability -slug: /cloud/high-availability/guarantees -description: Temporal Cloud's High-Availability Namespaces offer automated failover, synchronized data replication, and high availability for workloads requiring disaster-tolerant deployment and 99.99% uptime. Use Global Namespace for self-hosted. -tags: - - Temporal Cloud - - Production - - High availability -keywords: - - availability - - explanation - - failover - - high-availability - - multi-region - - multi-region namespace - - namespaces - - temporal-cloud - - term ---- - -:::tip Support, stability, and dependency info - -High-availability Namespaces are in [Public Preview](/evaluate/development-production-features/release-stages#public-preview) for Temporal Cloud. - -::: - -
**Maybe these could just be parts of the FAQ?**
- - -## Multi-region Namespace SLA {#sla} - -**What guarantees does Temporal offer for multi-region Namespaces?** - -Multi-region Namespaces offer 99.99% availability, enforced by Temporal Cloud's [service error rates SLA](https://docs.temporal.io/cloud/sla). -Our system is designed to limit data loss after recovery when the incident triggering the failover is resolved. - -Our recovery point objective ([RPO](https://en.wikipedia.org/wiki/Disaster_recovery#Recovery_Point_Objective)) is near-zero. -There may be a short period of time during an incident or forced failover when some data is unavailable in the standby region. -Some Workflow History data won't arrive until networks issue are fixed, enabling the History to finish replicating and the divergent History branches to reconcile. - -Temporal Cloud proactively responds to incidents by triggering failovers. -Our recovery time objective ([RTO](https://en.wikipedia.org/wiki/Disaster_recovery#Recovery_Time_Objective)) is 20 minutes or less per incident. - -:::info - -During a disaster scenario in which the data on the hard drives in the active region cannot be recovered, the duration of data loss may be as high as the [replication lag](/cloud/multi-region#replication-lag) at the time of disaster. - -::: - -### Regional availability {#regional-availability} - -Multi-region Namespaces are available in all existing [Temporal Cloud regions](/cloud/service-availability#regions). - -:::tip - -Namespace pairing is currently limited to regions within the same continent. -South America is excluded as only one region is available. 
- -::: diff --git a/docs/production-deployment/cloud/high-availability/how-it-works.mdx b/docs/production-deployment/cloud/high-availability/how-it-works.mdx index 9311b6794e..3f06c2306e 100644 --- a/docs/production-deployment/cloud/high-availability/how-it-works.mdx +++ b/docs/production-deployment/cloud/high-availability/how-it-works.mdx @@ -21,17 +21,9 @@ keywords: --- import { RelatedReadContainer, RelatedReadItem } from '@site/src/components/related-read/RelatedRead'; -:::tip Support, stability, and dependency info - -High-availability Namespaces are in [Public Preview](/evaluate/development-production-features/release-stages#public-preview) for Temporal Cloud. - -::: - -
**No audits, updates, intros, re-org**
- In traditional active/active replication, multiple nodes serve requests and accept writes simultaneously, ensuring strong synchronous data consistency. -In contrast, with a Temporal Cloud high-availability Namespace, only the active zone accepts requests and writes at any given time. -Workflow history events are written to the active zone first and then asynchronously replicated to the standby zone replica, ensuring that the replica remains in sync. +In contrast, with a Temporal Cloud high-availability Namespace, only the active Namespace accepts requests and writes at any given time. +Workflow history events are written to the active Namespace first and then asynchronously replicated to the standby replica, ensuring that the replica remains in sync.
**Needs new images**
@@ -39,22 +31,21 @@ Workflow history events are written to the active zone first and then asynchrono | :-------------------------------------------------------: | :-----------------------------------------------------: | | ![Before failover](/img/multi-region/before-failover.png) | ![After failover](/img/multi-region/after-failover.png) | -## Failovers {#failovers} +## The failover process {#failovers} -A failover shifts Workflow Execution processing from an active Temporal Namespace region to a standby Temporal Namespace region during outages or other incidents. -Standby Namespace regions use replication to duplicate data and prevent data loss during failover. +A failover shifts Workflow Execution processing from an active Temporal Namespace to a standby replica during outages or other incidents. +Standby replicas duplicate data and prevent data loss during failover. **What happens during the failover process?** -Temporal Cloud initiates a Namespace failover when it detects an incident or outage that raises error rates or latency in the active region of a multi-region Namespace. -The failover shifts Workflow processing to a standby region that isn’t affected by the incident. +Temporal Cloud initiates a Namespace failover when it detects an incident or outage that raises error rates or latency in the active region of a high availability Namespace. +The failover shifts Workflow processing to a replica that isn’t affected by the incident. This lets existing Workflows continue and new Workflows start while the incident is fixed. -Once the incident is resolved, Temporal Cloud performs a "failback" by shifting Workflow Execution processing back to the original region. - +Once the incident is resolved, Temporal Cloud performs a "failback" by shifting Workflow Execution processing back to the original Namespace. 
:::info -You can test the failover of your multi-region Namespace by manually [triggering a failover](/cloud/multi-region#triggering-failovers) using the UI page or the 'tcld' CLI utility. +You can test the failover of your high availability Namespace by manually [triggering a failover](/cloud/high-availability/failovers#triggering-failovers) using the UI page or the 'tcld' CLI utility. In most scenarios, we recommend you let Temporal handle failovers for you. ::: @@ -69,15 +60,15 @@ It automatically triggers failovers when these indicators exceed our allowed thr ### Replication lag {#replication-lag} -Multi-region Namespaces use asynchronous replication between regions. -Workflow updates in the active region, along with associated history events, are transmitted to the standby region with a short delay. +High availability Namespaces use asynchronous replication. +Workflow updates in the active Namespace, along with associated history events, are transmitted to the standby replica with a short delay. This delay is called the replication lag. Temporal Cloud strives to maintain a P95 replication delay of less than 1 minute. In this context, P95 means 95% of requests are processed faster than this specified limit. -Replication lags mean a [forced failover](/cloud/multi-region#forced-failover) may cause Workflows to rollback in progress. -Lags may also cause recently started Workflows to be temporarily unavailable until the active region recovers. -Temporal event versioning and [conflict resolution mechanisms](/cloud/multi-region#conflict-resolution) help guarantee that the Workflow Event History can be replayed. +Replication lags mean a [forced failover](/cloud/high-availability/how-it-works#forced-failover) may cause Workflows to rollback in progress. +Lags may also cause recently started Workflows to be temporarily unavailable until a Namespace recovers. 
+Temporal event versioning and [conflict resolution mechanisms](/cloud/high-availability/how-it-works#conflict-resolution) help guarantee that the Workflow Event History can be replayed. Critical operations like Signals won't get lost. ### Failover scenarios @@ -102,8 +93,8 @@ This mode favors _consistency_ over availability. #### Forced failover {#forced-failover} -In this mode, a Namespace immediately activates in the standby region. -Events not replicated due to [replication lag](/cloud/multi-region#replication-lag) will undergo [conflict resolution](/cloud/multi-region#conflict-resolution) upon reaching the new active region. +In this mode, a replica immediately activates in the standby Namespace. +Events not replicated due to [replication lag](/cloud/high-availability/best-practices#metrics) will undergo [conflict resolution](/cloud/high-availability/how-it-works#conflict-resolution) upon reaching the new active Namespace. This mode prioritizes _availability_ over consistency. @@ -116,32 +107,35 @@ Temporal Cloud returns a "Service unavailable error", which is retried by SDKs. If the graceful approach doesn’t resolve the issue, Temporal Cloud automatically switches to a forced failover. This strategy balances consistency and availability requirements. -See the sections on [triggering a failover](/cloud/multi-region#triggering-failovers), [Worker deployment](/cloud/multi-region#worker-deployment), and [routing](/cloud/multi-region#routing) for more information. +See the sections on [triggering a failover](/cloud/high-availability/failovers/#triggering-failovers), [Worker deployment](/cloud/high-availability/best-practices/#worker-deployment), and [routing](/cloud/high-availability/best-practices#routing) for more information. ## Architecture {#architecture} -**How do multi-region Namespaces work?** +**How do high availability Namespaces work?** -Multi-region Namespaces replicate Namespace metadata and Workflow Executions across connected regions. 
+High availability Namespaces replicate Namespace metadata and Workflow Executions across connected Namespaces. This redundancy, plus the added failover capability, provides measurable stability when dealing with outages. -A multi-region Namespace is normally active in a single region at any moment. -The passive region assumes a standby role. +A high availability Namespace is normally active in a single isolation domain at any moment. +The passive replica assumes a standby role. An exception to this only occurs in the event of a network partition. -In this case, you may elect to promote a standby region to active status. +In this case, you may elect to promote a standby isolation domain to active status. Caution: this action will temporarily result in both regions being active. -Once the network partition resolves and communication between the regions is restored, a conflict resolution algorithm determines which region continues as the active one. -This ensures only one region remains active. +Once the network partition resolves and communication between the isolation domains/regions is restored, a conflict resolution algorithm determines which region continues as the active one. +This ensures only one Namespace remains active. ### Metadata replication {#metadata-replication} -Updates to multi-region Namespace records automatically replicate across regions. +Updates to high availability Namespace records automatically duplicate to their replica. This metadata includes configurations such as retention periods, Search Attributes, and other settings. -Temporal Cloud ensures that all regions will eventually share a consistent and unified view of the Namespace metadata. +Temporal Cloud ensures that all isolation domains and regions will eventually share a consistent and unified view of the Namespace metadata. + + +
**Needs correct field name**
:::info -A Namespace failover, which changes the "active region" field of a Namespace record, is an update. +A Namespace failover, which changes the identifier for the active element field of a Namespace record, is an update. This update is replicated via the Namespace metadata mechanism. ::: @@ -150,36 +144,37 @@ This update is replicated via the Namespace metadata mechanism. Temporal Cloud restricts certain Workflow operations to the active region: -- You may only update Workflows in the active region. -- You may only dispatch Workflow Tasks and Activity Tasks from the active region. Forward progress in a Workflow Execution can therefore only be made in the active region. +- You may only update Workflows in the active Namespace. +- You may only dispatch Workflow Tasks and Activity Tasks from the active Namespace. + Forward progress in a Workflow Execution can therefore only be made in the active Namespace. -These limits mean that certain requests, such as Start Workflow and Signal Workflow, are processed by and limited to the active region. -Standby regions may receive API requests from Clients and Workers. +These limits mean that certain requests, such as Start Workflow and Signal Workflow, are processed by and limited to the active Namespace. +Standby replicas may receive API requests from Clients and Workers. They automatically forward these requests to the active Namespace for execution. -Multi-region Namespaces provide an “all-active” experience for Temporal users. +High availability Namespaces provide an “all-active” experience for Temporal users. This helps limit or eliminate downtime during Namespace failover. -There's a short time window from when a standby region becomes the active region to when Clients and Workers receive a DNS update. -During this time requests forward from the now passive (formerly active) region to the newly active (formerly standby) region. 
+There's a short time window from when a standby replica becomes the active Namespace to when Clients and Workers receive a DNS update. +During this time requests forward from the now passive (formerly active) replica Namespace to the newly active (formerly standby replica) Namespace. -As Workflow Executions progress and are operated on, replication tasks created in the active region are dispatched to the standby region. -Processing these replication tasks ensures that the standby region undergoes the same state transitions as the active region. +As Workflow Executions progress and are operated on, replication tasks created in the active Namespace are dispatched to the standby replica. +Processing these replication tasks ensures that the standby replica undergoes the same state transitions as the active Namespace. This enables replicated tasks to synchronize and achieve the same state as the original tasks. -Standby regions do not distribute Workflow or Activity Tasks. +Standby replicas do not distribute Workflow or Activity Tasks. Instead, they perform verification tasks to confirm that intended operations are executed so Workflows reach the desired state. This mechanism ensures consistency and reliability in the replication process across Temporal regions. ### Conflict Resolution {#conflict-resolution} -Multi-region Namespaces rely on asynchronous event replication across Temporal regions. -In the event of a non-graceful failover, replication lag may result in a temporary setback in workflow progress. +High availability Namespaces rely on asynchronous event replication across Temporal isolation domains and regions. +In the event of a non-graceful failover across regions, replication lag may result in a temporary setback in Workflow progress. -Single-region Namespaces can be configured to provide _at-most-once_ semantics for Activities execution (when [Maximum Attempts](https://docs.temporal.io/retry-policies#maximum-attempts) is set to 0). 
-Multi-region Namespaces provide _at-least-once_ semantics for execution of Activities. +Namespaces that do not participate in high availability can be configured to provide _at-most-once_ semantics for Activities execution (when [Maximum Attempts](https://docs.temporal.io/retry-policies#maximum-attempts) is set to 0). +High availability Namespaces provide _at-least-once_ semantics for execution of Activities. Completed Activities _may_ be re-dispatched in a newly active region, leading to repeated executions. -When a Workflow Execution is updated in a new region following a failover, events from the previously active region that arrive after the failover can't be directly applied. +When a Workflow Execution is updated in a new Namespace following a failover, events from the previously active Namespace that arrive after the failover can't be directly applied. At this point, Temporal Cloud has forked the Workflow History. After failover, Temporal Cloud creates a new branch history for execution, and begins its conflict resolution process. diff --git a/docs/production-deployment/cloud/high-availability/index.mdx b/docs/production-deployment/cloud/high-availability/index.mdx index 7fda043014..dd5a84c8dc 100644 --- a/docs/production-deployment/cloud/high-availability/index.mdx +++ b/docs/production-deployment/cloud/high-availability/index.mdx @@ -22,18 +22,47 @@ keywords: import { RelatedReadContainer, RelatedReadItem } from '@site/src/components/related-read/RelatedRead'; -:::tip Support, stability, and dependency info +Temporal Cloud's replicated Namespaces provide disaster-tolerant deployment for workloads where availability is critical to your operations. +When you enable high availability, Temporal Cloud automatically synchronizes your data between a primary and a fallback Namespace, keeping them in sync. +Should an incident occur, Temporal will [failover](/glossary#failover) your Namespace. 
+This allows your Workflow Executions and Schedules to seamlessly shift from the active availability zone to the synchronized replica in the fallback availability zone. + +Advantages of using Temporal Cloud’s High Availability features: + +- No manual deployment or configuration needed, just simple push-button operations. +- Existing Workflows resume seamlessly in the replica with minimal interruption and data loss. +- No changes needed for Worker and Workflow code during setup or failover. +- 99.99% contractual SLA. + +## High availability options + +Temporal currently offers the following High Availability features, which you configure at a Namespace level: -High-availability Namespaces are in [Public Preview](/evaluate/development-production-features/release-stages#public-preview) for Temporal Cloud. +- **Replication**: + Workflows are seamlessly replicated to a different isolation domain within the same region as the Namespace, such as "us-east-1". + Choose this option for applications architected for a single-region. + You will failover within the same region to a separate isolation domain. +- **Multi-region Replication**: + Workflows are seamlessly replicated to a different region that you choose. + Choose this option when your business requires multi-regional availability and the higher-level of resilience that separated locations offers. + You will failover from one region to a separate region. + +:::note + +Please note that replication charges apply when enabling High Availability features. +For pricing details, visit Temporal Cloud's [Pricing](/cloud/pricing) page. ::: -
**General note, we should be using "Replica" vs "Region", and probably note "Zone". Discuss. "Replica" is the corpus of Workflows. High Availability is a _capability_. Features are "replication" and "multi-region". Features achieve HA workflows in your applications. See messaging guide.**
+## Replication and replicas + + +High Availability features in Temporal Cloud simplify deployment, ensuring operational continuity and data integrity even during unexpected events impacting an availability zone (AZ) or a region using a process called Replication. +Replication asynchronously replicates Workflow Executions from an active region to its replica. +Using Temporal Cloud’s High Availability features, you can create a replica in the same region or in a different region. +In the event of network service or performance issues in the active region, your replica is ready to take over. +Temporal Cloud smoothly transitions control from the active to the replica via a "failover". -Temporal Cloud's high-availability Namespaces provide disaster-tolerant deployment for workloads where availability is critical to your operations. -When you enable high availability, Temporal Cloud automatically synchronizes your data between a primary and a fallback Namespace, keeping them in sync. -Should an incident occur, Temporal will [failover](/glossary#failover) your Namespace. -This allows your Workflow Executions and Schedules to seamlessly shift from the active availability zone to the replica in the fallback availability zone. ## Availability zones and replicas @@ -41,9 +70,9 @@ An availability zone is a physically isolated data center within a deployment re Regions consist of multiple availability zones, providing redundancy and fault tolerance. In some cases, the fallback zone may be in the same region as the primary zone, or it may be in a different region altogether, depending on your deployment configuration. -High-availability simplifies deployment, ensuring operational continuity and data integrity even during unexpected events. +High availability simplifies deployment, ensuring operational continuity and data integrity even during unexpected events. Regional disruptions or other issues that affect the data centers within a specific availability zone may occur. 
-High-availability allows processing to shift from the affected zone to an already-synchronized fallback zone. +High availability allows processing to shift from the affected zone to an already-synchronized fallback zone. This synchronized zone is called a "**replica**." The process of duplicating all Workflow data ensures that your replica, which serves as the standby region, is always available and ready to take on the active role. @@ -51,13 +80,13 @@ The process of duplicating all Workflow data ensures that your replica, which se In the event of network service or performance issues in the active zone, your replica is ready to take over. When necessary, Temporal Cloud smoothly transitions control from the active to the standby zone using a process called "[failover](/glossary#failover)". -## High-availability and business continuity {#high-availability-intro} +## High availability and business continuity {#high-availability-intro} -For many organizations, ensuring high-availability is critical to maintaining business continuity. -Temporal Cloud's high-availability Namespace feature includes a 99.99% contractual Service Level Agreement ([SLA](https://docs.temporal.io/cloud/sla)). +For many organizations, ensuring high availability is critical to maintaining business continuity. +Temporal Cloud's high availability Namespace feature includes a 99.99% contractual Service Level Agreement ([SLA](https://docs.temporal.io/cloud/sla)). It provides 99.99% availability and 99.99% guarantee against service errors. -A high-availability Namespace (HAN) creates a single logical Namespace that operates across two physical zones: one active and one standby. +A high availability Namespace (HAN) creates a single logical Namespace that operates across two physical zones: one active and one standby. HANs streamline access for both zones to a unified Namespace endpoint. 
As Workflows progress in the active zone, history events are asynchronously replicated to the standby zone, ensuring continuity and data integrity. @@ -67,9 +96,9 @@ Once failover occurs, the roles of the active and standby zones switch. The standby zone becomes active, and the previous active zone becomes the standby. After the issue is resolved, the zone "fails back" from the replica to the original. -## Types of high-availability +## Types of high availability -Temporal currently offers the following high-availability options, which you select when upgrading your Namespace to use high-availability: +Temporal currently offers the following high availability options, which you select when upgrading your Namespace to use high availability: - **In-region replication** - Data is replicated to a separate zone in the same availability region, such as "us-east-1". @@ -79,35 +108,50 @@ Temporal currently offers the following high-availability options, which you sel This option offers the greatest protection against weather events and other possible external causes for regional outages, as the regions are physically separated by large distances. Failover may experience some minor latency. -
**Did we want to go into further detail about lag and conflict resolution? if so, where?**
- -
**Would be very nice to have more compelling explanations here**
- :::tip As Namespace pairing is currently limited to regions within the same continent, South America is excluded as only one region is available. ::: -## Should you choose high-availability? +## Should you choose high availability? -Should you be using high-availability Namespaces? It depends on your availability requirements: +Should you be using high availability Namespaces? It depends on your availability requirements: -- High-availability Namespaces offer a 99.99% contractual SLA for workloads with strict high-availability needs. +- High availability Namespaces offer a 99.99% contractual SLA for workloads with strict high availability needs. HANs use two Namespaces in two deployment zones to support standby recovery. In the event of a zone failure, Temporal Cloud automatically fails over the HAN Namespace to the standby replica. - Single-zone Namespaces include a 99.9% contractual Service Level Agreement ([SLA](/cloud/sla)). In single-zone use, Temporal clients connect to a single Namespace in one deployment zone. For many applications, this offers sufficient availability. -Temporal Cloud provides 99.99% service availability for all Namespaces, both single-region and high-availability. +Temporal Cloud provides 99.99% service availability for all Namespaces, both single-region and high availability. -## Summary +## SLA guarantees {#sla} -The following list reviews the advantages of high-availability Namespaces: +High availability Namespaces offer 99.99% availability, enforced by Temporal Cloud's [service error rates SLA](https://docs.temporal.io/cloud/sla). +Our system is designed to limit data loss after recovery when the incident triggering the failover is resolved. -- No manual deployment or configuration required—just simple push-button operation. -- Open Workflows continue in the standby zone with minimal interruption and data loss. -- No changes needed for Worker or Workflow code during setup or failover. -- 99.99% contractual SLA. 
+Our recovery point objective ([RPO](https://en.wikipedia.org/wiki/Disaster_recovery#Recovery_Point_Objective)) is near-zero. +There may be a short period of time during an incident or forced failover when some data is unavailable in the standby region. +Some Workflow History data won't arrive until network issues are fixed, enabling the History to finish replicating and the divergent History branches to reconcile. + +Temporal Cloud proactively responds to incidents by triggering failovers. +Our recovery time objective ([RTO](https://en.wikipedia.org/wiki/Disaster_recovery#Recovery_Time_Objective)) is 20 minutes or less per incident. + +:::info + +During a disaster scenario in which the data on the hard drives in the active region cannot be recovered, the duration of data loss may be as high as the [replication lag](/cloud/high-availability/best-practices#metrics) at the time of disaster. + +::: +## Regional availability {#regional-availability} + +Multi-region Namespaces are available in all existing [Temporal Cloud regions](/cloud/service-availability#regions). + +:::tip + +Namespace pairing is currently limited to regions within the same continent. +South America is excluded as only one region is available. + +::: diff --git a/docs/production-deployment/cloud/high-availability/monitor.mdx b/docs/production-deployment/cloud/high-availability/monitor.mdx deleted file mode 100644 index 4f3211dabb..0000000000 --- a/docs/production-deployment/cloud/high-availability/monitor.mdx +++ /dev/null @@ -1,81 +0,0 @@ ---- -id: monitor -title: Monitor and observe -sidebar_label: Monitor and observe -slug: /cloud/high-availability/operations -description: Temporal Cloud's High-Availability Namespaces offer automated failover, synchronized data replication, and high availability for workloads requiring disaster-tolerant deployment and 99.99% uptime. Use Global Namespace for self-hosted. 
-tags: - - Temporal Cloud - - Production - - High availability -keywords: - - availability - - explanation - - failover - - high-availability - - multi-region - - multi-region namespace - - namespaces - - temporal-cloud - - term ---- -import { RelatedReadContainer, RelatedReadItem } from '@site/src/components/related-read/RelatedRead'; - -:::tip Support, stability, and dependency info - -High-availability Namespaces are in [Public Preview](/evaluate/development-production-features/release-stages#public-preview) for Temporal Cloud. - -::: - -
**No audits, updates, intros, re-org. Add information about unhealth vs health, and trigger failover button disabled, opting out of temporal-initiated failovers, "Unhealthy replica error". After moving things to enable, this seems really light on content**
- -How do you trigger failovers and observe Workflow Executions? -This section provides how-to instructions for the following operations tasks: - -- [Triggering failovers](/cloud/multi-region#triggering-failovers) -- [Metrics](/cloud/multi-region#metrics-operations) -- [Monitoring and observability](/cloud/multi-region#observability) - -### Metrics {#metrics-operations} - -Replication lag refers to the transmission delay of Workflow updates and history events from the active region to the standby region. -A forced failover when there is a large replication lag has a higher likelihood of rolling back Workflow progress, so always check the metric replication lag before initiating a failover. -Temporal Cloud emits three replication lag-specific [metrics](/production-deployment/cloud/metrics/reference#temporal_cloud_v0_replication_lag_bucket). -The following samples demonstrate how you can use these metrics to explore replication lag. - -**P99 replication lag histogram** - -``` -histogram_quantile(0.99, sum(rate(temporal_cloud_v0_replication_lag_bucket[$__rate_interval])) by (temporal_namespace, le)) -``` - -**Average replication lag** - -``` -sum(rate(temporal_cloud_v0_replication_lag_sum[$__rate_interval])) by (temporal_namespace) -/ -sum(rate(temporal_cloud_v0_replication_lag_count[$__rate_interval])) by (temporal_namespace) -``` - -### Monitoring and observability {#observability} - -You can view and alert on key cloud metrics using the Web UI, the 'tcld' CLI utility, and Temporal Cloud APIs. -For example, during the process of adding a region to a Namespace, you can see the progress of Workflow replication. -Errors -- if any occur -- will also surface in the Namespace Web UI. - -:::info - -You may notice that multi-region Namespace shows twice (2x) the Action count in `temporal_cloud_v0_total_action_count`. -This doubling happens due to regional replication. 
- -::: - -### Auditing operational events {#auditing} - -Temporal Cloud provides several ways to audit events: - -- When Temporal triggers failovers, the audit log updates with details. - Look specifically for `"operation": "FailoverNamespace"` in the logs. -- You can set alerts for Temporal-initiated failover events. -- After a failover, you can check that the Namespace is active in the new region using the Temporal Cloud Web UI. - diff --git a/sidebars.js b/sidebars.js index cecd9b1ff5..e79006cc83 100644 --- a/sidebars.js +++ b/sidebars.js @@ -348,9 +348,9 @@ module.exports = { }, items: [ "production-deployment/cloud/high-availability/enable", + "production-deployment/cloud/high-availability/failovers", "production-deployment/cloud/high-availability/how-it-works", - "production-deployment/cloud/high-availability/guarantees", - "production-deployment/cloud/high-availability/monitor", + "production-deployment/cloud/high-availability/best-practices", "production-deployment/cloud/high-availability/faq", ], },