diff --git a/docs/production-deployment/cloud/high-availability/enable.mdx b/docs/production-deployment/cloud/high-availability/enable.mdx new file mode 100644 index 0000000000..c3952dacde --- /dev/null +++ b/docs/production-deployment/cloud/high-availability/enable.mdx @@ -0,0 +1,86 @@ +--- +id: enable +title: Enable high availability +sidebar_label: Enable high availability +slug: /cloud/high-availability/choosing-high-availability +description: Temporal Cloud's High-Availability Namespaces offer automated failover, synchronized data replication, and high availability for workloads requiring disaster-tolerant deployment and 99.99% uptime. Use Global Namespace for self-hosted. +tags: + - Temporal Cloud + - Production + - High availability +keywords: + - availability + - explanation + - failover + - high-availability + - multi-region + - multi-region namespace + - namespaces + - temporal-cloud + - term +--- +import { RelatedReadContainer, RelatedReadItem } from '@site/src/components/related-read/RelatedRead'; + +:::tip Support, stability, and dependency info + +High-availability Namespaces are in [Public Preview](/evaluate/development-production-features/release-stages#public-preview) for Temporal Cloud. + +::: + +
**Some audits, updates. Needs intros, re-org**
+ +You can enable the high-availability Namespace feature for your existing Namespace by [adding a second zone](#add-zones) to your Namespace. +After adding the second zone, Temporal Cloud begins data replication for your new standby replica. +Temporal Cloud notifies you once the replication has caught up and both Namespace zones are in sync. + +**Advantages of using a high-availability Namespace:** + +- No manual deployment or configuration needed, just simple push-button operation. +- Open Workflows continue in the standby region with minimal interruption and data loss. +- No changes needed for Worker and Workflow code during setup or failover. +- 99.99% Contractual SLA. + +## Upgrade an existing single-zone Namespace for high-availability functionality {#add-zones} + +You can upgrade an existing single-zone Namespace for high availability by adding a standby zone. +The following sections show you how. + +
**The following material has not been audited for MRN/HAN**
+ +#### Temporal Cloud Web UI + +To upgrade an existing Namespace to a multi-region Namespace: + +1. Visit Temporal Cloud [Namespaces](https://cloud.temporal.io/namespaces) in your web browser. +1. Navigate to the Namespace details page. +1. Select the “Add a region” button. +1. Select the standby region you want to add to this Namespace. + +You will see an estimated time for replication. +This time is based on your selection and the size and scale of Workflows in your Namespace. +An email alert is sent once your multi-region Namespace is ready for use. + +#### Temporal 'tcld' CLI + +At the command line, enter: + +``` +tcld namespace add-region \ + --namespace . \ + --region +``` + +Specify the region code for the new region to add. +Before pressing return, add your authentication credentials. For example, `--ca-certificate-file `. +An email alert is sent once your multi-region Namespace is ready for use. + +### Discontinuing multi-region availability {#discontinuing} + +Disabling multi-region removes the high availability and automatic failover features that provide Temporal's highest service level agreement. +To disable the feature and end charges, users must contact [Temporal Support](https://support.temporal.io) directly. +MRN-specific charges for replication will stop once this decommissioning procedure completes. + +- When making your request, you must let us know which region you want the Namespace to land in after removing the standby region. +- If you cease services in the middle of the month, your Namespace will be converted to a single-region Namespace within 1 business day. +- Temporal won't retain replicated data in the standby region once multi-region has been disabled. +- After disabling multi-region, Temporal Cloud cannot re-enable the feature for a given Namespace for seven days. 
diff --git a/docs/production-deployment/cloud/high-availability/faq.mdx b/docs/production-deployment/cloud/high-availability/faq.mdx new file mode 100644 index 0000000000..e860c88cc8 --- /dev/null +++ b/docs/production-deployment/cloud/high-availability/faq.mdx @@ -0,0 +1,219 @@ +--- +id: faq +title: Frequently Asked Questions +sidebar_label: Frequently Asked Questions +slug: /cloud/high-availability/faq +description: Temporal Cloud's High-Availability Namespaces offer automated failover, synchronized data replication, and high availability for workloads requiring disaster-tolerant deployment and 99.99% uptime. Use Global Namespace for self-hosted. +tags: + - Temporal Cloud + - Production + - High availability +keywords: + - availability + - explanation + - failover + - high-availability + - multi-region + - multi-region namespace + - namespaces + - temporal-cloud + - term +--- +import { RelatedReadContainer, RelatedReadItem } from '@site/src/components/related-read/RelatedRead'; + +:::tip Support, stability, and dependency info + +High-availability Namespaces are in [Public Preview](/evaluate/development-production-features/release-stages#public-preview) for Temporal Cloud. + +::: + +
**Repurposed material. No audits, updates, intros, re-org**
+ +Failovers + +**Q: What is a failover?** + +A failover shifts Workflow Execution processing from an active Temporal Namespace to a standby Temporal Namespace during outages or other incidents. Standby Namespaces use replication to duplicate data and prevent data loss during failover. + +**Q: What failover modes does Temporal use internally?** + +Users cannot configure failover modes. The following descriptions explain Temporal Cloud’s internal failover system: + + * **[Graceful failover](https://docs.temporal.io/cloud/multi-region#graceful-failover)**: Replication tasks are fully processed and drained before transferring control to the standby region. Temporal Cloud pauses traffic to the active Namespace before the failover, minimizing the rewind of progress and avoiding data conflicts. The Namespace experiences a short period of unavailability, which defaults to 10 seconds at most. Under most circumstances, the actual time the Namespace is unavailable is much shorter than that. + +During this period, existing Workflows stop making progress. Temporal Cloud returns a "Service unavailable" error. State transitions will not happen and tasks are not dispatched. User requests like start/signal Workflow will be rejected while operations are paused during handover. This mode favors _consistency_ over availability. + + * **[Forced failover](https://docs.temporal.io/cloud/multi-region#forced-failover)**: In this mode, a Namespace immediately activates in the standby region. Events not replicated due to replication lag will undergo conflict resolution upon reaching the new active region. This mode prioritizes _availability_ over consistency. + * **[Hybrid failover](https://docs.temporal.io/cloud/multi-region#hybrid-failover)** (default mode): While graceful failovers are consistent, they aren’t always practical, such as when cells experience outages or a critical database is unavailable. 
Temporal Cloud’s hybrid failover mode limits an initial graceful failover attempt to 10 seconds or less. If the graceful approach doesn’t resolve the issue, Temporal Cloud automatically switches to a forced failover. This strategy balances _consistency_ and _availability_ requirements. + +**Q: What is the difference between a handover and a failover?** + +They are essentially the same thing: the process of transferring control from the active region to the standby region during outages or other incidents. + +**Q: What situation triggers a graceful failover versus a forced one? Who or what triggers it? What are the differences in results?** + +Users can initiate a failover, but they can’t control or configure the failover mode. The three failover modes are internal operations on the Temporal side. They are explained for user education. + +**Q: Under what circumstances would I want to initiate a failover?** + +Normally, we don't expect users to fail over when there’s a problem related to Temporal Cloud. Please contact Temporal support if you feel you have a pressing need. You might consider initiating a failover under two circumstances: + + * The “break glass” (fire alarm) scenario where Temporal hasn’t responded to an outage or is unaware of an outage in the customer environment. + * To test the failover functionality and ensure it works properly. + +You can still choose to initiate a failover if you have issues that originate on your side or in your dependencies. + +**Q: Under what circumstances does Temporal initiate failovers?** + +Temporal Cloud initiates failovers when there are incidents or outages at the cloud provider. This includes failures of databases, storage, etc. We trigger failovers any time we observe increased latencies or an increase in service errors that causes us to violate the SLA that is in our control. 
+ +**Q: Are there any other types of failover not listed above?** + +No. There are only three types. + +**Q: Can we control the hybrid failover timeout?** + +No. The timeout is not configurable outside of Temporal. + +**Q: Can a failover get stuck? What is the maximum amount of time it can take?** + +There is typically no way for failovers to get “stuck”. We follow the hybrid failover method where we try to do a smooth handoff. If that does not take place within 10 seconds, we initiate a “forced” failover. + +**Q: Is the 10-second maximum unavailability window configurable?** + +No, it is not configurable by the user. Extending the wait time is unlikely to increase the chances of graceful failovers during extreme incidents such as when a source region is down. + +**Q: How does the client detect that the failover has occurred?** + +We do not send real-time failover notifications. Users are notified via email and audit logs. + +**Q: Can the customer determine the resolved failover region?** + +Users can determine the failover region from the Namespace endpoint’s CNAME (\.tmprl.cloud). Whenever Temporal Cloud triggers a failover from the Temporal side, we update the CNAME to point to the new active region. The CNAME points to a Temporal Cloud regional endpoint. For example, a Namespace active in aws us-east-1 points to aws-**us-east-1**.region.tmprl.cloud. + +Replication Lag and Latency + +**Q: What affects replication latency? What can cause the replication latency to increase?** + +Slowdowns in the standby cell, such as capacity issues or outages, can increase replication latency. Otherwise, it is typically a matter of seconds or even less and can be monitored through [external metrics](https://docs.temporal.io/cloud/multi-region#metrics-operations). + +**Q: Can Workflows execute events a second time in the standby Cluster due to replication lag?** + +Yes. 
This is explained in the [conflict resolution](https://docs.temporal.io/cloud/multi-region#conflict-resolution) section of our documentation. + +**Q: Is it possible to see whether a failover was graceful, forced, or hybrid?** + +No, customers cannot normally view the method used. File a support ticket if there’s a specific need to review a process. + +**Q: Is replication lag emitted as a metric?** + +Yes, replication lag is a [metric](https://docs.temporal.io/cloud/multi-region#metrics-operations) that we expose. + +**Q: Can we see replication information by Workflow type or ID?** + +_[Answer in progress]_ + +**Q: Is the data replicated in order?** + +For a single Workflow, events are replicated in order. There's no ordering guarantee for replication of events between different Workflows. + +**Q: What happens if both regions become active simultaneously?** + +This only happens when there's a network partition or delays in the Namespace replication queue. Normally, when cells can talk to each other, only one region will ever become active. + +If both regions have become active and both have active Workers, Workflows will run independently based on their local History. Workers fetch tasks from their assigned region. With global Worker setups, Workers fetch tasks from the ‘true’ active region as known by Temporal Cloud. Eventually, when the network partition heals, History is merged via conflict resolution and one side wins. + +**Q: What if DNS is still updating during a network partition between Clusters?** + +In this situation, the now-passive Cluster can’t forward requests to the new active Cluster. However, DNS normally points to the correct active Cluster without forwarding. Workers configured to point to the standby Cluster can be reconfigured to point to the active Cluster. 
+ +Conflict Resolution + +See [this Notion Page](https://www.notion.so/temporalio/Conflict-Resolution-Example-83e9dec0f8f246ee8584995ae2e408f4) for an example of conflict resolution. + +**Q: How are conflicts resolved?** + +Each cell has a version number, which is used in Event History metadata. Failover operations increase that number. Events with the highest number win during conflict resolution. + +**Q: What happens to Workflows if conflicts can’t be resolved?** + +This can only happen if there is a bug in conflict resolution. In that case, the affected events are placed in a dead letter queue to unblock replication. Temporal will resolve the issue and reapply the events. + +Customer impact is limited to the affected Workflows. The rest of the system continues as normal. + +**Q: How is History affected if conflicts can’t be resolved?** + +Same as above. + +**Q: How do customers detect unresolved conflicts?** + +Unresolved conflicts are not made visible to customers. Temporal directs unresolvable conflicts (conflicts that require Temporal on-call intervention) into a dead letter queue and makes sure those conflicts are resolved and their events reapplied. + +**Q: How do customers manually resolve conflicts?** + +No manual resolution by customers is needed unless Temporal cannot handle a specific scenario. + +**Q: Are non-selected event histories deleted during automatic conflict resolution?** + +No. They are hidden but not deleted. We do not expose access to non-selected events to customers. + +Data Loss + +**Q: Under what circumstances would a Workflow Execution be unrecoverable if it was started but not replicated before failover?** + +The time difference between the two operations is typically measured in single-digit seconds. So this scenario can only happen if the Cluster is healthy enough to accept the Workflow start request but fails to replicate the event. This is very unlikely. 
If it did happen, the started Workflow would be recovered after the Cluster itself recovers. The only possibility of data loss would require that Temporal lose contact with the previously active cell after permanently completing an operation. + +Metrics and Observability + +**Q: What information can be pulled from MRN metrics?** + +This is [documented](https://docs.temporal.io/cloud/multi-region#metrics-operations). + +Always check the replication lag metric before initiating a failover test or emergency failover. A forced failover when there is a large replication lag has a higher likelihood of rolling back Workflow progress. + +**Q: What warning signs signal that a failover may be imminent?** + +You should always be prepared for failover. One could happen at any point in time. + +We notify customers when a failover occurs. There is no time lapse between discovering failover prerequisites and the failover itself. + +Other + +**Q: Can Signals be sent twice since multi-region doesn't provide at-most-once delivery?** + +During conflict resolution, a Signal could be applied twice. + +**Q: What happens if the active region is unavailable for an extended period and the standby region does not have the most recent Signal?** + +Workers cannot process the Signal as it won’t be present in any available region. + +**Q: If the active region remains unavailable for an extended period, does the active role switch to the standby region?** + +If Temporal Cloud initiated the failover, it will “fail back” to the original active region once the incident is fully resolved. Otherwise, the active role remains with the newly active (formerly standby) region. + +**Q: [Is there a way to determine the region an event ran in via the UI?](https://temporaltechnologies.slack.com/archives/C04V0LSU5S6/p1717092196761469)** + +Not at the moment. 
+ +**Q: [Can we show branching in the UI?](https://temporaltechnologies.slack.com/archives/C04V0LSU5S6/p1717171982044329)** + +Not at the moment. + +**Q: What should customers worry about in terms of Signal and event synchronization?** + +Signals are cherry-picked during conflict resolution if there is replication lag and conflict. Workflows can theoretically revert multiple steps. + +Customers should decide whether to add logic to handle this or manually fix affected Workflows if they believe the risk is low. Other known limitations, such as those around causality, have been [documented](https://docs.temporal.io/cloud/multi-region#architecture). + +**Q: How much time does it take to reconcile data after an incident is resolved?** + +It depends on the distribution of Workflows. If evenly distributed, data can sync quickly. If concentrated in a single partition, it could take hours. Do not “fail back” your region (revert it to the original active region) until the data is fully reconciled and the other region has caught up. diff --git a/docs/production-deployment/cloud/high-availability/how-it-works.mdx b/docs/production-deployment/cloud/high-availability/how-it-works.mdx new file mode 100644 index 0000000000..ee1ced9551 --- /dev/null +++ b/docs/production-deployment/cloud/high-availability/how-it-works.mdx @@ -0,0 +1,119 @@ +--- +id: how-it-works +title: How it works +sidebar_label: How it works +slug: /cloud/high-availability/how-it-works +description: Temporal Cloud's High-Availability Namespaces offer automated failover, synchronized data replication, and high availability for workloads requiring disaster-tolerant deployment and 99.99% uptime. Use Global Namespace for self-hosted. 
+tags: + - Temporal Cloud + - Production + - High availability +keywords: + - availability + - explanation + - failover + - high-availability + - multi-region + - multi-region namespace + - namespaces + - temporal-cloud + - term +--- +import { RelatedReadContainer, RelatedReadItem } from '@site/src/components/related-read/RelatedRead'; + +:::tip Support, stability, and dependency info + +High-availability Namespaces are in [Public Preview](/evaluate/development-production-features/release-stages#public-preview) for Temporal Cloud. + +::: + +
**No audits, updates, intros, re-org**
+ +In traditional active/active replication, multiple nodes serve requests and accept writes simultaneously, ensuring strong synchronous data consistency. +In contrast, with a Temporal Cloud high-availability Namespace, only the active zone accepts requests and writes at any given time. +Workflow history events are written to the active zone first and then asynchronously replicated to the standby zone replica, ensuring that the replica remains in sync. + +
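The active-to-standby flow described above can be pictured with a small toy model. This is only a sketch, not Temporal Cloud's implementation: the `Zone` and `ReplicatedNamespace` classes, the zone names, and the use of an event count as the measure of lag are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Zone:
    """One zone of a hypothetical two-zone Namespace."""
    name: str
    history: list = field(default_factory=list)


@dataclass
class ReplicatedNamespace:
    """Toy model: writes go to the active zone, then replicate asynchronously."""
    active: Zone
    standby: Zone
    pending: deque = field(default_factory=deque)  # events not yet on the standby

    def write(self, event: str) -> None:
        # Only the active zone accepts writes.
        self.active.history.append(event)
        self.pending.append(event)

    def replicate(self, max_events: int) -> int:
        # Ship up to max_events to the standby, preserving write order.
        shipped = 0
        while self.pending and shipped < max_events:
            self.standby.history.append(self.pending.popleft())
            shipped += 1
        return shipped

    def replication_lag(self) -> int:
        # Assumption: lag is counted in events here, whereas the real
        # service reports replication lag as a time-based metric.
        return len(self.pending)


ns = ReplicatedNamespace(active=Zone("zone-a"), standby=Zone("zone-b"))
for event in ["WorkflowStarted", "ActivityScheduled", "ActivityCompleted"]:
    ns.write(event)

ns.replicate(max_events=2)
print(ns.replication_lag())
print(ns.standby.history)
```

After three writes and one replication pass of two events, one event is still pending on the active side. A backlog like this is what a forced failover would leave to conflict resolution.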
**Needs new images**
+ +| Before failover | After failover | +| :-------------------------------------------------------: | :-----------------------------------------------------: | +| ![Before failover](/img/multi-region/before-failover.png) | ![After failover](/img/multi-region/after-failover.png) | + +## Failovers {#failovers} + +A failover shifts Workflow Execution processing from an active Temporal Namespace region to a standby Temporal Namespace region during outages or other incidents. +Standby Namespace regions use replication to duplicate data and prevent data loss during failover. + +**What happens during the failover process?** + +Temporal Cloud initiates a Namespace failover when it detects an incident or outage that raises error rates or latency in the active region of a multi-region Namespace. +The failover shifts Workflow processing to a standby region that isn’t affected by the incident. +This lets existing Workflows continue and new Workflows start while the incident is fixed. +Once the incident is resolved, Temporal Cloud performs a "failback" by shifting Workflow Execution processing back to the original region. + + +:::info + +You can test the failover of your multi-region Namespace by manually [triggering a failover](/cloud/multi-region#triggering-failovers) using the UI page or the 'tcld' CLI utility. +In most scenarios, we recommend you let Temporal handle failovers for you. + +::: + +## Health Checks {#healthchecks} + +**How does Temporal detect failover conditions?** + +Temporal Cloud automates failovers by performing internal health checks. +This process monitors your request error rates, latencies, and any infrastructure issues that might cause service disruptions, such as request timeouts. +It automatically triggers failovers when these indicators exceed our allowed thresholds. + +### Replication lag {#replication-lag} + +Multi-region Namespaces use asynchronous replication between regions. 
+Workflow updates in the active region, along with associated history events, are transmitted to the standby region with a short delay. +This delay is called the replication lag. +Temporal Cloud strives to maintain a P95 replication delay of less than 1 minute. +In this context, P95 means 95% of requests are processed faster than this specified limit. + +Because of replication lag, a [forced failover](/cloud/multi-region#forced-failover) may roll back some Workflow progress. +Lag may also cause recently started Workflows to be temporarily unavailable until the active region recovers. +Temporal event versioning and [conflict resolution mechanisms](/cloud/multi-region#conflict-resolution) help guarantee that the Workflow Event History can be replayed. +Critical operations like Signals won't get lost. + +### Failover scenarios + +The Temporal Cloud failover mechanism supports several modes to execute Namespace failovers. +These modes include graceful failover ("handover"), forced failover, and a hybrid mode. +The hybrid mode is Temporal Cloud’s default Namespace behavior. + +#### Graceful failover (handover) {#graceful-failover} + +In this mode, replication tasks are fully processed and drained. +Temporal Cloud pauses traffic to the Namespace before the failover. +This prevents the loss of progress and avoids data conflicts. +The Namespace experiences a short period of unavailability, defaulting to 10 seconds. + +During this period, existing Workflows stop making progress. +Temporal Cloud returns a "Service unavailable" error, which is retried by SDKs. +State transitions will not happen and tasks are not dispatched. +User requests like start/signal Workflow will be rejected while operations are paused during handover. + +This mode favors _consistency_ over availability. + +#### Forced failover {#forced-failover} + +In this mode, a Namespace immediately activates in the standby region. 
+Events not replicated due to [replication lag](/cloud/multi-region#replication-lag) will undergo [conflict resolution](/cloud/multi-region#conflict-resolution) upon reaching the new active region. + +This mode prioritizes _availability_ over consistency. + +#### Hybrid failover mode {#hybrid-failover} + +While graceful failovers are preferred for consistency, they aren’t always practical. +Temporal Cloud’s hybrid failover mode (the default mode) limits an initial graceful failover attempt to 10 seconds or less. +During this period, existing Workflows stop making progress. +Temporal Cloud returns a "Service unavailable" error, which is retried by SDKs. +If the graceful failover doesn’t complete within that window, Temporal Cloud automatically switches to a forced failover. +This strategy balances consistency and availability requirements. + +See the sections on [triggering a failover](/cloud/multi-region#triggering-failovers), [Worker deployment](/cloud/multi-region#worker-deployment), and [routing](/cloud/multi-region#routing) for more information. diff --git a/docs/production-deployment/cloud/high-availability/index.mdx b/docs/production-deployment/cloud/high-availability/index.mdx new file mode 100644 index 0000000000..8c6b41c178 --- /dev/null +++ b/docs/production-deployment/cloud/high-availability/index.mdx @@ -0,0 +1,87 @@ +--- +id: index +title: High-availability Namespaces +sidebar_label: High-availability Namespaces +slug: /cloud/high-availability +description: Temporal Cloud's High-Availability Namespaces offer automated failover, synchronized data replication, and high-availability for workloads requiring disaster-tolerant deployment and 99.99% uptime. Use Global Namespace for self-hosted. 
+tags: + - Temporal Cloud + - Production + - High availability +keywords: + - availability + - explanation + - failover + - high-availability + - multi-region + - multi-region namespace + - namespaces + - temporal-cloud + - term +--- + +import { RelatedReadContainer, RelatedReadItem } from '@site/src/components/related-read/RelatedRead'; + +:::tip Support, stability, and dependency info + +High-availability Namespaces are in [Public Preview](/evaluate/development-production-features/release-stages#public-preview) for Temporal Cloud. + +::: + +Temporal Cloud's high-availability Namespaces provide disaster-tolerant deployment for workloads where availability is critical to your operations. +When you enable high availability, Temporal Cloud automatically synchronizes your data between a primary and a fallback Namespace, keeping them in sync. +Should an incident occur, Temporal will [failover](/glossary#failover) your Namespace. +This allows your Workflow Executions and Schedules to seamlessly shift from the active availability zone to the fallback availability zone. + +## Availability zones and replicas + +An availability zone is a physically isolated data center within a deployment region for a given cloud provider. +Regions consist of multiple availability zones, providing redundancy and fault tolerance. +In some cases, the fallback zone may be in the same region as the primary zone, or it may be in a different region altogether, depending on your deployment configuration. + +High-availability simplifies deployment, ensuring operational continuity and data integrity even during unexpected events. +Regional disruptions or other issues that affect the data centers within a specific availability zone may occur. +High-availability allows processing to shift from the affected zone to an already-synchronized fallback zone. + +This synchronized zone is called a "replica." 
+The process of duplicating all Workflow data ensures that your replica, which serves as the standby region, is always available and ready to take on the active role. + +In the event of network service or performance issues in the active zone, your replica is ready to take over. +When necessary, Temporal Cloud smoothly transitions control from the active to the standby zone using a process called "[failover](/glossary#failover)". + +## Why choose high-availability? {#high-availability-intro} + +For many organizations, ensuring high-availability is critical to maintaining business continuity. +Temporal Cloud's high-availability Namespace feature includes a 99.99% contractual Service Level Agreement ([SLA](https://docs.temporal.io/cloud/sla)). +It provides 99.99% availability and 99.99% guarantee against service errors. + +A high-availability Namespace (HAN) creates a single logical Namespace that operates across two physical zones: one active and one standby. +HANs streamline access for both zones to a unified Namespace endpoint. +As Workflows progress in the active zone, history events are asynchronously replicated to the standby zone, ensuring continuity and data integrity. + +In the event of an incident or outage in the active zone, Temporal Cloud will seamlessly failover to your standby zone. +Failovers allow existing Workflow Executions to continue running and new Workflow Executions to be started. +Once failover occurs, the roles of the active and standby zones switch. +The standby zone becomes active, and the previous active zone becomes the standby. +After the issue is resolved, the zone "fails back" from the replica to the original. + +## Opting into high-availability + +Should you be using high-availability Namespaces? It depends on your availability requirements: + +- High-availability Namespaces offer a 99.99% contractual SLA for workloads with strict high-availability needs. + HANs use two Namespaces in two deployment zones to support standby recovery. 
+ In the event of a zone failure, Temporal Cloud automatically fails over the HAN Namespace to the standby replica. +- Single-zone Namespaces include a 99.9% contractual Service Level Agreement ([SLA](/cloud/sla)). + In single-zone use, Temporal clients connect to a single Namespace in one deployment zone. + For many applications, this offers sufficient availability. + +Temporal Cloud provides 99.99% service availability for all Namespaces, both single-region and high-availability. + +| **Advantages of using a multi-region Namespace** | +| ------------------------------------------------------------------------------------ | +| No manual deployment or configuration required—just simple push-button operation. | +| Open Workflows continue in the standby zone with minimal interruption and data loss. | +| No changes needed for Worker or Workflow code during setup or failover. | +| 99.99% contractual SLA. | + diff --git a/docs/production-deployment/cloud/high-availability/operations.mdx b/docs/production-deployment/cloud/high-availability/operations.mdx new file mode 100644 index 0000000000..ace3b7fd2d --- /dev/null +++ b/docs/production-deployment/cloud/high-availability/operations.mdx @@ -0,0 +1,255 @@ +--- +id: operations +title: Operations +sidebar_label: Operations +slug: /cloud/high-availability/operations +description: Temporal Cloud's High-Availability Namespaces offer automated failover, synchronized data replication, and high availability for workloads requiring disaster-tolerant deployment and 99.99% uptime. Use Global Namespace for self-hosted. 
+tags: + - Temporal Cloud + - Production + - High availability +keywords: + - availability + - explanation + - failover + - high-availability + - multi-region + - multi-region namespace + - namespaces + - temporal-cloud + - term +--- +import { RelatedReadContainer, RelatedReadItem } from '@site/src/components/related-read/RelatedRead'; + +:::tip Support, stability, and dependency info + +High-availability Namespaces are in [Public Preview](/evaluate/development-production-features/release-stages#public-preview) for Temporal Cloud. + +::: + +
**No audits, updates, intros, re-org. Converted to markdown automatically from Google Doc rich text, so there are many errors**
+ +How do you trigger failovers and observe Workflow Executions? +This section provides how-to instructions for the following operations tasks: + +- [Triggering failovers](/cloud/multi-region#triggering-failovers) +- [Metrics](/cloud/multi-region#metrics-operations) +- [Monitoring and observability](/cloud/multi-region#observability) + +### Triggering failovers {#triggering-failovers} + +Failovers happen automatically in Temporal when a regional outage or disaster affects a multi-region Namespace. +You can also trigger a failover based on custom alerts or for testing purposes. +This section explains how to manually trigger a failover and what to expect afterward. + +Regular failover testing ensures your app can handle disruptions and continue running smoothly in production. +Whether responding to incident warnings or conducting tests, follow the steps in the next sections to move your active Namespace to its standby region and learn how to handle failovers effectively. + +For details on how Temporal detects conditions and triggers failovers automatically, see [Failovers](/cloud/multi-region/#failovers). + +:::warning Check Your Replication Lag + +Always check the [metric replication lag](/production-deployment/cloud/metrics/reference#temporal_cloud_v0_replication_lag_bucket) before initiating a failover. +A forced failover when there is a large replication lag has a higher likelihood of rolling back Workflow progress. + +::: + +**Performing manual failovers** + +You can trigger a failover manually using the Temporal Cloud Web UI or the `tcld` CLI, depending on your preference and setup. 
+The following table outlines the steps for each method: + +| Method | Instructions | +| ------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| **Temporal Cloud Web UI** | 1. Visit the [Namespace page](https://cloud.temporal.io/namespaces) on the Temporal Cloud Web UI.
2. Navigate to your Namespace details page and select the **Trigger a failover** option from the menu.
3. After confirming, the failover will be initiated. | +| **Temporal `tcld` CLI** | To manually trigger a failover, run the following command in your terminal:
tcld namespace failover \
    --namespace \.\ \
    --region \ | + +**Post-failover event information** + +After any failover, whether triggered by you or by Temporal, event information appears in both the [Temporal Cloud Web UI](https://cloud.temporal.io/namespaces) (on the Namespace detail page) and in your audit logs. +The audit log entry for Failover uses the `"operation": "FailoverNamespace"` event. +After failover, the Namespace is active in the new region. + +You don't need to monitor Temporal Cloud's failover response in real-time. +Whenever there is a failover event, users with the Account Owner and Global Admin roles automatically receive an alert email. + +**Failbacks** + +After Temporal-initiated failovers, Temporal Cloud shifts Workflow Execution processing back to the original region that was active before the incident (a "failback") once the incident is resolved. + +**Reasons to test failing over** + +Microservices and external dependencies will fail at some point. +Testing failovers ensures your app can handle these failures effectively. +Temporal recommends regular and periodic failover testing for mission-critical applications in production. +By testing in non-emergency conditions, you verify that your app continues to function even when parts of the infrastructure fail. + +:::tip Safety First + +If this is your first time performing a failover test, run it with a test-specific namespace and application. +This helps you gain operational experience before applying it to your production environment. +Practice runs help ensure the process runs smoothly during real incidents in production. + +::: + +Trigger testing can: + +- **Validate multi-region deployments**: + In multi-region setups, failover testing ensures your app can run from another region when the primary region experiences outages. + This maintains high availability in mission-critical deployments. + Manual testing confirms the failover mechanism works as expected, so your system handles regional outages or disasters effectively. 
+ +- **Assess replication lag**: + Monitoring [replication lag](#metrics-operations) between regions is crucial in multi-region setups. + Check the lag before initiating a failover to avoid rolling back Workflow progress. + Manual testing helps you practice this critical step and understand its impact. + When there's no real incident, the switch over (recovery) should happen almost instantly. + +- **Assess recovery time**: + Manual testing helps you measure actual recovery time. + You can check if it meets your expected Recovery Time Objective (RTO) of 20 minutes or less, as stated in the [Multi-region Namespace SLA](/cloud/multi-region#sla). + +- **Identify potential issues**: + Failover testing uncovers problems not visible during normal operation. + This includes issues like [backlogs and capacity planning](https://temporal.io/blog/workers-in-production#testing-failure-paths-2438) and how external dependencies behave during a failover event. + +- **Validate fault-oblivious programming**: + Temporal uses a "fault-oblivious programming" model, where your app doesn’t need to explicitly handle many types of failures. + Testing failovers ensures that this model works as expected in your app. + +- **Operational readiness**: + Regular testing familiarizes your team with the failover process, improving their ability to handle real incidents when they arise. + +Testing failovers regularly ensures your Temporal-based applications remain resilient and reliable, even when infrastructure fails. + +### Metrics {#metrics-operations} + +Replication lag refers to the transmission delay of Workflow updates and history events from the active region to the standby region. +A forced failover when there is a large replication lag has a higher likelihood of rolling back Workflow progress, so always check the metric replication lag before initiating a failover. 
+Temporal Cloud emits three replication lag-specific [metrics](/production-deployment/cloud/metrics/reference#temporal_cloud_v0_replication_lag_bucket). +The following samples demonstrate how you can use these metrics to explore replication lag. + +**P99 replication lag histogram** + +``` +histogram_quantile(0.99, sum(rate(temporal_cloud_v0_replication_lag_bucket[$__rate_interval])) by (temporal_namespace, le)) +``` + +**Average replication lag** + +``` +sum(rate(temporal_cloud_v0_replication_lag_sum[$__rate_interval])) by (temporal_namespace) +/ +sum(rate(temporal_cloud_v0_replication_lag_count[$__rate_interval])) by (temporal_namespace) +``` + +### Monitoring and observability {#observability} + +You can view and alert on key cloud metrics using the Web UI, the `tcld` CLI utility, and Temporal Cloud APIs. +For example, while adding a region to a Namespace, you can see the progress of Workflow replication. +Any errors that occur also surface in the Namespace Web UI. + +:::info + +You may notice that a multi-region Namespace shows twice (2x) the Action count in `temporal_cloud_v0_total_action_count`. +This doubling happens due to regional replication. + +::: + +### Auditing operational events {#auditing} + +Temporal Cloud provides several ways to audit events: + +- When Temporal triggers failovers, the audit log updates with details. + Look specifically for `"operation": "FailoverNamespace"` in the logs. +- You can set alerts for Temporal-initiated failover events. +- After a failover, you can check that the Namespace is active in the new region using the Temporal Cloud Web UI. + +### Worker Deployment {#worker-deployment} + +Enabling the multi-region Namespace does not require specific Worker configuration. +The process is invisible to the Workers. +When a Namespace fails over to the standby region, the DNS redirection orchestrated by Temporal ensures that your existing Workers continue to poll the Namespace without interruption. 
+More details are available in the [Routing](/cloud/multi-region#routing) section below. + +:::info + +- When a Namespace fails over to a standby region, Workers communicate cross-region. + +- In case of a complete regional outage, Workers in the original region may fail alongside the original Namespace. + To keep Workflows moving during this level of outage, deploy a second set of Workers to your standby region. + +::: + +### Routing {#routing} + +When using multi-region for a Namespace, the Namespace's DNS record `..` targets a regional DNS record in the format `.region.`. +In this format, `` is the currently active region for your Namespace. +Clients resolving the Namespace’s DNS record are directed to connect to the active region for that Namespace, thanks to the regional DNS record. + +During failover, Temporal Cloud changes the target of the Namespace DNS record from one region to another. +Namespace DNS records are configured with a 15-second TTL. +Any DNS cache should re-resolve the record within this delay. As a rule of thumb, DNS reconciliation takes no longer than twice (2x) the TTL. +Clients should converge to the newly targeted region within 30 seconds at most. + +#### PrivateLink routing {#privatelink-routing} + +:::important + +Some networking configuration is required for failover to be transparent to Clients and Workers when using PrivateLink. +This section describes how to configure routing for multi-region Namespaces for PrivateLink customers only. + +::: + +PrivateLink customers may need to change certain configurations for multi-region Namespace use. +Routing configuration depends on networking setup and use of PrivateLink. +You may need to: + +- override a DNS zone; and +- ensure network connectivity between the two regions. + +![Customer side solution example](/img/multi-region/private-link.png) + +When using PrivateLink, you connect to Temporal Cloud using IP addresses local to your network. 
+The `region.` zone is configured in the Temporal systems as an independent zone. +This allows you to override it to make sure traffic is routed internally for the regions in use. +You can check the Namespace's active region using the Namespace record CNAME, which is public. + +To set up the DNS override, you override specific regions to target the relevant IP addresses (for example, `aws-us-west-1.region.tmprl.cloud` to target `192.168.1.2`). +On AWS, you can do this with a private hosted zone in Route 53 for `region.`. +Link that private zone to the VPCs you use for Workers. +PrivateLink is not yet offered for GCP multi-region Namespaces. + +When your Workers connect to the Namespace, they first resolve the `..` record. +This targets `.region.` using a CNAME. Your private zone overrides that second DNS resolution, so traffic reaches the internal IP address you're using. + +Consider how you'll configure Workers to run in this scenario. +You might set Workers to run in both regions at all times. +Alternatively, you could establish connectivity between the regions to redirect Workers once failover occurs. + +The following table lists Temporal's available regions, PrivateLink endpoints, and DNS record overrides. +The `sa-east-1` region listed here is not yet available for use with multi-region Namespaces. 
+ +| Region | PrivateLink Service Name | DNS Record Override | +| ---------------- | -------------------------------------------------------------- | --------------------------------------- | +| `ap-northeast-1` | `com.amazonaws.vpce.ap-northeast-1.vpce-svc-08f34c33f9fb8a48a` | `aws-ap-northeast-1.region.tmprl.cloud` | +| `ap-northeast-2` | `com.amazonaws.vpce.ap-northeast-2.vpce-svc-08c4d5445a5aad308` | `aws-ap-northeast-2.region.tmprl.cloud` | +| `ap-south-1` | `com.amazonaws.vpce.ap-south-1.vpce-svc-0ad4f8ed56db15662` | `aws-ap-south-1.region.tmprl.cloud` | +| `ap-south-2` | `com.amazonaws.vpce.ap-south-2.vpce-svc-08bcf602b646c69c1` | `aws-ap-south-2.region.tmprl.cloud` | +| `ap-southeast-1` | `com.amazonaws.vpce.ap-southeast-1.vpce-svc-05c24096fa89b0ccd` | `aws-ap-southeast-1.region.tmprl.cloud` | +| `ap-southeast-2` | `com.amazonaws.vpce.ap-southeast-2.vpce-svc-0634f9628e3c15b08` | `aws-ap-southeast-2.region.tmprl.cloud` | +| `ca-central-1` | `com.amazonaws.vpce.ca-central-1.vpce-svc-080a781925d0b1d9d` | `aws-ca-central-1.region.tmprl.cloud` | +| `eu-central-1` | `com.amazonaws.vpce.eu-central-1.vpce-svc-073a419b36663a0f3` | `aws-eu-central-1.region.tmprl.cloud` | +| `eu-west-1` | `com.amazonaws.vpce.eu-west-1.vpce-svc-04388e89f3479b739` | `aws-eu-west-1.region.tmprl.cloud` | +| `eu-west-2` | `com.amazonaws.vpce.eu-west-2.vpce-svc-0ac7f9f07e7fb5695` | `aws-eu-west-2.region.tmprl.cloud` | +| `sa-east-1` | `com.amazonaws.vpce.sa-east-1.vpce-svc-0ca67a102f3ce525a` | `aws-sa-east-1.region.tmprl.cloud` | +| `us-east-1` | `com.amazonaws.vpce.us-east-1.vpce-svc-0822256b6575ea37f` | `aws-us-east-1.region.tmprl.cloud` | +| `us-east-2` | `com.amazonaws.vpce.us-east-2.vpce-svc-01b8dccfc6660d9d4` | `aws-us-east-2.region.tmprl.cloud` | +| `us-west-2` | `com.amazonaws.vpce.us-west-2.vpce-svc-0f44b3d7302816b94` | `aws-us-west-2.region.tmprl.cloud` | + +:::tip Learn more about multi-region Namespaces + +If you have more questions or feedback about this feature, reach out to 
the product team. + +::: + diff --git a/docs/production-deployment/cloud/high-availability/pricing.mdx b/docs/production-deployment/cloud/high-availability/pricing.mdx new file mode 100644 index 0000000000..07806adaa0 --- /dev/null +++ b/docs/production-deployment/cloud/high-availability/pricing.mdx @@ -0,0 +1,30 @@ +--- +id: pricing +title: Pricing (and Support?) +sidebar_label: Pricing (and Support?) +slug: /cloud/high-availability/pricing +description: Temporal Cloud's High-Availability Namespaces offer automated failover, synchronized data replication, and high availability for workloads requiring disaster-tolerant deployment and 99.99% uptime. Use Global Namespace for self-hosted. +tags: + - Temporal Cloud + - Production + - High availability +keywords: + - availability + - explanation + - failover + - high-availability + - multi-region + - multi-region namespace + - namespaces + - temporal-cloud + - term +--- +import { RelatedReadContainer, RelatedReadItem } from '@site/src/components/related-read/RelatedRead'; + +:::tip Support, stability, and dependency info + +High-availability Namespaces are in [Public Preview](/evaluate/development-production-features/release-stages#public-preview) for Temporal Cloud. + +::: + +
**No audits, updates, intros, re-org**
diff --git a/docs/production-deployment/cloud/high-availability/work-file.txt b/docs/production-deployment/cloud/high-availability/work-file.txt new file mode 100644 index 0000000000..a62dab0bb1 --- /dev/null +++ b/docs/production-deployment/cloud/high-availability/work-file.txt @@ -0,0 +1,203 @@ +
**STOPPED HERE. Considering whether this should be its own page**
+ +AFFECTED COVERAGE: + +- ALL LINKS ON NEW PAGES MUST BE AUDITED, REDONE, TESTED!!! + +THESE PAGES MENTION MULTI-REGION: +./docs/encyclopedia/nexus.mdx +./docs/encyclopedia/nexus-use-cases.mdx +./docs/evaluate/development-production-features/temporal-nexus.mdx +./docs/evaluate/development-production-features/index.mdx +./docs/evaluate/development-production-features/high-availability.mdx +./docs/evaluate/development-production-features/multi-region-namespace.mdx +./docs/evaluate/development-production-features/cloud-vs-self-hosted.mdx +./docs/evaluate/temporal-cloud/pricing.mdx +./docs/evaluate/temporal-cloud/sla.mdx +./docs/evaluate/temporal-cloud/legacy-pricing.mdx +./docs/evaluate/temporal-cloud/security.mdx +./docs/production-deployment/cloud/metrics/reference.mdx +./docs/production-deployment/cloud/gcp-export-gcs.mdx +./docs/production-deployment/cloud/audit-logging.mdx +./docs/production-deployment/cloud/multi-region.mdx +./docs/production-deployment/cloud/tcld/namespace.mdx +./docs/production-deployment/cloud/nexus/index.mdx +./docs/production-deployment/cloud/terraform-provider.mdx +./docs/production-deployment/cloud/service-health.mdx + + + + + + + + +## Multi-region Namespace SLA {#sla} + +**What guarantees does Temporal offer for multi-region Namespaces?** + +Multi-region Namespaces offer 99.99% availability, enforced by Temporal Cloud's [service error rates SLA](https://docs.temporal.io/cloud/sla). +Our system is designed to limit data loss after recovery when the incident triggering the failover is resolved. + +Our recovery point objective ([RPO](https://en.wikipedia.org/wiki/Disaster_recovery#Recovery_Point_Objective)) is near-zero. +There may be a short period of time during an incident or forced failover when some data is unavailable in the standby region. +Some Workflow History data won't arrive until network issues are fixed, enabling the History to finish replicating and the divergent History branches to reconcile. 
+ +Temporal Cloud proactively responds to incidents by triggering failovers. +Our recovery time objective ([RTO](https://en.wikipedia.org/wiki/Disaster_recovery#Recovery_Time_Objective)) is 20 minutes or less per incident. + +:::info + +During a disaster scenario in which the data on the hard drives in the active region cannot be recovered, the duration of data loss may be as high as the [replication lag](/cloud/multi-region#replication-lag) at the time of disaster. + +::: + +### Regional availability {#regional-availability} + +Multi-region Namespaces are available in all existing [Temporal Cloud regions](/cloud/service-availability#regions). + +:::tip + +Namespace pairing is currently limited to regions within the same continent. +South America is excluded as only one region is available. + +::: + +## Architecture {#architecture} + +**How do multi-region Namespaces work?** + +Multi-region Namespaces replicate Namespace metadata and Workflow Executions across connected regions. +This redundancy, plus the added failover capability, provides measurable stability when dealing with outages. + +A multi-region Namespace is normally active in a single region at any moment. +The passive region assumes a standby role. +An exception to this only occurs in the event of a network partition. +In this case, you may elect to promote a standby region to active status. +Caution: this action will temporarily result in both regions being active. +Once the network partition resolves and communication between the regions is restored, a conflict resolution algorithm determines which region continues as the active one. +This ensures only one region remains active. + +### Metadata replication {#metadata-replication} + +Updates to multi-region Namespace records automatically replicate across regions. +This metadata includes configurations such as retention periods, Search Attributes, and other settings. 
+Temporal Cloud ensures that all regions will eventually share a consistent and unified view of the Namespace metadata. + +:::info + +A Namespace failover, which changes the "active region" field of a Namespace record, is an update. +This update is replicated via the Namespace metadata mechanism. + +::: + +### Workflow Execution replication {#workflow-execution-replication} + +Temporal Cloud restricts certain Workflow operations to the active region: + +- You may only update Workflows in the active region. +- You may only dispatch Workflow Tasks and Activity Tasks from the active region. Forward progress in a Workflow Execution can therefore only be made in the active region. + +These limits mean that certain requests, such as Start Workflow and Signal Workflow, are processed by and limited to the active region. +Standby regions may receive API requests from Clients and Workers. +They automatically forward these requests to the active region for execution. + +Multi-region Namespaces provide an “all-active” experience for Temporal users. +This helps limit or eliminate downtime during Namespace failover. +There's a short time window from when a standby region becomes the active region to when Clients and Workers receive a DNS update. +During this time, requests are forwarded from the now passive (formerly active) region to the newly active (formerly standby) region. + +As Workflow Executions progress and are operated on, replication tasks created in the active region are dispatched to the standby region. +Processing these replication tasks ensures that the standby region undergoes the same state transitions as the active region. +This enables replicated tasks to synchronize and achieve the same state as the original tasks. + +Standby regions do not distribute Workflow or Activity Tasks. +Instead, they perform verification tasks to confirm that intended operations are executed so Workflows reach the desired state. 
+This mechanism ensures consistency and reliability in the replication process across Temporal regions. + +### Conflict Resolution {#conflict-resolution} + +Multi-region Namespaces rely on asynchronous event replication across Temporal regions. +In the event of a non-graceful failover, replication lag may result in a temporary setback in Workflow progress. + +Single-region Namespaces can be configured to provide _at-most-once_ semantics for Activity execution (when [Maximum Attempts](https://docs.temporal.io/retry-policies#maximum-attempts) is set to 1, which disables retries). +Multi-region Namespaces provide _at-least-once_ semantics for Activity execution. +Completed Activities _may_ be re-dispatched in a newly active region, leading to repeated executions. + +When a Workflow Execution is updated in a new region following a failover, events from the previously active region that arrive after the failover can't be directly applied. +At this point, Temporal Cloud has forked the Workflow History. + +After failover, Temporal Cloud creates a new History branch for the execution and begins its conflict resolution process. +The Temporal Service ensures that Workflow Histories remain valid and are replayable by SDKs post-failover or after conflict resolution. +This capability is crucial for Workflow Executions to continue their forward progress. + +:::warning + +Design your Activities to be idempotent, so that running an Activity more than once has the same effect as running it once. +Because completed Activities may be re-dispatched after a failover, a non-idempotent Activity could withdraw money twice or ship extra orders by mistake. +Idempotency keeps repeated executions from producing additional effects, which maintains data integrity and prevents costly errors. + +::: + +## Pricing {#pricing} + +**How does adding a multi-region Namespace affect my costs?** + +For pricing details, visit Temporal Cloud's [Pricing page](/cloud/pricing). 
+ +## Manage your multi-region Namespace {#management} + +**How do you create, enable, and manage your multi-region Namespace?** + +Temporal enables you to create and manage your multi-region Namespace using the Temporal Cloud Web UI, the `tcld` command-line utility, and the [Cloud Ops API](/ops). +Use these tools to create, upgrade, and discontinue your multi-region Namespace. + +- [Create a multi-region Namespace](/cloud/multi-region#create) +- [Upgrade a single-region Namespace to multi-region](/cloud/multi-region#add-regions) +- [Discontinuing multi-region service](/cloud/multi-region#discontinuing) + +:::warning + +Only users with the Account Owner or Global Admin [roles](/cloud/users#account-level-roles), and [Namespace Admins](https://docs.temporal.io/cloud/users#namespace-level-permissions), may create a multi-region Namespace (MRN), upgrade an existing Namespace to MRN, or trigger an MRN failover. + +::: + +:::info Support, stability, and dependency info + +Temporal Cloud’s Terraform provider does not support multi-region Namespaces. + +::: + +### Create a multi-region Namespace {#create} + +The following sections explain how to create a new multi-region Namespace (MRN). +MRNs provide multi-region deployment backed by Temporal's data replication and active-standby features. + +:::tip + +Region pairing is currently limited to regions within the same continent. + +::: + +#### Temporal Cloud Web UI + +During Namespace creation, specify the first region for the Namespace. +Then, select the “Add a region” option. +Adding a second region enables multi-region Namespace capabilities. + +#### Temporal `tcld` CLI + +Start with the following command to create the new multi-region Namespace: + +``` +tcld namespace create \ + --namespace . \ + --region +``` + +Include both regions by specifying the [region codes](/cloud/service-availability) as arguments to the `--region` flags. 
+Before pressing return, add your authentication credentials. For example, `--ca-certificate-file `. + + diff --git a/sidebars.js b/sidebars.js index 5ba006c643..b95990e66f 100644 --- a/sidebars.js +++ b/sidebars.js @@ -338,6 +338,22 @@ module.exports = { ], }, "production-deployment/cloud/multi-region", + { + type: "category", + label: "High-availability Namespaces", + collapsed: true, + link: { + type: "doc", + id: "production-deployment/cloud/high-availability/index", + }, + items: [ + "production-deployment/cloud/high-availability/enable", + "production-deployment/cloud/high-availability/how-it-works", + "production-deployment/cloud/high-availability/operations", + "production-deployment/cloud/high-availability/pricing", + "production-deployment/cloud/high-availability/faq", + ], + }, { type: "category", label: "Temporal Nexus",