Merge pull request #173 from seqeralabs/fusion-docs-overhaul

Platform Fusion docs overhaul

llewellyn-sl authored Sep 19, 2024
2 parents 571821a + 125f62c commit 093a5a8
Showing 7 changed files with 384 additions and 332 deletions.
2 changes: 1 addition & 1 deletion fusion_docs/index.mdx
@@ -21,7 +21,7 @@ Fusion requires a license for use beyond limited testing and validation within S

Traditionally, pipeline developers needed to bundle utilities in containers to copy data in and out of S3 storage.

With Fusion, there is nothing to install or manage. The Fusion thin client is automatically installed using Wave's container augmentation facilities, enabling containerized applications to read and write to S3 buckets as if they were local storage.

### No shared file system required

237 changes: 133 additions & 104 deletions platform_versioned_docs/version-24.1/compute-envs/aws-batch.mdx

Large diffs are not rendered by default.

83 changes: 61 additions & 22 deletions platform_versioned_docs/version-24.1/compute-envs/azure-batch.mdx
@@ -136,41 +136,59 @@ Create a Batch Forge Azure Batch compute environment:
1. Enter a descriptive name, e.g., _Azure Batch (east-us)_.
1. Select **Azure Batch** as the target platform.
1. Choose existing Azure credentials or add a new credential. If you are using existing credentials, skip to step 7.

:::tip
You can create multiple credentials in your Seqera environment.
:::

1. Enter a name for the credentials, e.g., _Azure Credentials_.
1. Add the **Batch account** and **Blob Storage** account names and access keys.
1. Select a **Region**, e.g., _eastus_.
1. In the **Pipeline work directory** field, enter the Azure blob container created previously, e.g., `az://towerrgstorage-container/work`.

:::note
When you specify a Blob Storage bucket as your work directory, this bucket is used for the Nextflow [cloud cache](https://www.nextflow.io/docs/latest/cache-and-resume.html#cache-stores) by default. You can specify an alternative cache location with the **Nextflow config file** field on the pipeline [launch](../launch/launchpad.mdx#launch-form) form.
:::

1. Select **Enable Wave containers** to facilitate access to private container repositories and provision containers in your pipelines using the Wave containers service. See [Wave containers][wave-docs] for more information.
1. Select **Enable Fusion v2** to allow access to your Azure Blob Storage data via the [Fusion v2][fusion-docs] virtual distributed file system. This speeds up most data operations. The Fusion v2 file system requires Wave containers to be enabled. See [Fusion file system](../supported_software/fusion/fusion.mdx) for configuration details.

<details>
<summary>Use Fusion v2</summary>

:::note
The compute recommendations below are based on internal benchmarking performed by Seqera. Benchmark runs of [nf-core/rnaseq](https://github.com/nf-core/rnaseq) used profile `test_full`, consisting of an input dataset with 16 FASTQ files and a total size of approximately 123.5 GB.
:::

Azure virtual machines include fast SSDs and require no additional storage configuration for Fusion. For optimal performance, use VMs with sufficient local storage to support Fusion's streaming data throughput.

1. Use Seqera Platform version 23.1 or later.
1. Use an Azure Blob storage container as the pipeline work directory.
1. Enable **Wave containers** and **Fusion v2**.
1. Select the **Batch Forge** config mode.
1. Specify suitable VM sizes under **VMs type**. A `Standard_E16d_v5` VM or larger is recommended for production use.

:::tip
We recommend selecting machine types with a local temp storage disk of at least 200 GB and a random read speed of 1000 MBps or more for large and long-lived production pipelines. To work with files larger than 100 GB, increase temp storage accordingly (400 GB or more).

The suffix `d` after the core number (e.g., `Standard_E16d_v5`) denotes a VM with a local temp disk. Select instances with Standard SSDs; Fusion does not support Azure network-attached storage (Premium SSD v2, Ultra Disk, etc.). Larger local storage increases Fusion's throughput and reduces the chance of overloading the machine. See [Sizes for virtual machines in Azure](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/overview) for more information.
:::

</details>

1. Set the **Config mode** to **Batch Forge**.
1. Enter the default **VMs type**, depending on your quota limits set previously. The default is `Standard_D4_v3`.
1. Enter the **VMs count**. If autoscaling is enabled (default), this is the maximum number of VMs you wish the pool to scale up to. If autoscaling is disabled, this is the fixed number of virtual machines in the pool.
1. Enable **Autoscale** to scale up and down automatically, based on the number of pipeline tasks. The number of VMs will vary from **0** to **VMs count**.
1. Enable **Dispose resources** for Seqera to automatically delete the Batch pool if the compute environment is deleted on the platform.
1. Select or create [**Container registry credentials**](../credentials/azure_registry_credentials.mdx) to authenticate a registry (used by the [Wave containers](https://www.nextflow.io/docs/latest/wave.html) service). It is recommended to use an [Azure Container registry](https://azure.microsoft.com/en-gb/products/container-registry) within the same region for maximum performance.
1. Apply [**Resource labels**](../resource-labels/overview.mdx). This will populate the **Metadata** fields of the Azure Batch pool.
1. Expand **Staging options** to include optional [pre- or post-run Bash scripts](../launch/advanced.mdx#pre-and-post-run-scripts) that execute before or after the Nextflow pipeline execution in your environment.
1. Specify custom **Environment variables** for the **Head job** and/or **Compute jobs**.
1. Configure any advanced options you need:

- Use **Jobs cleanup policy** to control how Nextflow process jobs are deleted on completion. Active jobs consume the quota of the Azure Batch account. By default, jobs are terminated by Nextflow and removed from the quota when all tasks complete successfully. If set to _Always_, all jobs are deleted by Nextflow after pipeline completion. If set to _Never_, jobs are never deleted. If set to _On success_, successful tasks are removed but failed tasks are left in place for debugging purposes.
- Use **Token duration** to control the lifetime of the SAS token generated by Nextflow. Set this to at least the expected duration of your longest pipeline run.

1. Select **Add** to finalize the compute environment setup. It will take a few seconds for all the resources to be created before the compute environment is ready to launch pipelines.

:::info
See [Launch pipelines](../launch/launchpad.mdx) to start executing workflows in your Azure Batch compute environment.
:::
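As noted above, when a Blob Storage container is used as the pipeline work directory it also serves as the Nextflow cloud cache location by default. A minimal sketch of an alternative cache location, entered in the **Nextflow config file** field on the launch form (the container name below is a hypothetical example, not one created in this guide):

```groovy
// Store Nextflow cloud cache metadata in a dedicated container
// instead of the pipeline work directory.
// 'my-cache-container' is illustrative; substitute a container
// in your own storage account.
cloudcache {
    enabled = true
    path = 'az://my-cache-container/.cache'
}
```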

## Manual

@@ -186,27 +204,46 @@ Your Seqera compute environment uses resources that you may be charged for in yo
1. Enter a descriptive name for this environment, e.g., _Azure Batch (east-us)_.
1. Select **Azure Batch** as the target platform.
1. Select your existing Azure credentials or select **+** to add new credentials. If you choose to use existing credentials, skip to step 7.

:::tip
You can create multiple credentials in your Seqera environment.
:::

1. Enter a name, e.g., _Azure Credentials_.
1. Add the **Batch account** and **Blob Storage** credentials you created previously.
1. Select a **Region**, e.g., _eastus (East US)_.
1. In the **Pipeline work directory** field, add the Azure blob container created previously, e.g., `az://towerrgstorage-container/work`.

:::note
When you specify a Blob Storage bucket as your work directory, this bucket is used for the Nextflow [cloud cache](https://www.nextflow.io/docs/latest/cache-and-resume.html#cache-stores) by default. You can specify an alternative cache location with the **Nextflow config file** field on the pipeline [launch](../launch/launchpad.mdx#launch-form) form.
:::
1. Select **Enable Wave containers** to facilitate access to private container repositories and provision containers in your pipelines using the Wave containers service. See [Wave containers][wave-docs] for more information.
1. Select **Enable Fusion v2** to allow access to your Azure Blob Storage data via the [Fusion v2][fusion-docs] virtual distributed file system. This speeds up most data operations. The Fusion v2 file system requires Wave containers to be enabled. See [Fusion file system](../supported_software/fusion/fusion.mdx) for configuration details.

<details>
<summary>Use Fusion v2</summary>

:::note
The compute recommendations below are based on internal benchmarking performed by Seqera. Benchmark runs of [nf-core/rnaseq](https://github.com/nf-core/rnaseq) used profile `test_full`, consisting of an input dataset with 16 FASTQ files and a total size of approximately 123.5 GB.
:::

Azure virtual machines include fast SSDs and require no additional storage configuration for Fusion. For optimal performance, use VMs with sufficient local storage to support Fusion's streaming data throughput.

1. Use Seqera Platform version 23.1 or later.
1. Use an Azure Blob storage container as the pipeline work directory.
1. Enable **Wave containers** and **Fusion v2**.
1. Specify suitable VM sizes under **VMs type**. A `Standard_E16d_v5` VM or larger is recommended for production use.

:::tip
We recommend selecting machine types with a local temp storage disk of at least 200 GB and a random read speed of 1000 MBps or more for large and long-lived production pipelines. To work with files larger than 100 GB, increase temp storage accordingly (400 GB or more).

The suffix `d` after the core number (e.g., `Standard_E16d_v5`) denotes a VM with a local temp disk. Select instances with Standard SSDs; Fusion does not support Azure network-attached storage (Premium SSD v2, Ultra Disk, etc.). Larger local storage increases Fusion's throughput and reduces the chance of overloading the machine. See [Sizes for virtual machines in Azure](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/overview) for more information.
:::

</details>

1. Set the **Config mode** to **Manual**.
1. Enter the **Compute Pool name**. This is the name of the Azure Batch pool you created previously in the Azure Batch account.

:::note
The default Azure Batch implementation uses a single pool for head and compute nodes. To use separate pools for head and compute nodes (e.g., to use low-priority VMs for compute jobs), see [this FAQ entry](../faqs.mdx#azure).
:::

1. Enter a user-assigned **Managed identity client ID**, if one is attached to your Azure Batch pool. See [Managed Identity](#managed-identity) below.
1. Apply [**Resource labels**](../resource-labels/overview.mdx). This will populate the **Metadata** fields of the Azure Batch pool.
1. Expand **Staging options** to include:
@@ -218,21 +255,23 @@ Your Seqera compute environment uses resources that you may be charged for in yo
- Use **Token duration** to control the lifetime of the SAS token generated by Nextflow. Set this to at least the expected duration of your longest pipeline run.
1. Select **Add** to complete the compute environment setup. The creation of resources will take a few seconds, after which you can launch pipelines.

:::info
See [Launch pipelines](../launch/launchpad.mdx) to start executing workflows in your Azure Batch compute environment.
:::
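The **Token duration** advanced option corresponds to the SAS token lifetime in Nextflow's Azure configuration. As a hedged sketch, the equivalent setting in a Nextflow config file looks like this (the 48-hour value is an illustrative assumption; size it to your longest run):

```groovy
// Request a SAS token valid for 48 hours; it must outlive
// the longest pipeline run launched in this environment.
azure {
    storage {
        tokenDuration = '48h'
    }
}
```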

### Managed identity

Nextflow can authenticate to Azure services using a managed identity. This method offers enhanced security compared to access keys, but must run on Azure infrastructure.

When you use a manually configured compute environment with a managed identity attached to the Azure Batch Pool, Nextflow can use this managed identity for authentication. However, Platform still needs to use access keys to submit the initial task to Azure Batch to run Nextflow, which will then proceed with the managed identity for subsequent authentication.

1. In Azure, create a user-assigned managed identity. See [Manage user-assigned managed identities](https://learn.microsoft.com/en-us/entra/identity/managed-identities-azure-resources/how-manage-user-assigned-managed-identities) for detailed steps. After creation, record the Client ID of the managed identity.
1. The user-assigned managed identity must have the necessary access roles for Nextflow. See [Required role assignments](https://www.nextflow.io/docs/latest/azure.html#required-role-assignments) for more information.
1. Associate the user-assigned managed identity with the Azure Batch Pool. See [Set up managed identity in your batch pool](https://learn.microsoft.com/en-us/troubleshoot/azure/hpc/batch/use-managed-identities-azure-batch-account-pool#set-up-managed-identity-in-your-batch-pool) for more information.
1. When you set up the Platform compute environment, select the Azure Batch pool by name and enter the managed identity client ID in the specified field as instructed above.

When you submit a pipeline to this compute environment, Nextflow will authenticate using the managed identity associated with the Azure Batch node it runs on, rather than relying on access keys.
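For pipelines launched outside Platform, the same behavior can be sketched in a Nextflow config file, per Nextflow's Azure configuration options; the client ID below is a placeholder, not a real identity:

```groovy
// Authenticate to Azure services with the user-assigned managed
// identity attached to the Batch pool nodes.
// The client ID is a placeholder; use the value recorded when
// the identity was created in Azure.
azure {
    managedIdentity {
        clientId = '00000000-0000-0000-0000-000000000000'
    }
}
```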


[az-data-residency]: https://azure.microsoft.com/en-gb/explore/global-infrastructure/data-residency/#select-geography
[az-batch-quotas]: https://docs.microsoft.com/en-us/azure/batch/batch-quota-limit#view-batch-quotas
[az-vm-sizes]: https://learn.microsoft.com/en-us/azure/virtual-machines/sizes
@@ -246,4 +285,4 @@ When you submit a pipeline to this compute environment, Nextflow will authentica
[az-create-storage]: https://portal.azure.com/#create/Microsoft.StorageAccount-ARM

[wave-docs]: https://docs.seqera.io/wave
[fusion-docs]: https://docs.seqera.io/fusion